Update data-tidy.Rmd (#1024)
* Update data-tidy.Rmd

  Fixed a few typos and propose some grammar corrections.

* Update data-tidy.Rmd

Co-authored-by: Mine Cetinkaya-Rundel <cetinkaya.mine@gmail.com>
parent ede8ec5c1b
commit b6809a8a9c
|
@ -20,7 +20,7 @@ In this chapter, you'll first learn the definition of tidy data and see it appli
|
|||
Then we'll dive into the main tool you'll use for tidying data: pivoting.
|
||||
Pivoting allows you to change the form of your data, without changing any of the values.
|
||||
We'll finish up with a discussion of usefully untidy data, and how you can create it if needed.
|
||||
If you particularly enjoy this chapter and learn more about the underlying theory, you can learn more about the history and theoretical underpinnings in the [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software.
|
||||
If you particularly enjoy this chapter and want to learn more about the underlying theory, you can read about its history and theoretical underpinnings in the [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software.
|
||||
|
||||
### Prerequisites
|
||||
|
||||
|
@ -40,7 +40,7 @@ From this chapter on, we'll suppress the loading message from `library(tidyverse
|
|||
|
||||
You can represent the same underlying data in multiple ways.
|
||||
The example below shows the same data organised in four different ways.
|
||||
Each dataset shows the same values of four variables *country*, *year*, *population*, and *cases*, but each dataset organizes the values in a different way.
|
||||
Each dataset shows the same values of four variables: *country*, *year*, *population*, and *cases*, but each dataset organizes the values in a different way.
|
||||
|
||||
<!-- TODO redraw as tables -->
|
||||
|
||||
|
@ -70,11 +70,11 @@ Figure \@ref(fig:tidy-structure) shows the rules visually.
|
|||
#| echo: FALSE
|
||||
#| out.width: NULL
|
||||
#| fig.cap: >
|
||||
#| Following three rules makes a dataset tidy: variables are columns,
|
||||
#| The following three rules make a dataset tidy: variables are columns,
|
||||
#| observations are rows, and values are cells.
|
||||
#| fig.alt: >
|
||||
#| Three panels, each representing a tidy data frame. The first panel
|
||||
#| shows that each variable is column. The second panel shows that each
|
||||
#| shows that each variable is a column. The second panel shows that each
|
||||
#| observation is a row. The third panel shows that each value is
|
||||
#| a cell.
|
||||
|
||||
|
@ -130,7 +130,7 @@ ggplot(table1, aes(year, cases)) +
|
|||
|
||||
1. Using prose, describe how the variables and observations are organised in each of the sample tables.
|
||||
|
||||
2. Sketch out the process you'd use to the `rate` for `table2`, and `table4a` + `table4b`.
|
||||
2. Sketch out the process you'd use to calculate the `rate` for `table2` and `table4a` + `table4b`.
|
||||
You will need to perform four operations:
|
||||
|
||||
a. Extract the number of TB cases per country per year.
|
||||
|
@ -174,14 +174,14 @@ billboard
|
|||
```
|
||||
|
||||
In this dataset, each observation is a song.
|
||||
The first three columns, `artist`, `track`, and `date.entered`, are variables that describe the song.
|
||||
The first three columns (`artist`, `track` and `date.entered`) are variables that describe the song.
|
||||
Then we have 76 columns (`wk1`-`wk76`) that describe the rank of the song in each week.
|
||||
Here the column names are one variable (the `week`) and the cell values are another (the `rank`).
|
||||
Here, the column names are one variable (the `week`) and the cell values are another (the `rank`).
|
||||
|
||||
To tidy this data we'll use `pivot_longer()`.
|
||||
To tidy this data, we'll use `pivot_longer()`.
|
||||
After the data, there are three key arguments:
|
||||
|
||||
- `cols` specifies which which columns need to be pivoted, i.e. which columns aren't variables. This argument uses the same syntax as `select()` so here we could use `!c(artist, track, date.entered)` or `starts_with("wk")`
|
||||
- `cols` specifies which columns need to be pivoted, i.e. which columns aren't variables. This argument uses the same syntax as `select()` so here we could use `!c(artist, track, date.entered)` or `starts_with("wk")`.
|
||||
- `names_to` names the variable stored in the column names, here `"week"`.
|
||||
- `values_to` names the variable stored in the cell values, here `"rank"`.
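
As a minimal sketch of how these three arguments fit together, assuming the `billboard` dataset that ships with tidyr (the name `billboard_longer` is only illustrative):

```r
library(tidyr)

# Pivot every column whose name starts with "wk":
# the old column names go into `week`, the old cell values into `rank`.
billboard_longer <- billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )

billboard_longer
```

The columns that were not pivoted (`artist`, `track`, `date.entered`) are kept and repeated, so the result has one row per song per week column.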
|
||||
|
||||
|
@ -199,7 +199,7 @@ billboard |>
|
|||
What happens if a song is in the top 100 for less than 76 weeks?
|
||||
Take 2 Pac's "Baby Don't Cry", for example.
|
||||
The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values.
|
||||
These `NA`s don't really represent unknown observations; they're forced to exist by the structure of the dataset[^data-tidy-1], so we can ask `pivot_longer` to get rid of the by setting `values_drop_na = TRUE`:
|
||||
These `NA`s don't really represent unknown observations; they're forced to exist by the structure of the dataset[^data-tidy-1], so we can ask `pivot_longer` to get rid of them by setting `values_drop_na = TRUE`:
|
||||
|
||||
[^data-tidy-1]: We'll come back to this idea in Chapter \@ref(missing-values).
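
A hedged sketch of that call, again assuming the `billboard` dataset from tidyr; `billboard_no_na` is a name chosen here for illustration:

```r
library(tidyr)

# Same pivot as before, but rows whose `rank` would be NA are dropped.
billboard_no_na <- billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

# Check whether any missing ranks survive the pivot.
anyNA(billboard_no_na$rank)
```

Because only the structurally forced `NA`s are removed, the result has fewer rows than a pivot without `values_drop_na`, and each song contributes as many rows as it has recorded weeks.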
|
||||
|
||||
|
@ -239,7 +239,7 @@ The code is shown below and the result is Figure \@ref(fig:billboard-ranks).
|
|||
```{r}
|
||||
#| label: billboard-ranks
|
||||
#| fig.cap: >
|
||||
#| A line plot showing the how the rank of a song changes over time.
|
||||
#| A line plot showing how the rank of a song changes over time.
|
||||
#| fig.alt: >
|
||||
#| A line plot with week on the x-axis and rank on the y-axis, where
|
||||
#| each line represents a song. Most songs appear to start at a high rank,
|
||||
|
@ -255,7 +255,7 @@ billboard_tidy |>
|
|||
|
||||
### How does pivoting work?
|
||||
|
||||
Now that you've seen what pivoting can do for you, it's worth taking a little time to gain some intuition for it does to the data.
|
||||
Now that you've seen what pivoting can do for you, it's worth taking a little time to gain some intuition about what it does to the data.
|
||||
Let's start with a very simple dataset to make it easier to see what's happening:
|
||||
|
||||
```{r}
|
||||
|
@ -288,7 +288,7 @@ Columns that are already variables need to be repeated, once for each column in
|
|||
#| echo: FALSE
|
||||
#| out.width: NULL
|
||||
#| fig.alt: >
|
||||
#| A diagram showing showing how `pivot_longer()` transforms a simple
|
||||
#| A diagram showing how `pivot_longer()` transforms a simple
|
||||
#| dataset, using color to highlight how the values in the `var` column
|
||||
#| ("A", "B", "C") are each repeated twice in the output because there are
|
||||
#| two columns being pivoted ("col1" and "col2").
|
||||
|
@ -307,9 +307,9 @@ They need to be repeated once for each row in the original dataset.
|
|||
#| echo: false
|
||||
#| out.width: NULL
|
||||
#| fig.alt: >
|
||||
#| A diagram showing showing how `pivot_longer()` transforms a simple
|
||||
#| A diagram showing how `pivot_longer()` transforms a simple
|
||||
#| data set, using color to highlight how column names ("col1" and
|
||||
#| "col2") become the values in a new name `var` column. They are repeated
|
||||
#| "col2") become the values in a new `var` column. They are repeated
|
||||
#| three times because there were three rows in the input.
|
||||
#| fig.cap: >
|
||||
#| The column names of pivoted columns become a new column.
|
||||
|
@ -317,8 +317,8 @@ They need to be repeated once for each row in the original dataset.
|
|||
knitr::include_graphics("diagrams/tidy-data/column-names.png", dpi = 270)
|
||||
```
|
||||
|
||||
The cell values also become values in a new variable, with name given by `values_to`.
|
||||
The are unwound row by row.
|
||||
The cell values also become values in a new variable, with a name given by `values_to`.
|
||||
They are unwound row by row.
|
||||
Figure \@ref(fig:pivot-values) illustrates the process.
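
The three behaviours described in this section can be checked on a small dataset. This is a sketch only: the tibble below is an assumption modelled on the figure descriptions (a `var` column with "A", "B", "C" and two columns holding the numbers 1 to 6), and the output column names `name` and `value` are illustrative choices.

```r
library(tidyr)
library(tibble)

# Assumed toy data matching the figure descriptions.
df <- tribble(
  ~var, ~col1, ~col2,
   "A",     1,     2,
   "B",     3,     4,
   "C",     5,     6
)

df_long <- df |>
  pivot_longer(
    cols = col1:col2,
    names_to = "name",
    values_to = "value"
  )

df_long$var    # existing variable: repeated once per pivoted column
df_long$name   # old column names: repeated once per input row
df_long$value  # cell values: unwound row by row
```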
|
||||
|
||||
```{r}
|
||||
|
@ -326,13 +326,13 @@ Figure \@ref(fig:pivot-values) illustrates the process.
|
|||
#| echo: false
|
||||
#| out.width: NULL
|
||||
#| fig.alt: >
|
||||
#| A diagram showing showing how `pivot_longer()` transforms data,
|
||||
#| A diagram showing how `pivot_longer()` transforms data,
|
||||
#| using color to highlight how the cell values (the numbers 1 to 6)
|
||||
#| become value in a new `value` column. They are unwound row-by-row,
|
||||
#| so the originals rows (1,2), then (3,4), then (5,6), become a column
|
||||
#| become the values in a new `value` column. They are unwound row-by-row,
|
||||
#| so the original rows (1,2), then (3,4), then (5,6), become a column
|
||||
#| running from 1 to 6.
|
||||
#| fig.cap: >
|
||||
#| The number of values are preserved (not repeated), but unwound
|
||||
#| The number of values is preserved (not repeated), but unwound
|
||||
#| row-by-row.
|
||||
|
||||
knitr::include_graphics("diagrams/tidy-data/cell-values.png", dpi = 270)
|
||||
|
@ -401,8 +401,8 @@ household
|
|||
```
|
||||
|
||||
This dataset contains data about five families, with the names and dates of birth of up to two children.
|
||||
The new challenge in this dataset is that the column names contain the name of two variables (`dob`, `name)` and the values of another (`child,` with values 1 and 2).
|
||||
To solve this problem we again we need to supply a vector to `names_to` but this time we use the special `".value"` sentinel.
|
||||
The new challenge in this dataset is that the column names contain the names of two variables (`dob`, `name`) and the values of another (`child`, with values 1 and 2).
|
||||
To solve this problem we again need to supply a vector to `names_to` but this time we use the special `".value"` sentinel.
|
||||
This overrides the usual `values_to` argument to use the first component of the pivoted column name as a variable name in the output.
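
To make the `".value"` sentinel concrete, here is a small sketch on hypothetical data shaped like the description above. The `families` tibble, its contents, and the use of `names_sep = "_"` to split the column names are assumptions made for illustration, not the chapter's own code.

```r
library(tidyr)
library(tibble)

# Hypothetical data: one row per family, up to two children.
families <- tribble(
  ~family, ~dob_child1,  ~dob_child2,  ~name_child1, ~name_child2,
        1, "2001-03-04", "2003-07-19", "Ana",        "Ben",
        2, "1999-10-30", NA,           "Cal",        NA
)

families_long <- families |>
  pivot_longer(
    cols = !family,
    # The part before "_" becomes an output column name (`dob`, `name`);
    # the part after "_" becomes a value in the `child` column.
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  )

families_long
```

Note that no `values_to` is supplied: the output value columns take their names from the first component of each pivoted column name.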
|
||||
|
||||
```{r}
|
||||
|
@ -476,7 +476,7 @@ cms_patient_experience |>
|
|||
|
||||
The output doesn't look quite right; we still seem to have multiple rows for each organization.
|
||||
That's because, by default, `pivot_wider()` will attempt to preserve all the existing columns including `measure_title` which has six distinct observations for each organization.
|
||||
To fix this problem we need to tell `pivot_wider()` which columns identify each row; in this case that's the variables starting with `org`:
|
||||
To fix this problem we need to tell `pivot_wider()` which columns identify each row; in this case those are the variables starting with `org`:
|
||||
|
||||
```{r}
|
||||
cms_patient_experience |>
|
||||
|
@ -491,7 +491,7 @@ This gives us the output that we're looking for.
|
|||
|
||||
### How does `pivot_wider()` work?
|
||||
|
||||
To understand how `pivot_wider()` works, lets again start with a very simple dataset:
|
||||
To understand how `pivot_wider()` works, let's again start with a very simple dataset:
|
||||
|
||||
```{r}
|
||||
df <- tribble(
|
||||
|
@ -516,7 +516,7 @@ df |>
|
|||
|
||||
The connection between the position of the row in the input and the cell in the output is weaker than in `pivot_longer()` because the rows and columns in the output are primarily determined by the values of variables, not their locations.
|
||||
|
||||
To begin the process `pivot_wider()` needs to first figure what will go in the rows and columns.
|
||||
To begin the process `pivot_wider()` needs to first figure out what will go in the rows and columns.
|
||||
Finding the column names is easy: it's just the values of `name`.
|
||||
|
||||
```{r}
|
||||
|
@ -524,7 +524,7 @@ df |>
|
|||
distinct(name)
|
||||
```
|
||||
|
||||
By default, the rows in the output are formed by all variables that aren't going into the names or the values.
|
||||
By default, the rows in the output are formed by all the variables that aren't going into the names or values.
|
||||
These are called the `id_cols`.
|
||||
|
||||
```{r}
|
||||
|
@ -546,7 +546,7 @@ It then fills in all the missing values using the data in the input.
|
|||
In this case, not every cell in the output has a corresponding value in the input as there's no entry for id "B" and name "z", so that cell remains missing.
|
||||
We'll come back to this idea that `pivot_wider()` can "make" missing values in Chapter \@ref(missing-values).
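
A minimal sketch of this behaviour, assuming a small input in which id "B" has no row for name "z" (the data values are illustrative):

```r
library(tidyr)
library(tibble)

# Assumed toy input: there is no row with id "B" and name "z".
df <- tribble(
  ~id, ~name, ~value,
  "A", "x",   1,
  "B", "y",   2,
  "B", "x",   3,
  "A", "y",   4,
  "A", "z",   5
)

df_wide <- df |>
  pivot_wider(
    id_cols = id,
    names_from = name,
    values_from = value
  )

# One row per id and one column per distinct name; the cell for
# id "B" / name "z" has no source row, so it is left as NA.
df_wide
```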
|
||||
|
||||
You might also wonder what happens if there are are multiple rows in the input that correspond to one cell in the output.
|
||||
You might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output.
|
||||
The example below has two rows that correspond to id "A" and name "x":
|
||||
|
||||
```{r}
|
||||
|
@ -569,7 +569,7 @@ df |> pivot_wider(
|
|||
)
|
||||
```
|
||||
|
||||
Since you don't know how to work this sort of data yet, you'll want to follow the hint in the warning to figure out where the problem is:
|
||||
Since you don't know how to work with this sort of data yet, you'll want to follow the hint in the warning to figure out where the problem is:
|
||||
|
||||
```{r}
|
||||
df %>%
|
||||
|
@ -609,20 +609,20 @@ diamonds |>
|
|||
)
|
||||
```
|
||||
|
||||
This display also makes it easily compare in two directions, horizontally and vertically, much like `facet_grid()`.
|
||||
This display also makes it easy to compare in two directions, horizontally and vertically, much like `facet_grid()`.
|
||||
|
||||
`pivot_wider()` can be great for quickly sketching out a table.
|
||||
But for real presentation tables, we highly suggest learning a package like [gt](https://gt.rstudio.com).
|
||||
gt is similar ggplot2 in that it provides an extremely powerful grammar for laying out tables.
|
||||
gt is similar to ggplot2 in that it provides an extremely powerful grammar for laying out tables.
|
||||
It takes some work to learn but the payoff is the ability to make just about any table you can imagine.
|
||||
|
||||
### Multivariate statistics
|
||||
|
||||
Most classical multivariate statistical methods (like dimension reduction and clustering) require your data in matrix form, where each column is time point, or a location, or gene, or species, but definitely not a variable.
|
||||
Sometimes these formats have substantial performance or space advantages or sometimes they're just necessary to get closer to the underlying matrix mathematics.
|
||||
Most classical multivariate statistical methods (like dimension reduction and clustering) require your data in matrix form, where each column is a time point, or a location, or a gene, or a species, but definitely not a variable.
|
||||
Sometimes these formats have substantial performance or space advantages, or sometimes they're just necessary to get closer to the underlying matrix mathematics.
|
||||
|
||||
We're not going to cover these statistical methods here, but it is useful to know how to get your data into the form that they need.
|
||||
For example, lets imagine you wanted to cluster the gapminder data to find countries that had similar progression of `gdpPercap` over time.
|
||||
For example, let's imagine you wanted to cluster the gapminder data to find countries that had similar progression of `gdpPercap` over time.
|
||||
To do this, we need one row for each country and one column for each year:
|
||||
|
||||
```{r}
|
||||
|
@ -639,7 +639,7 @@ col_year
|
|||
```
|
||||
|
||||
`pivot_wider()` produces a tibble where each row is labelled by the `country` variable.
|
||||
But most classic statistical algorithm don't want the identifier as an explicit variable; they want as a **row name**.
|
||||
But most classic statistical algorithms don't want the identifier as an explicit variable; they want it as a **row name**.
|
||||
We can turn the `country` variable into row names with `column_to_rownames()`:
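
As a rough sketch of that conversion on made-up data (the `wide` tibble and its values are hypothetical stand-ins for the pivoted gapminder data), using `column_to_rownames()` from the tibble package:

```r
library(tibble)

# Hypothetical wide data: one row per country, one column per year.
wide <- tibble(
  country = c("A", "B"),
  `1952`  = c(100, 200),
  `1957`  = c(110, 190)
)

# Move the identifier out of the columns and into the row names.
wide_rn <- wide |>
  column_to_rownames("country")

rownames(wide_rn)
names(wide_rn)
```

The result is a plain data frame whose remaining columns are all numeric, which is the shape that matrix-oriented functions typically expect.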
|
||||
|
||||
```{r}
|
||||
|
@ -693,7 +693,7 @@ This is partly a reflection of our definition of tidy data, where we said tidy d
|
|||
It's totally fine to be pragmatic and to say a variable is whatever makes your analysis easiest.
|
||||
|
||||
So if you're stuck figuring out how to do some computation, maybe it's time to switch up the organisation of your data.
|
||||
For computations involving a fixed number of values (like computing differences or ratios), it's usually easier if the data is columns; for those with a variable of number of values (like sums or means) it's usually easier in rows.
|
||||
For computations involving a fixed number of values (like computing differences or ratios), it's usually easier if the data is in columns; for those with a variable number of values (like sums or means) it's usually easier in rows.
|
||||
Don't be afraid to untidy, transform, and re-tidy if needed.
|
||||
|
||||
Let's explore this idea by looking at `cms_patient_care`, which has a similar structure to `cms_patient_experience`:
|
||||
|
@ -702,11 +702,11 @@ Let's explore this idea by looking at `cms_patient_care`, which has a similar st
|
|||
cms_patient_care
|
||||
```
|
||||
|
||||
It contains information about 9 measures (`beliefs_addressed`, `composite_process`, `dyspena_treatment`, ...) on 14 different facilities (identified by `ccn` with name given by `facility_name`).
|
||||
It contains information about 9 measures (`beliefs_addressed`, `composite_process`, `dyspena_treatment`, ...) on 14 different facilities (identified by `ccn` with a name given by `facility_name`).
|
||||
Compared to `cms_patient_experience`, however, each measurement is recorded in two rows with a `score`, the percentage of patients who answered yes to the survey question, and a denominator, the number of patients that the question applies to.
|
||||
Depending on what you want to do next you might finding any of the following three structures useful:
|
||||
Depending on what you want to do next, you may find any of the following three structures useful:
|
||||
|
||||
- If you want to compute the number of patients that answered yes to the to question, you might pivot `type` into the columns:
|
||||
- If you want to compute the number of patients that answered yes to the question, you may pivot `type` into the columns:
|
||||
|
||||
```{r}
|
||||
cms_patient_care |>
|
||||
|
@ -719,7 +719,7 @@ Depending on what you want to do next you might finding any of the following thr
|
|||
)
|
||||
```
|
||||
|
||||
- If you wanted to display the distribution of each metric, you might keep it as is so you could facet by `measure_abbr`.
|
||||
- If you want to display the distribution of each metric, you may keep it as is so you could facet by `measure_abbr`.
|
||||
|
||||
```{r}
|
||||
#| fig.show: "hide"
|
||||
|
@ -731,7 +731,7 @@ Depending on what you want to do next you might finding any of the following thr
|
|||
facet_wrap(vars(measure_abbr))
|
||||
```
|
||||
|
||||
- If you wanted to explore how different metrics are related, you might put the measure name names in the columns so you could compare them in scatterplots.
|
||||
- If you want to explore how different metrics are related, you may put the measure names in the columns so you could compare them in scatterplots.
|
||||
|
||||
```{r}
|
||||
#| fig.show: "hide"
|
||||
|
|