parent
8101753650
commit
0393514ae8
20
tidy.Rmd
|
@@ -8,7 +8,7 @@ title: Tidy data
|
|||
> "Tidy datasets are all alike but every messy dataset is messy in its
|
||||
> own way." – Hadley Wickham
|
||||
|
||||
Data science, at its heart, is a computer programming exercise. Data scientists use computers to store, transform, visualize, and model their data. Each computer program will expect your data to be organized in a predetermined way, which may vary from program to program. To be an effective data scientist, you will need to be able to reorganize your data to match the format required by your program.
|
||||
Data science, at its heart, is a computer programming exercise. Data scientists use computers to store, transform, visualize, and model their data. Each computer program will expect your data to be organized in a predetermined way, which may vary from program to program. To be an effective data scientist, you will need to be able to reorganize your data to match the format required by your program.
|
||||
|
||||
In this chapter, you will learn the best way to organize your data for R, a task that we call data tidying. Tidying your data will save you hours of time and make your data much easier to visualize, transform, and model with R.
|
||||
|
||||
|
@@ -79,7 +79,7 @@ At this point, you might think that tidy data is so obvious that it is trivial.
|
|||
|
||||
Tidy data works well with R because it takes advantage of R's traits as a vectorized programming language. Data structures in R are organized around vectors, and R's functions are optimized to work with vectors. Tidy data takes advantage of both of these traits.
|
||||
|
||||
Tidy data arranges values so that the relationships between variables in a data set will parallel the relationship between vectors in R's storage objects. R stores tabular data as a data frame, a list of atomic vectors arranged to look like a table. Each column in the table is an atomic vector in the list. In tidy data, each variable in the data set is assigned to its own column, i.e., its own vector in the data frame.
|
||||
Tidy data arranges values so that the relationships between variables in a data set will parallel the relationship between vectors in R's storage objects. R stores tabular data as a data frame, a list of atomic vectors arranged to look like a table. Each column in the table is an atomic vector in the list. In tidy data, each variable in the data set is assigned to its own column, i.e., its own vector in the data frame.
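
To see the "list of atomic vectors" structure for yourself, you can inspect a small data frame. This is a minimal sketch; `df` and its values are hypothetical and chosen only for illustration, not taken from the chapter's tables.

```r
# Hypothetical toy data frame; the names and values are illustrative only.
df <- data.frame(
  country = c("A", "B"),
  cases   = c(10L, 20L),
  stringsAsFactors = FALSE
)

# A data frame is a list under the hood.
is_list <- is.list(df)

# Each column of that list is an atomic vector.
is_atomic_col <- is.atomic(df$cases)

# List syntax ($ or [[ ]]) extracts the same column vector.
same_column <- identical(df$cases, df[["cases"]])
```

All three checks should come back `TRUE`, which is what lets you treat each variable in a tidy data set as a single vector.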
|
||||
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/tidy-2.png")
|
||||
|
@@ -87,7 +87,7 @@ knitr::include_graphics("images/tidy-2.png")
|
|||
|
||||
*A data frame is a list of vectors that R displays as a table. When your data is tidy, the values of each variable fall in their own column vector.*
|
||||
|
||||
As a result, you can extract the all of the values of a variable in a tidy data set by extracting the column vector that contains the variable. You can do this easily with R's list syntax, i.e.
|
||||
As a result, you can extract all the values of a variable in a tidy data set by extracting the column vector that contains the variable. You can do this easily with R's list syntax, i.e.
|
||||
|
||||
```{r}
|
||||
table1$cases
|
||||
|
@@ -191,7 +191,7 @@ After you collect your input, you can calculate the rate.
|
|||
|
||||
```{r eval = FALSE}
|
||||
# Data set four
|
||||
cases <- c(table4$`1999`, table4$`2000`, table4$`2001`)
|
||||
cases <- c(table4$`1999`, table4$`2000`, table4$`2001`)
|
||||
population <- c(table5$`1999`, table5$`2000`, table5$`2001`)
|
||||
cases / population * 10000
|
||||
```
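
For comparison, here is a rough sketch of the same rate calculation when the data is already tidy. The data frame `tidy_df` and its numbers are hypothetical; the point is only that each input is a single column, so R's vectorized arithmetic lines the values up for you.

```r
# Hypothetical tidy data: one row per observation, one column per variable.
tidy_df <- data.frame(
  country    = c("A", "B"),
  cases      = c(5, 20),
  population = c(1000, 4000),
  stringsAsFactors = FALSE
)

# Vectorized arithmetic pairs each case count with the population
# in the same row, so no manual collecting with c() is needed.
rate <- tidy_df$cases / tidy_df$population * 10000
```

Because both inputs come from the same rows of the same data frame, you do not have to check that the values were combined in a matching order.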
|
||||
|
@@ -214,7 +214,7 @@ The two most important functions in `tidyr` are `gather()` and `spread()`. Each
|
|||
|
||||
A key value pair is a simple way to record information. A pair contains two parts: a *key* that explains what the information describes, and a *value* that contains the actual information. So for example, this would be a key value pair:
|
||||
|
||||
Password: 0123456789
|
||||
Password: 0123456789
|
||||
|
||||
`0123456789` is the value, and it is associated with the key `Password`.
|
||||
|
||||
|
@@ -238,7 +238,7 @@ Data values form natural key value pairs. The value is the value of the pair and
|
|||
Cases: 80488
|
||||
Cases: 212258
|
||||
Cases: 213766
|
||||
|
||||
|
||||
However, the key value pairs would cease to be a useful data set because you no longer know which values belong to the same observation.
|
||||
|
||||
Every cell in a table of data contains one half of a key value pair, as does every column name. In tidy data, each cell will contain a value and each column name will contain a key, but this doesn't need to be the case for untidy data. Consider `table2`.
|
||||
|
@@ -247,7 +247,7 @@ Every cell in a table of data contains one half of a key value pair, as does eve
|
|||
table2
|
||||
```
|
||||
|
||||
In `table2`, the `key` column contains only keys (and not just because the column is labelled `key`). Conveniently, the `value` column contains the values associated with those keys.
|
||||
In `table2`, the `key` column contains only keys (and not just because the column is labeled `key`). Conveniently, the `value` column contains the values associated with those keys.
|
||||
|
||||
You can use the `spread()` function to tidy this layout.
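
As a minimal sketch of what that looks like, the example below spreads a small key:value layout. The data frame `long` and its values are hypothetical stand-ins for `table2`, and the code assumes the `tidyr` package is installed.

```r
library(tidyr)

# Hypothetical stand-in for a key:value layout like table2.
long <- data.frame(
  country = c("A", "A", "B", "B"),
  key     = c("cases", "population", "cases", "population"),
  value   = c(1, 100, 2, 200),
  stringsAsFactors = FALSE
)

# Each unique key becomes a column name; each value moves into
# the cell for its country and key.
wide <- spread(long, key, value)
```

Here `wide` should have one row per country, with `cases` and `population` as separate columns.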
|
||||
|
||||
|
@@ -269,7 +269,7 @@ knitr::include_graphics("images/tidy-8.png")
|
|||
|
||||
*`spread()` distributes a pair of key:value columns into a field of cells. The unique keys in the key column become the column names of the field of cells.*
|
||||
|
||||
You can see that `spread()` maintains each of the relationships expressed in the original data set. The output contains the four original variables, *country*, *year*, *population*, and *cases*, and the values of these variables are grouped according to the orginal observations. As a bonus, now the layout of these relationships is tidy.
|
||||
You can see that `spread()` maintains each of the relationships expressed in the original data set. The output contains the four original variables, *country*, *year*, *population*, and *cases*, and the values of these variables are grouped according to the original observations. As a bonus, now the layout of these relationships is tidy.
|
||||
|
||||
`spread()` takes three optional arguments in addition to `data`, `key`, and `value`:
|
||||
|
||||
|
@@ -367,7 +367,7 @@ You can also pass an integer or vector of integers to `sep`. `separate()` will i
|
|||
separate(table3, year, into = c("century", "year"), sep = 2)
|
||||
```
|
||||
|
||||
You can futher customize `separate()` with the `remove`, `convert`, and `extra` arguments:
|
||||
You can further customize `separate()` with the `remove`, `convert`, and `extra` arguments:
|
||||
|
||||
- **`remove`** - Set `remove = FALSE` to retain the column of values that were separated in the final data frame.
|
||||
- **`convert`** - By default, `separate()` will return new columns as character columns. Set `convert = TRUE` to convert new columns to double (numeric), integer, logical, complex, and factor columns with `type.convert()`.
|
||||
|
@@ -462,4 +462,4 @@ who <- spread(who, var, value)
|
|||
who
|
||||
```
|
||||
|
||||
The `who` data set is now tidy. It is far from sparkling (for example, it contains several redundant columns and many missing values), but it will now be much easier to work with in R.
|
||||
The `who` data set is now tidy. It is far from sparkling (for example, it contains several redundant columns and many missing values), but it will now be much easier to work with in R.
|
||||
|
|