Merge branch 'master' of github.com:hadley/r4ds
commit
a938e56449
Binary file not shown.
@ -42,7 +42,7 @@ partition(mtcars, 0.1)
|
|||
We'll generate 20 random test-training splits, and then create lists of test-training datasets:
|
||||
|
||||
```{r}
|
||||
partitions <- rerun(200, partition(mtcars, 0.25))
|
||||
partitions <- rerun(20, partition(mtcars, 0.25))
|
||||
|
||||
tst <- partitions %>% map(~mtcars[.x, , drop = FALSE])
|
||||
trn <- partitions %>% map(~mtcars[!.x, , drop = FALSE])
|
||||
|
|
|
@ -154,7 +154,7 @@ Like `mutate()`, the join functions add variables to the right, so if you have a
|
|||
|
||||
(When you're in RStudio, you can use `View()` to avoid this problem).
|
||||
|
||||
For example, imagine you want to add the full airline name to the `flights` data. You can combine the `airlines` and `carrier` data frames with `left_join()`:
|
||||
For example, imagine you want to add the full airline name to the `flights2` data. You can combine the `airlines` and `flights2` data frames with `left_join()`:
|
||||
|
||||
```{r}
|
||||
flights2 %>%
|
||||
|
|
|
@ -630,7 +630,7 @@ flights %>%
|
|||
summarise(mean = mean(dep_delay))
|
||||
```
|
||||
|
||||
We get a lot of missing values! That's because aggregation functions obey the usual rule of missing values: if there's any missing value in the input, the output will be a missing value. `x %>% f(y)` turns into `f(x, y)`ou'll learn more about aggregation functions in Section 5.7.4. Fortunately, all aggregation functions have an `na.rm` argument which removes the missing values prior to computation:
|
||||
We get a lot of missing values! That's because aggregation functions obey the usual rule of missing values: if there's any missing value in the input, the output will be a missing value. `x %>% f(y)` turns into `f(x, y)`. You'll learn more about aggregation functions in Section 5.7.4. Fortunately, all aggregation functions have an `na.rm` argument which removes the missing values prior to computation:
|
||||
|
||||
```{r}
|
||||
flights %>%
|
||||
|
@ -731,7 +731,7 @@ batters %>% arrange(desc(ba))
|
|||
|
||||
You can find a good explanation of this problem at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
|
||||
|
||||
### Other summary functions.
|
||||
### Other summary functions
|
||||
|
||||
Just using means, counts, and sums can get you a long way, but R provides many other useful summary functions:
|
||||
|
||||
|
|
167
variation.Rmd
|
@ -22,7 +22,7 @@ There is no formal way to do Exploratory Data Analysis because you must be free
|
|||
|
||||
> "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."---John Tukey
|
||||
|
||||
Your goal during EDA is to develop a complete understanding of your data set and the information that it contains. The easiest way to do this is to ask questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your data set and helps you decide which graphs or models to make.
|
||||
Your goal during EDA is to develop a complete understanding of your data set and the information that it contains. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your data set and helps you decide which graphs or models to make.
|
||||
|
||||
During EDA, the _quantity_ of questions that you ask matters more than the quality of the questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your data set. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought provoking questions---if you follow up each question with a new question based on what you find.
|
||||
|
||||
|
@ -40,7 +40,11 @@ The rest of this chapter will look at these two questions. I'll explain what var
|
|||
|
||||
* An _observation_ is a set of measurements that you make under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I'll sometimes refer to an observation as a data point.
|
||||
|
||||
* _tabular data_ is a set of values, each associated with a variable and an observation. Tabular data is _tidy_ if each value is placed in its own "cell", each variable in its own column, and each observation in its own row. Throughout the rest of this chapter I will use the word data to mean tidy tabular data.
|
||||
* _tabular data_ is a set of values, each associated with a variable and an observation. Tabular data is _tidy_ if each value is placed in its own "cell", each variable in its own column, and each observation in its own row.
|
||||
|
||||
Throughout the rest of this chapter I will use the word data to mean tidy tabular data. Other types of "unstructured" data exist, but you would not use the methods in this chapter on unstructured data until you first reorganized the unstructured data into tidy tabular data.
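
As a small illustration of the tidy layout described above, the sketch below rearranges a made-up table by hand. The `wide` and `tidy` data frames, their column names, and their values are hypothetical and exist only for this example.

```r
# Hypothetical untidy table: the year variable is spread across column names.
wide <- data.frame(
  country = c("A", "B"),
  `1999` = c(10, 20),
  `2000` = c(11, 22),
  check.names = FALSE
)

# Tidy version: one column per variable, one row per observation.
tidy <- data.frame(
  country = rep(wide$country, times = 2),
  year = rep(c(1999, 2000), each = nrow(wide)),
  cases = c(wide[["1999"]], wide[["2000"]])
)

tidy
```

Each value now sits in its own cell, and each row describes one country in one year.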
|
||||
|
||||
|
||||
|
||||
## Variation
|
||||
|
||||
|
@ -48,6 +52,10 @@ The rest of this chapter will look at these two questions. I'll explain what var
|
|||
|
||||
**Variation** is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice---and precisely enough---you will get two different results. This is true even if you measure quantities that should be constant, like the speed of light (below). Each of your measurements will include a small amount of error that varies from measurement to measurement.
|
||||
|
||||
```{r include = FALSE}
|
||||
options(digits = 7)
|
||||
```
|
||||
|
||||
```{r, variation, echo = FALSE}
|
||||
|
||||
mat <- as.data.frame(matrix(morley$Speed + 299000, ncol = 10))
|
||||
|
@ -55,9 +63,13 @@ mat <- as.data.frame(matrix(morley$Speed + 299000, ncol = 10))
|
|||
knitr::kable(mat, caption = "*The speed of light is a universal constant, but variation due to measurement error obscures its value. In 1879, Albert Michelson measured the speed of light 100 times and observed 30 different values (in km/sec).*", col.names = rep("", ncol(mat)))
|
||||
```
|
||||
|
||||
```{r include = FALSE}
|
||||
options(digits = 3)
|
||||
```
|
||||
|
||||
Discrete and categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
|
||||
|
||||
Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualize the distribution of the values that you have observed for the variable.
|
||||
Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualize the distribution of the values that you observe for the variable.
|
||||
|
||||
### Visualizing distributions
|
||||
|
||||
|
@ -68,7 +80,7 @@ ggplot(data = diamonds) +
|
|||
geom_bar(mapping = aes(x = cut))
|
||||
```
|
||||
|
||||
The height of the bars displays how many observations occurred at each x value. You can compute these values manually with `table()`.
|
||||
The height of the bars displays how many observations occurred with each x value. You can compute these values manually with `table()`.
|
||||
|
||||
```{r}
|
||||
table(diamonds$cut)
|
||||
|
@ -108,7 +120,7 @@ ggplot(data = diamonds) +
|
|||
|
||||
### Asking questions about variation
|
||||
|
||||
Now that you can visualize variation, what should you look for in your plots? and what type of follow up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow up questions for each type of information. The key to asking good follow up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
|
||||
Now that you can visualize variation, what should you look for in your plots? And what type of follow up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow up questions for each type of information. The key to asking good follow up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
|
||||
|
||||
* *Typical values*
|
||||
|
||||
|
@ -207,11 +219,11 @@ table(diamonds$color, diamonds$cut)
|
|||
|
||||
### Visualizing one categorical variable and one continuous variable
|
||||
|
||||
Visualize covariation between continuous and categorical variables with boxplots. A **boxplot** is a type of visual shorthand for a distribution that is popular among statisticians. Each boxplot consists of:
|
||||
Visualize covariation between continuous and categorical variables with boxplots. A **boxplot** is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:
|
||||
|
||||
* A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the Inter-Quartile Range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not it is symmetric about the median or skewed to one side.
|
||||
* A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the Inter-Quartile Range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.
|
||||
|
||||
* Points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points have a strong chance of being outliers, so they are included in the boxplot for inspection.
|
||||
* Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points have a strong chance of being outliers, so they are included in the boxplot for inspection.
|
||||
|
||||
* A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
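
The pieces of a boxplot listed above can be computed by hand. The following is a minimal sketch in base R on a made-up vector `x` (hypothetical values, not taken from any data set in this chapter). It uses `quantile()` with its default settings, which may differ slightly from the quartile definition used by a particular plotting function.

```r
# Hypothetical sample; any numeric vector works.
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 30)

# The box: 25th percentile, median, and 75th percentile.
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Points more than 1.5 * IQR beyond either edge of the box are flagged.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
is_outlier <- x < lower_fence | x > upper_fence
outliers <- x[is_outlier]

# The whiskers reach the farthest points that are not flagged.
whiskers <- range(x[!is_outlier])

list(box = q, iqr = iqr, outliers = outliers, whiskers = whiskers)
```

In this sample only the value 30 falls outside the fences, so the upper whisker stops at 10.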
|
||||
|
||||
|
@ -221,19 +233,19 @@ knitr::include_graphics("images/EDA-boxplot.pdf")
|
|||
|
||||
The chart below shows several boxplots, one for each level of the class variable in the mpg data set. Each boxplot represents the distribution of hwy values for points with the given level of class. To make boxplots, use `geom_boxplot()`.
|
||||
|
||||
```{r}
|
||||
```{r fig.height = 3}
|
||||
ggplot(data = mpg) +
|
||||
geom_boxplot(aes(x = class, y = hwy))
|
||||
```
|
||||
|
||||
Covariation will appear as a systematic change in the medians or IQRs of the boxplots. To make the trend easier to see, wrap the $x$ variable with `reorder()`. The code below reorders the x axis based on the median hwy value of each group.
|
||||
|
||||
```{r}
|
||||
```{r fig.height = 3}
|
||||
ggplot(data = mpg) +
|
||||
geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy))
|
||||
```
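
To see what `reorder()` does to the grouping variable, here is a minimal sketch on a toy data frame. The `toy` data and its values are hypothetical stand-ins for the `class` and `hwy` columns of `mpg`.

```r
# Hypothetical toy data standing in for mpg's class and hwy columns.
toy <- data.frame(
  class = c("a", "a", "b", "b", "c", "c"),
  hwy   = c(30, 32, 10, 12, 20, 22)
)

# Default factor levels are alphabetical.
levels(factor(toy$class))

# reorder() sorts the levels by the per-group median of hwy (ascending).
by_median <- reorder(toy$class, toy$hwy, FUN = median)
levels(by_median)
```

The plot draws the boxes in level order, so reordering the levels is what moves the boxes along the x axis.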
|
||||
|
||||
`geom_boxplot()` works best when the categorical variable is on the x axis. You can invert the axes with `coord_flip()`.
|
||||
`geom_boxplot()` works best when the categorical variable is mapped to the x aesthetic. You can invert the axes with `coord_flip()`.
|
||||
|
||||
```{r}
|
||||
ggplot(data = mpg) +
|
||||
|
@ -272,7 +284,7 @@ ggplot(data = diamonds) +
|
|||
geom_hex(aes(x = carat, y = price))
|
||||
```
|
||||
|
||||
`geom_density2d()` fits a 2D kernel density estimation to the data and then uses contour lines to highlight areas of high density. It is very useful for overlaying on raw data when your data set is not big.
|
||||
`geom_density2d()` fits a 2D kernel density estimation to the data and then uses contour lines to highlight areas of high density. It is very useful for overlaying on raw data even when your data set is not big.
|
||||
|
||||
|
||||
```{r}
|
||||
|
@ -287,7 +299,7 @@ When you explore plots of covariation, look for the following sources of insight
|
|||
|
||||
* *Outliers*
|
||||
|
||||
Two dimensional plots can reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of $x$ and $y$ values, which makes the points outliers even though their $x$ and $y$ values appear normal when examined separately.
|
||||
Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of $x$ and $y$ values, which makes the points outliers even though their $x$ and $y$ values appear normal when examined separately.
|
||||
|
||||
```{r echo = FALSE}
|
||||
ggplot(data = diamonds) +
|
||||
|
@ -298,9 +310,9 @@ When you explore plots of covariation, look for the following sources of insight
|
|||
|
||||
* *Clusters*
|
||||
|
||||
Plots of covariation can also reveal clusters that may not be visible in plots of variation. For example, the two dimensional pattern in the plot below reveals two clusters, a separation that is not visible in the distribution of either variable by itself, as verified with a rug geom.
|
||||
Two dimensional plots can also reveal clusters that may not be visible in one dimensional plots. For example, the two dimensional pattern in the plot below reveals two clusters, a separation that is not visible in the distribution of either variable by itself, as verified with a rug geom.
|
||||
|
||||
```{r echo = FALSE}
|
||||
```{r echo = FALSE, fig.height = 3}
|
||||
ggplot(data = iris, aes(y = Sepal.Length, x = Sepal.Width)) +
|
||||
geom_jitter() +
|
||||
geom_density2d(h= c(1,1)) +
|
||||
|
@ -321,7 +333,7 @@ When you explore plots of covariation, look for the following sources of insight
|
|||
|
||||
+ Does the relationship change if you look at individual subgroups of the data?
|
||||
|
||||
A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also reveals the two clusters that we noticed above.
|
||||
A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters that we noticed above.
|
||||
|
||||
```{r echo = FALSE, message = FALSE, fig.height = 2}
|
||||
ggplot(faithful) + geom_point(aes(x = eruptions, y = waiting))
|
||||
|
@ -337,7 +349,8 @@ You can extend scatterplots into three dimensions with the plotly, rgl, rglwidge
|
|||
|
||||
```{r eval = FALSE}
|
||||
library(plotly)
|
||||
plot_ly(data = iris, x = Sepal.Length, y = Sepal.Width, z = Petal.Width, color = Species, type = "scatter3d", mode = "markers")
|
||||
plot_ly(data = iris, x = Sepal.Length, y = Sepal.Width, z = Petal.Width,
|
||||
color = Species, type = "scatter3d", mode = "markers")
|
||||
```
|
||||
|
||||
```{r, echo = FALSE}
|
||||
|
@ -358,20 +371,29 @@ This doesn't mean that you should ignore complex interactions in your data. You
|
|||
|
||||
## Clusters
|
||||
|
||||
A clustering algorithm computes the distances between data points in n-dimensional space. It then uses an algorithm to group points into clusters based on how near or far they are from each other. Base R provides two easy to use clustering algrotihms: heirarchical clustering and k means clustering.
|
||||
Cluster algorithms are automated tools that seek out clusters in n-dimensional space for you. Base R provides two easy to use clustering algorithms: hierarchical clustering and k means clustering.
|
||||
|
||||
### Heirarchical clustering
|
||||
### Hierarchical clustering
|
||||
|
||||
The heirarchical clustering algorithm groups points together based on how near they are to each other in n-dimensional space. The algorithm proceeds in stages until every point has been grouped into a single cluster, the data set. You can visualize the results of the algorithm as a dendrogram, and you can use the dendrogram to divide your data into any number of clusters.
|
||||
Hierarchical clustering uses a simple algorithm to locate groups of points that are near each other in n-dimensional space:
|
||||
|
||||
1. Identify the two points that are closest to each other
|
||||
2. Combine these points into a cluster
|
||||
3. Treat the new cluster as a point
|
||||
4. Repeat until all of the points are grouped into a single cluster
|
||||
|
||||
You can visualize the results of the algorithm as a dendrogram, and you can use the dendrogram to divide your data into any number of clusters. The figure below demonstrates how the algorithm would proceed in a two dimensional data set.
|
||||
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/EDA-plotly.png")
|
||||
knitr::include_graphics("images/EDA-hclust.pdf")
|
||||
```
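
The merge steps listed above can be inspected directly on the object that `hclust()` returns. This is a minimal sketch on five made-up points (the `pts` matrix is hypothetical, chosen so that no two distances tie). It relies on the `merge` and `height` components of the result and on `cutree()`.

```r
# Hypothetical points in two dimensions, chosen so no two distances tie.
pts <- matrix(c(0, 0,
                0, 1,
                5, 5,
                5, 6.5,
                12, 0), ncol = 2, byrow = TRUE)

hc <- hclust(dist(pts), method = "complete")

# Each row of $merge is one step: negative numbers are single points,
# positive numbers refer to clusters formed in earlier rows.
hc$merge

# $height is the distance at which each merge happened.
hc$height

# Cut the tree into two clusters.
cutree(hc, k = 2)
```

Points 1 and 2 are the closest pair, so they merge first. The last merge joins the isolated fifth point to everything else.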
|
||||
|
||||
You can only apply heirarchical clustering to numeric data, so begin by selecting the numeric columns from your data set. Then apply the `dist()` function to the data and pass the results to `hcust()`. `dist()` computes the distances between your points in the n dimensional space defined by your numeric vectors. `hclust()` performs the clustering algorithm.
|
||||
To use hierarchical clustering in R, begin by selecting the numeric columns from your data; you can only apply hierarchical clustering to numeric data. Then apply the `dist()` function to the data and pass the results to `hclust()`. `dist()` computes the distances between your points in the n dimensional space defined by your numeric vectors. `hclust()` performs the clustering algorithm.
|
||||
|
||||
```{r}
|
||||
iris_hclust <- iris %>%
|
||||
small_iris <- sample_n(iris, 50)
|
||||
|
||||
iris_hclust <- small_iris %>%
|
||||
select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
|
||||
dist() %>%
|
||||
hclust(method = "complete")
|
||||
|
@ -380,18 +402,10 @@ iris_hclust <- iris %>%
|
|||
Use `plot()` to visualize the results as a dendrogram. Each observation in the data set will appear at the bottom of the dendrogram labeled by its rowname. You can use the labels argument to set the labels to something more informative.
|
||||
|
||||
```{r fig.height = 4}
|
||||
plot(iris_hclust, labels = iris$Species)
|
||||
plot(iris_hclust, labels = small_iris$Species)
|
||||
```
|
||||
|
||||
To see how near two data points are to each other, trace the paths of the data points up through the tree until they intersect. The y value of the intersection displays how far apart the points are in n-dimensional space. Points that are close to each other will intersect at a small y value, points that are far from each other will intersect at a large y value. Groups of points that are near each other will look like "leaves" that all grow on the same "branch."
|
||||
|
||||
The ordering of the x axis in the dendrogram is somewhat arbitrary (think of the tree as a mobile, each horizontal branch can spin around meaninglessly).
|
||||
|
||||
Use the `identify()` function to easily see easily which group of points are downstream from a branch. `identify()` will plot the dendrogram in an interactive format. When you click on a branch, R will draw a red rectangle around the downstream points. Clikc escape when you are finished.
|
||||
|
||||
```{r eval = FALSE}
|
||||
identify(iris_hclust)
|
||||
```
|
||||
To see how near two data points are to each other, trace the paths of the data points up through the tree until they intersect. The y value of the intersection displays how far apart the points are in n-dimensional space. Points that are close to each other will intersect at a small y value; points that are far from each other will intersect at a large y value. Groups of points that are near each other will look like "leaves" that all grow on the same "branch." The ordering of the x axis in the dendrogram is somewhat arbitrary (think of the tree as a mobile: each horizontal branch can spin around meaninglessly).
|
||||
|
||||
You can split your data into any number of clusters by drawing a horizontal line across the tree. Each vertical branch that the line crosses will represent a cluster that contains all of the points downstream from the branch. Move the line up the y axis to intersect fewer branches (and create fewer clusters); move the line down the y axis to intersect more branches (and create more clusters).
|
||||
|
||||
|
@ -400,15 +414,19 @@ You can split your data into any number of clusters by drawing a horizontal line
|
|||
```{r}
|
||||
(clusters <- cutree(iris_hclust, 3))
|
||||
|
||||
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
|
||||
geom_point(aes(color = factor(clusters), shape = Species))
|
||||
ggplot(small_iris, aes(x = Sepal.Width, y = Sepal.Length)) +
|
||||
geom_point(aes(color = factor(clusters)))
|
||||
```
|
||||
|
||||
You can modify the heirarchical clustering algorithm by setting the method argument of hclust to one of "complete", "single", "average", or "centroid". The method determines how to measure the distance between two clusters or a lone point and a cluster, a measurement that effects the outcome of the algorithm.
|
||||
You can modify the hierarchical clustering algorithm by setting the `method` argument of `hclust()` to one of "complete", "single", "average", or "centroid". The method determines how to measure the distance between two clusters or a lone point and a cluster, a measurement that affects the outcome of the algorithm.
|
||||
|
||||
* *complete* - Measures the greatest distance between any two points. Tends to create distinct clusters and subclusters.
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/EDA-linkage.pdf")
|
||||
```
|
||||
|
||||
* *single* - Measures the smallest distance between any two points in the clusters. Tends to add points one at a time to existing clusters, creating ambiguously defined clusters.
|
||||
* *complete* - Measures the greatest distance between any two points in the separate clusters. Tends to create distinct clusters and subclusters.
|
||||
|
||||
* *single* - Measures the smallest distance between any two points in the separate clusters. Tends to add points one at a time to existing clusters, creating ambiguously defined clusters.
|
||||
|
||||
* *average* - Measures the average distance between all combinations of points in the separate clusters. Tends to add points one at a time to existing clusters.
|
||||
|
||||
|
@ -416,41 +434,44 @@ You can modify the heirarchical clustering algorithm by setting the method argum
|
|||
|
||||
|
||||
```{r fig.height = 4}
|
||||
iris %>%
|
||||
small_iris %>%
|
||||
select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
|
||||
dist() %>%
|
||||
hclust(method = "single") %>%
|
||||
plot(labels = iris$Species)
|
||||
plot(labels = small_iris$Species)
|
||||
```
|
||||
|
||||
|
||||
### K means clustering
|
||||
|
||||
K means clustering provides a simulation based alternative to heirarchical clustering. It identifies the "best" way to group your data into a predefined number of clusters.
|
||||
K means clustering provides a simulation based alternative to hierarchical clustering. It identifies the "best" way to group your data into a pre-defined number of clusters. The figure below visualizes (in two dimensional space) the k means algorithm:
|
||||
|
||||
1. Randomly assign each data point to one of $k$ groups
|
||||
2. Compute the centroid of each group
|
||||
3. Reassign each point to the group whose centroid it is nearest to
|
||||
4. Repeat steps 2 and 3 until group memberships cease to change
|
||||
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/EDA-plotly.png")
|
||||
knitr::include_graphics("images/EDA-kmeans.pdf")
|
||||
```
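
The four steps above can be sketched in a few lines of base R. This is a hand-rolled illustration, not how `kmeans()` is implemented (by default `kmeans()` uses a more refined algorithm). The `pts` matrix, the seed, and the variable names are hypothetical.

```r
# Hypothetical data: two well separated groups of three points each.
pts <- matrix(c(0, 0,   0, 1,   1, 0,
                10, 10, 10, 11, 11, 10), ncol = 2, byrow = TRUE)
k <- 2

# Step 1: randomly assign each point to one of k groups
# (balanced here so that no group starts out empty).
set.seed(1)
groups <- sample(rep(seq_len(k), length.out = nrow(pts)))

repeat {
  # Step 2: compute the centroid of each group (one row per group).
  centroids <- t(sapply(seq_len(k), function(g) {
    colMeans(pts[groups == g, , drop = FALSE])
  }))

  # Step 3: reassign each point to the group with the nearest centroid.
  sq_dist <- sapply(seq_len(k), function(g) {
    rowSums(sweep(pts, 2, centroids[g, ])^2)
  })
  new_groups <- max.col(-sq_dist, ties.method = "first")

  # Step 4: stop once group memberships no longer change.
  if (identical(new_groups, groups)) break
  groups <- new_groups
}

groups
```

The group labels themselves are arbitrary. What matters is which points end up sharing a label.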
|
||||
|
||||
Use `kmeans()` to perform k means clustering with R. As with heirarchical clustering, you can only apply k means clustering to numerical data. Pass your numerical data to the `kmeans()` function, then set `center` to the number of clusters to search for ($k$) and `nstart` to the number of simulations to run. Since the results of k means clustering depend on the initial assignment of points to groups, which is random, R will run `nstart` k means simulations and then return the best results (as measured by the minimum sum of squared distances between each point and the centroid of the group it is assigned to).
|
||||
|
||||
Finally, set the maximum number of iterations to let each simulation run in case the simulation cannot quickly find a stable grouping.
|
||||
Use `kmeans()` to perform k means clustering with R. As with hierarchical clustering, you can only apply k means clustering to numerical data. Pass your numerical data to the `kmeans()` function, then set `centers` to the number of clusters to search for ($k$) and `nstart` to the number of simulations to run. Since the results of k means clustering depend on the initial assignment of points to groups, which is random, R will run `nstart` simulations and then return the best results (as measured by the minimum sum of squared distances between each point and the centroid of the group it is assigned to). Finally, set `iter.max` to the maximum number of iterations to let each simulation run in case the simulation cannot quickly find a stable grouping.
|
||||
|
||||
```{r}
|
||||
iris_kmeans <- iris %>%
|
||||
iris_kmeans <- small_iris %>%
|
||||
select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
|
||||
kmeans(centers = 3, nstart = 20, iter.max = 50)
|
||||
|
||||
iris_kmeans$cluster
|
||||
```
|
||||
|
||||
Unlike `hclust()` the k means algorithm does not porvide an intuitive visualize interface. Instead, `kmeans()` returns a kmeans class object. Subset the object with `$cluster` to access list of cluster assignments for your data set, like `cutree()`, e.g. `iris_kmeans$cluster`. You can visualize the results by mapping them to an aesthetic, or you can apply the results by passing them to dplyr's `group_by()` function.
|
||||
Unlike `hclust()`, the k means algorithm does not provide an intuitive visual interface. Instead, `kmeans()` returns a kmeans class object. Subset the object with `$cluster` to access a vector of cluster assignments for your data set, e.g. `iris_kmeans$cluster`. You can visualize the results by mapping them to an aesthetic, or you can apply the results by passing them to dplyr's `group_by()` function.
|
||||
|
||||
```{r}
|
||||
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
|
||||
geom_point(aes(color = factor(iris_kmeans$cluster), shape = Species))
|
||||
ggplot(small_iris, aes(x = Sepal.Width, y = Sepal.Length)) +
|
||||
geom_point(aes(color = factor(iris_kmeans$cluster)))
|
||||
|
||||
iris %>%
|
||||
small_iris %>%
|
||||
group_by(iris_kmeans$cluster) %>%
|
||||
summarise(n_obs = n(), avg_width = mean(Sepal.Width), avg_length = mean(Sepal.Length))
|
||||
```
|
||||
|
@ -458,7 +479,7 @@ iris %>%
|
|||
|
||||
### Asking questions about clustering
|
||||
|
||||
Both algorithms _will always_ return a set of clusters, whether your data appears clustered or not. As a result, you should always be skeptical about clustering algorithms. Ask yourself:
|
||||
Ask the same questions about clusters that you find with `hclust()` and `kmeans()` that you would ask about clusters that you find with a graph. Ask yourself:
|
||||
|
||||
* Do the clusters seem to identify real differences between your points? How can you tell?
|
||||
|
||||
|
@ -470,7 +491,7 @@ Both algorithms _will always_ return a set of clusters, whether your data appear
|
|||
|
||||
* How stable are the clusters if you re-run the algorithm?
|
||||
|
||||
Remember to use the results of clustering as a tool for exploration. They can be quite insightful, but there is no reason to treat them as a fact without doing further research.
|
||||
Keep in mind that both algorithms _will always_ return a set of clusters, whether your data appears clustered or not. As a result, you should always be skeptical about the results. They can be quite insightful, but there is no reason to treat them as a fact without doing further research.
|
||||
|
||||
## Models
|
||||
|
||||
|
@ -495,7 +516,7 @@ $$\hat{y} = 0.13 + 0.98 x$$
|
|||
|
||||
which is the equation of the blue model line in the graph above. Even if we did not have the graph, we could use the model coefficients in the equation above to determine that a positive relationship exists between $y$ and $x$ such that a one unit increase in $x$ is associated with an approximately one unit increase in $y$. We could use a model statistic, such as adjusted $r^{2}$, to determine that the relationship is very strong (here adjusted $r^{2} = 0.99$).
|
||||
|
||||
Finally, we could spot outliers in our data by examining the residuals of the model, which are the distances between the actual $y$ values of our data points and the $y$ values that the model would predict for the data points. Observations that are outliers in n-dimensional space will have a residual that is an outlier in one dimensional space. You can find these outliers by plotting a histogram of the residuals or by visualizing the residuals against any variable in a two dimenisonal plot.
|
||||
Finally, we could spot outliers in our data by examining the residuals of the model, which are the distances between the actual $y$ values of our data points and the $y$ values that the model would predict for the data points. Observations that are outliers in n-dimensional space will have residuals that are outliers in one dimensional space. You can find these outliers by plotting a histogram of the residuals or by visualizing the residuals against any variable in a two dimensional plot.
|
||||
|
||||
```{r echo = FALSE, fig.width = 3, fig.show='hold'}
|
||||
diamond_mod <- lm(y ~ x, data = diamonds3)
|
||||
|
@ -508,15 +529,17 @@ ggplot(resids) +
|
|||
geom_point(aes(x = x, y = .resid))
|
||||
```
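
The same three ingredients (coefficients, a model statistic, and residuals) can be pulled out of any fitted linear model. The sketch below uses made-up data rather than the diamonds data: `x`, `y`, and the planted outlier at position 15 are hypothetical.

```r
# Hypothetical data: an exact line with one planted outlier at position 15.
x <- 1:20
y <- 2 + 3 * x
y[15] <- y[15] + 10

fit <- lm(y ~ x)

coef(fit)                    # intercept and slope
summary(fit)$adj.r.squared   # adjusted r-squared

# Residuals: observed y minus the y predicted by the model.
res <- residuals(fit)
which.max(abs(res))          # position of the largest residual
```

The largest residual sits at the planted observation, which is the kind of point a histogram or scatterplot of the residuals would make stand out.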
|
||||
|
||||
You can examine coefficients, model statistics, and residuals of a model fit to an n-dimensional relationship in the same way, without visualizing the raw data in n-dimensional space.
|
||||
You can easily use these techniques with n dimensional relationships that cannot be visualized easily. When you spot a pattern or outlier, ask yourself the same questions that you would ask when you spot a pattern or outlier in a graph. Then visualize the residuals of your model in various ways. If a pattern exists in the residuals, it suggests that your model does not accurately describe the pattern in your data.
|
||||
|
||||
I'll postpone teaching you how to fit and interpret models with R until Part 4. Altough models are something simple, a description of your data set, they are tied into the logic of statistical inference: if a model describes your data accurately _and_ your data is similar to the world at large, then your model should describe the world at large accurately. This chain of reasoning provides a basis for using models to make inferences and predictions. You'll be able to do more with models if you learn a few more skills before you begin to model data.
|
||||
I'll postpone teaching you how to fit and interpret models with R until Part 4. Although models are something simple, descriptions of patterns, they are tied into the logic of statistical inference: if a model describes your data accurately _and_ your data is similar to the world at large, then your model should describe the world at large accurately. This chain of reasoning provides a basis for using models to make inferences and predictions. As a result, there is more to learn about models than we can examine here.
|
||||
|
||||
## Exploring further
|
||||
|
||||
Every data set contains more variables and observations than it displays. You can use the values in your data to compute new variables or to measure new new, group-level observations on subgroups of your data. These new variables and observations provide a further source of insights that you can explore with visualizations, clustering algorithms, and models.
|
||||
> Every data set contains more variables and observations than it displays.
|
||||
|
||||
### Making new variables
|
||||
You now know how to explore the variables displayed in your data set, but you should know that these are not the only variables in your data. Nor are the observations that are displayed in your data the only observations. You can use the values in your data to compute new variables or to measure new (group-level) observations. These new variables and observations provide a further source of insights that you can explore with visualizations, clustering algorithms, and models.
|
||||
|
||||
### To make new variables
|
||||
|
||||
Use dplyr's `mutate()` function to calculate new variables from your existing variables.
|
||||
|
||||
|
@ -526,13 +549,13 @@ diamonds %>%
|
|||
head()
|
||||
```
|
||||
|
||||
The window functions from Chapter 3 are particularly useful for calculating new variables. To calculate a variable from two or more variables, use basic operators or the `map2()`, `map3()`, and `map_n()` functions from purr. You will learn more about purrr in Chapter ?.
|
||||
The window functions from Chapter 3 are particularly useful for calculating new variables. To calculate a variable from two or more variables, use basic operators or the `map2()`, `map3()`, and `map_n()` functions from purrr. You will learn more about purrr in Chapter ?.
|
||||
|
||||
Statisticians can use R to extract potential variables with more sophisticated algorithms. R provides `prcomp()` for Principal Components Analysis and `factanal()` for factor analysis. The psych and SEM packages also provide further tools for working with latent variables.
|
||||
If you are statistically trained, you can use R to extract potential variables with more sophisticated algorithms. R provides `prcomp()` for Principal Components Analysis and `factanal()` for factor analysis. The psych and SEM packages also provide further tools for working with latent variables.
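
A minimal sketch of the `prcomp()` route might look like the following. It assumes the built-in `mtcars` data and an arbitrary choice of six numeric columns, neither of which comes from this chapter, and it does not cover `factanal()`.

```r
# Illustrative only: a handful of numeric columns from the built-in `mtcars` data.
vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# Principal components analysis; scale. = TRUE standardises each variable first
# so that variables measured in large units do not dominate.
pca <- prcomp(vars, scale. = TRUE)

# Proportion of variance explained by each component.
summary(pca)

# The component scores live in pca$x; here the first component is attached
# to the data as a new variable (the name PC1 is chosen for this sketch).
mtcars_pc <- cbind(vars, PC1 = pca$x[, 1])
head(mtcars_pc)
```

The new `PC1` column can then be explored like any other variable, for example by plotting it against the original variables.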
|
||||
|
||||
### Making new observations
|
||||
### To make new observations
|
||||
|
||||
If your data set contains subgroups, you can derive from your data a new data set of observations that describe the subgroups. To do this, first use dplyr's `group_by()` function to group the data into subgroups. Then use dplyr's `summarise()` function to calculate group level values. The measures of location, rank and spread listed in Chapter 3 are particularly useful for describing subgroups.
|
||||
If your data set contains subgroups, you can derive from your data a new data set of observations that describe the subgroups. To do this, first use dplyr's `group_by()` function to group the data into subgroups. Then use dplyr's `summarise()` function to calculate group level statistics. The measures of location, rank and spread listed in Chapter 3 are particularly useful for describing subgroups.
|
||||
|
||||
```{r}
|
||||
mpg %>%
|
||||
|
@ -542,41 +565,41 @@ mpg %>%
|
|||
|
||||
## A last word on variables, values, and observations
|
||||
|
||||
Variables, values, and observations provide a basis for Exploratory Data Analysis: if a relationship exists between two variables, then the relationship will exist between the values of those variables when those values are measured in the same observation. As a result, relationships between variables will appear as patterns in your data.
|
||||
Variables, values, and observations provide a basis for Exploratory Data Analysis: _if a relationship exists between two_ variables, _then the relationship will exist between the_ values _of those variables when those values are measured in the same_ observation. As a result, relationships between variables will appear as patterns in your data.
|
||||
|
||||
Within any particular observation, the exact form of the relationship between values may be obscured by mediating factors, measurement error, or random noise; which means the patterns in your data will appear as signals obscured by noise.
|
||||
Within any particular observation, the exact form of the relationship between variables may be obscured by mediating factors, measurement error, or random noise; which means that the patterns in your data will appear as signals obscured by noise.
|
||||
|
||||
Due to a quirk of the human cognitive system, the easiest way to spot the signal amidst the noise is to visualize your data. The concepts of variables, values, and observations make this easy to do. To visualize your data, represent each observation with its own geometric object, such as a point. Then map each variable to an aesthetic property of the point, setting specific values of the variable to specific levels of the aesthetic. Or compute group-level statistics (i.e. observations) and map them to geoms, something that `geom_bar()`, `geom_boxplot()` and other geoms do for you automatically.
|
||||
Due to a quirk of the human cognitive system, the easiest way to spot signal amidst noise is to visualize your data. The concepts of variables, values, and observations have a role to play here as well. To visualize your data, represent each observation with its own geometric object, such as a point. Then map each variable to an aesthetic property of the point, setting specific values of the variable to specific levels of the aesthetic. You could also compute group-level statistics from your data (i.e. new observations) and map them to geoms, something that `geom_bar()`, `geom_boxplot()` and other geoms do for you automatically.
|
||||
|
||||
## Exploratory Data Analysis and Data Science
|
||||
|
||||
As a term, data science has been used in many ways by different people. This fluidity is necessary for a term that describes a wide breadth of activity, as data science does. Although different data science activities will take different forms, you can use the principles in this chapter to build a general model of data science. The model requires one limit to the definition of data science: data science must rely in some way on human judgement and expertise.
|
||||
As a term, "data science" has been used in different ways by many people. This fluidity is necessary for a term that describes a wide breadth of activity, as data science does. Nonetheless, you can use the principles in this chapter to build a general model of data science. The model requires one limit to the definition of data science: data science must rely in some way on human judgement applied to data.
|
||||
|
||||
To judge or interpret the information in a data set, you must first comprehend that information. Data is difficult to comprehend, which means that you need to visualize, model, and transform it, a process that we have referred to as Exploratory Data Analysis.
|
||||
To judge or interpret the information in a data set, you must first comprehend that information, which is difficult to do. The easiest way to comprehend data is to visualize, transform, and model it, a process that we have referred to as Exploratory Data Analysis.
|
||||
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/EDA-data-science-1.png")
|
||||
```
|
||||
|
||||
Once you comprehend the information in your data, you can use it to make inferences from your data. Often this involves making deductions from a model. This is what you do when you conduct a hypothesis test, make a prediction (wth or without a confidence interval), or score cases in a database.
|
||||
Once you comprehend the information in your data, you can use it to make inferences from your data. Often this involves making deductions from a model. This is what you do when you conduct a hypothesis test, make a prediction (with or without a confidence interval), or score cases in a database.
|
||||
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/EDA-data-science-2.png")
|
||||
```
|
||||
|
||||
But all of this will involve a computer; you can make little headway with pencil and paper calculations when you work with data. To work efficiently, you will need to know how to program in a computer language, such as R, import data to use with that language, and tidy the data into the format that works best for that language.
|
||||
But all of this work will involve a computer; you cannot do it in your head, nor on paper with a pencil. To work efficiently, you will need to know how to program in a computer language, such as R. You will also need to know how to import data to use with the computer language, and how to tidy the data into the format that works best for that computer language.
|
||||
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/EDA-data-science-3.png")
|
||||
```
|
||||
|
||||
Finally, if your work is meaningful at all you will need to report it in a way that your audience can understand. Your audience might be fellow scientists who will want to ensure that the work is reproducible, non-scientists who will need to understand your findings in plain language, or future you who will be thankful if you make it easy to come back up to speed on your work and recreate it as necessary. To satisfy these audiences, you may choose to communicate your results in a report or to bundle your work into some type of useful format, like a package or a Shiny app.
|
||||
Finally, if your work is meaningful at all, it will have an audience, which means that you will need to share your work in a way that your audience can understand. Your audience might be fellow scientists who will want to reproduce the work, non-scientists who will want to understand your findings in plain terms, or yourself (in the future) who will be thankful if you make your work easy to re-learn and recreate. To satisfy these audiences, you may choose to communicate your results in a report or to bundle your work into some type of useful format, like an R package or a Shiny app.
|
||||
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/EDA-data-science-4.png")
|
||||
```
|
||||
|
||||
This model forms a roadmap for the rest of the book.
|
||||
This model of data science forms a roadmap for the rest of the book.
|
||||
|
||||
* Part 1 of the book covered the central tasks of the model above, Exploratory Data Analysis.
|
||||
|
||||
|
@ -584,9 +607,9 @@ This model forms a roadmap for the rest of the book.
|
|||
|
||||
* Part 3 will teach you some of the most efficient ways to program in R with data.
|
||||
|
||||
* Part 4 discusses models and how to apply them.
|
||||
* Part 4 will discuss models and how to apply them.
|
||||
|
||||
* Part 5 will teach you the most popular format for reporting and reproducing the results of an R analysis.
|
||||
* Part 5 will teach you R Markdown, the most popular format for reporting and reproducing the results of an R analysis.
|
||||
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/EDA-data-science-5.png")
|
||||
|
|