Fix typos
parent 37979814ca
commit 509d9d8c8d
@@ -120,7 +120,7 @@ ggplot(data = diamonds) +

 ### Asking questions about variation

-Now that you can visualize variation, what should you look for in your plots? And what type of follow up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow up questions for each type of information. The key to asking good follow up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
+Now that you can visualize variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow up questions for each type of information. The key to asking good follow up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).

 * *Typical values*
@@ -211,7 +211,7 @@ ggplot(data = diamonds) +
   geom_count(mapping = aes(x = cut, y = color))
 ```

-The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specifc x values and specific y values. As with bar charts, you can calculate the specific values with `table()`.
+The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values. As with bar charts, you can calculate the specific values with `table()`.

 ```{r}
 table(diamonds$color, diamonds$cut)
@@ -238,7 +238,7 @@ ggplot(data = mpg) +
   geom_boxplot(aes(x = class, y = hwy))
 ```

-Covariation will appear as a systematic change in the medians or IQR's of the boxplots. To make the trend easier to see, wrap the $x$ variable with `reorder()`. The code below reorders the x axis based on the median hwy value of each group.
+Covariation will appear as a systematic change in the medians or IQRs of the boxplots. To make the trend easier to see, wrap the $x$ variable with `reorder()`. The code below reorders the x axis based on the median hwy value of each group.

 ```{r fig.height = 3}
 ggplot(data = mpg) +
@@ -444,7 +444,7 @@ small_iris %>%

 ### K means clustering

-K means clustering provides a simulation based alternative to hierarchical clustering. It identifies the "best" way to group your data into a pre-defined number of clusters. The figure below visualizes (in two dimensional space) the k means algorith:
+K means clustering provides a simulation based alternative to hierarchical clustering. It identifies the "best" way to group your data into a predefined number of clusters. The figure below visualizes (in two dimensional space) the k means algorithm:

 1. Randomly assign each data point to one of $k$ groups
 2. Compute the centroid of each group
@@ -465,7 +465,7 @@ iris_kmeans <- small_iris %>%
 iris_kmeans$cluster
 ```

-Unlike `hclust()`, the k means algorithm does not porvide an intuitive visual interface. Instead, `kmeans()` returns a kmeans class object. Subset the object with `$cluster` to access a list of cluster assignments for your data set, e.g. `iris_kmeans$cluster`. You can visualize the results by mapping them to an aesthetic, or you can apply the results by passing them to dplyr's `group_by()` function.
+Unlike `hclust()`, the k means algorithm does not provide an intuitive visual interface. Instead, `kmeans()` returns a kmeans class object. Subset the object with `$cluster` to access a list of cluster assignments for your data set, e.g. `iris_kmeans$cluster`. You can visualize the results by mapping them to an aesthetic, or you can apply the results by passing them to dplyr's `group_by()` function.

 ```{r}
 ggplot(small_iris, aes(x = Sepal.Width, y = Sepal.Length)) +
@@ -489,7 +489,7 @@ Ask the same questions about clusters that you find with `hclust()` and `kmeans(

 * Might there be a mismatch between the number of clusters that you found and the number that exist in real life? Are only a couple of the clusters meaningful? Are there more clusters in the data than you found?

-* How stable are the clusters if you re-run the algorithm?
+* How stable are the clusters if you rerun the algorithm?

 Keep in mind that both algorithms _will always_ return a set of clusters, whether your data appears clustered or not. As a result, you should always be skeptical about the results. They can be quite insightful, but there is no reason to treat them as a fact without doing further research.
@@ -516,7 +516,7 @@ $$\hat{y} = 0.13 + 0.98 x$$

 which is the equation of the blue model line in the graph above. Even if we did not have the graph, we could use the model coefficients in the equation above to determine that a positive relationship exists between $y$ and $x$ such that a one unit increase in $x$ is associated with an approximately one unit increase in $y$. We could use a model statistic, such as adjusted $r^{2}$ to determine that the relationship is very strong (here adjusted $r^{2} = 0.99$).

-Finally, we could spot outliers in our data by examining the residuals of the model, which are the distances between the actual $y$ values of our data points and the $y$ values that the model would predict for the data points. Observations that are outliers in n-dimensional space will have residuals that are outliers in one dimensional space. You can find these outliers by plotting a histogram of the residuals or by visualizing the residuals against any variable in a two dimenisonal plot.
+Finally, we could spot outliers in our data by examining the residuals of the model, which are the distances between the actual $y$ values of our data points and the $y$ values that the model would predict for the data points. Observations that are outliers in n-dimensional space will have residuals that are outliers in one dimensional space. You can find these outliers by plotting a histogram of the residuals or by visualizing the residuals against any variable in a two dimensional plot.

 ```{r echo = FALSE, fig.width = 3, fig.show='hold'}
 diamond_mod <- lm(y ~ x, data = diamonds3)
@@ -529,7 +529,7 @@ ggplot(resids) +
   geom_point(aes(x = x, y = .resid))
 ```

-You can easily use these techniques with n dimensional relationships that cannot be visualized easily. When you spot a pattern or outlier, ask yourself the same questions that you would ask when you spot a pattern or outlier in a graph. Then visualize the residuals of your model in various ways. If a pattern exists in the residuals, it suggests that your model does not accurately describe the pattern in your data.
+You can easily use these techniques with n-dimensional relationships that cannot be visualized easily. When you spot a pattern or outlier, ask yourself the same questions that you would ask when you spot a pattern or outlier in a graph. Then visualize the residuals of your model in various ways. If a pattern exists in the residuals, it suggests that your model does not accurately describe the pattern in your data.

 I'll postpone teaching you how to fit and interpret models with R until Part 4. Although models are something simple, descriptions of patterns, they are tied into the logic of statistical inference: if a model describes your data accurately _and_ your data is similar to the world at large, then your model should describe the world at large accurately. This chain of reasoning provides a basis for using models to make inferences and predictions. As a result, there is more to learn about models than we can examine here.
@@ -551,7 +551,7 @@ diamonds %>%

 The window functions from Chapter 3 are particularly useful for calculating new variables. To calculate a variable from two or more variables, use basic operators or the `map2()`, `map3()`, and `map_n()` functions from purrr. You will learn more about purrr in Chapter ?.

-If you are statistically trained, you can use R to extract potential variables with more sophisticated algorithms. R provides `prcomp()` for Principle Components Analysis and `factanal()` for factor analysis. The psych and SEM packages also provide further tools for working with latent variables.
+If you are statistically trained, you can use R to extract potential variables with more sophisticated algorithms. R provides `prcomp()` for principal component analysis and `factanal()` for factor analysis. The psych and SEM packages also provide further tools for working with latent variables.

 ### To make new observations
@@ -569,7 +569,7 @@ Variables, values, and observations provide a basis for Exploratory Data Analysi

 Within any particular observation, the exact form of the relationship between variables may be obscured by mediating factors, measurement error, or random noise; which means that the patterns in your data will appear as signals obscured by noise.

-Due to a quirk of the human cognitive system, the easiest way to spot signal admidst noise is to visualize your data. The concepts of variables, values, and observations have a role to play here as well. To visualize your data, represent each observation with its own geometric object, such as a point. Then map each variable to an aesthetic property of the point, setting specific values of the variable to specific levels of the aesthetic. You could also compute group-level statistics from your data (i.e. new observations) and map them to geoms, something that `geom_bar()`, `geom_boxplot()` and other geoms do for you automatically.
+Due to a quirk of the human cognitive system, the easiest way to spot signal amidst noise is to visualize your data. The concepts of variables, values, and observations have a role to play here as well. To visualize your data, represent each observation with its own geometric object, such as a point. Then map each variable to an aesthetic property of the point, setting specific values of the variable to specific levels of the aesthetic. You could also compute group-level statistics from your data (i.e. new observations) and map them to geoms, something that `geom_bar()`, `geom_boxplot()` and other geoms do for you automatically.

 ## Exploratory Data Analysis and Data Science
@@ -599,7 +599,7 @@ Finally, if your work is meaningful at all, it will have an audience, which mean
 knitr::include_graphics("images/EDA-data-science-4.png")
 ```

-This model of data science forms a roadmap for the rest of the book.
+This model of data science forms a road map for the rest of the book.

 * Part 1 of the book covered the central tasks of the model above, Exploratory Data Analysis.