parent
74cb7d57f6
commit
a0eba42266
22
EDA.Rmd
|
@ -8,7 +8,7 @@ This chapter will show you how to use visualisation and transformation to explor
|
|||
|
||||
1. Search for answers by visualising, transforming, and modelling your data.
|
||||
|
||||
1. Use what you learn to refine your questions and or generate new questions.
|
||||
1. Use what you learn to refine your questions and/or generate new questions.
|
||||
|
||||
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will hone in on a few particularly productive areas that you'll eventually write up and communicate to others.
|
||||
|
||||
|
@ -34,7 +34,7 @@ library(dplyr)
|
|||
|
||||
Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
|
||||
|
||||
EDA is fundamentally a creative process. And like most creative processes, the key to asking _quality_ questions is to generate a large _quantity_ of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought provoking questions---if you follow up each question with a new question based on what you find.
|
||||
EDA is fundamentally a creative process. And like most creative processes, the key to asking _quality_ questions is to generate a large _quantity_ of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought-provoking questions---if you follow up each question with a new question based on what you find.
|
||||
|
||||
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
|
||||
|
||||
|
@ -97,7 +97,7 @@ diamonds %>%
|
|||
count(cut)
|
||||
```
|
||||
|
||||
A variable is **continuous** if can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram:
|
||||
A variable is **continuous** if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram:
|
||||
|
||||
```{r}
|
||||
ggplot(data = diamonds) +
|
||||
|
@ -111,7 +111,7 @@ diamonds %>%
|
|||
count(cut_width(carat, 0.5))
|
||||
```
|
||||
|
||||
A histogram divides the x axis into equally spaced bins and then uses the height of bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.
|
||||
A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.
|
||||
|
||||
You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
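As a rough sketch of that zoom-and-rebin step: the code below assumes the `diamonds` dataset that ships with ggplot2, and the binwidth of 0.1 is chosen purely for illustration.

```r
library(ggplot2)
library(dplyr)

# Keep only diamonds under three carats, then use a narrower bin.
smaller <- diamonds %>%
  filter(carat < 3)

p <- ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)  # binwidth is in units of `carat`
p

# The same binning expressed as a table of counts per bin.
bin_counts <- smaller %>%
  count(cut_width(carat, 0.1))
```

Trying a few other values for `binwidth` on the same data is a quick way to see which features of the distribution are stable and which depend on the binning.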
|
||||
|
||||
|
@ -132,7 +132,7 @@ ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
|
|||
|
||||
There are a few challenges with this type of plot, which we will come back to in [visualising a categorical and a continuous variable](#cat-cont).
|
||||
|
||||
Now that you can visualise variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow up questions for each type of information. The key to asking good follow up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
|
||||
Now that you can visualise variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
|
||||
|
||||
### Typical values
|
||||
|
||||
|
@ -244,7 +244,7 @@ If you've encountered unusual values in your dataset, and simply want to move on
|
|||
variable you might find that you don't have any data left!
|
||||
|
||||
1. Instead, I recommend replacing the unusual values with missing values.
|
||||
The easiest way to do this is use `mutate()` to replace the variable
|
||||
The easiest way to do this is to use `mutate()` to replace the variable
|
||||
with a modified copy. You can use the `ifelse()` function to replace
|
||||
unusual values with `NA`:
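A minimal sketch of this replacement, using a small made-up table rather than real data. The column `y` and the cut-offs 3 and 20 are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical data: `y` should lie roughly between 3 and 20,
# so 0 and 58.9 are treated as unusual in this illustration.
measurements <- tibble(id = 1:5, y = c(4.2, 0, 5.1, 58.9, 6.3))

# Overwrite `y` with a copy in which unusual values become NA.
cleaned <- measurements %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

cleaned
```

All five rows are kept; only the suspect measurements are blanked out, so the other variables in those rows remain available.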
|
||||
|
||||
|
@ -288,7 +288,7 @@ However this plot isn't great because there are many more non-cancelled flights
|
|||
### Exercises
|
||||
|
||||
1. What happens to missing values in a histogram? What happens to missing
|
||||
values in bar chart? Why is there a difference?
|
||||
values in a bar chart? Why is there a difference?
|
||||
|
||||
1. What does `na.rm = TRUE` do in `mean()` and `sum()`?
|
||||
|
||||
|
@ -298,7 +298,7 @@ If variation describes the behavior _within_ a variable, covariation describes t
|
|||
|
||||
### A categorical and continuous variable {#cat-cont}
|
||||
|
||||
It's common to want to explore the distribution of a continuous variable broken down by a categorical, as in the previous frequency polygon. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, let's explore how the price of a diamond varies with its quality:
|
||||
It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, let's explore how the price of a diamond varies with its quality:
|
||||
|
||||
```{r}
|
||||
ggplot(data = diamonds, mapping = aes(x = price)) +
|
||||
|
@ -332,7 +332,7 @@ Another alternative to display the distribution of a continuous variable broken
|
|||
|
||||
* Visual points that display observations that fall more than 1.5 times the
|
||||
IQR from either edge of the box. These outlying points are unusual
|
||||
so are plotted individually
|
||||
so are plotted individually.
|
||||
|
||||
* A line (or whisker) that extends from each end of the box and goes to the
|
||||
farthest non-outlier point in the distribution.
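The 1.5 times IQR rule described above can be worked through by hand. This is only a sketch on a made-up vector, using the default `quantile()` definition, which may differ in detail from the one a plotting function uses.

```r
# Toy vector with one extreme value (invented for illustration).
x <- c(1, 2, 3, 4, 5, 100)

q <- quantile(x, c(0.25, 0.75))    # the two edges of the box
iqr <- q[[2]] - q[[1]]
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

is_outlier <- x < lower_fence | x > upper_fence
outliers <- x[is_outlier]          # shown as individual points
whiskers <- range(x[!is_outlier])  # farthest non-outlier points
```

Here `outliers` holds only the value 100, and the whiskers run from 1 to 5.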
|
||||
|
@ -593,8 +593,8 @@ diamonds %>%
|
|||
|
||||
## Learning more
|
||||
|
||||
If you want learn more about the mechanics ggplot2, I'd highly recommend grabbing a copy of the ggplot2 book: <https://amzn.com/331924275X>. It's been recently updated, so it includes dplyr and tidyr code, and has much more space to explore all the facets of visualisation. Unfortunately the book isn't generally available for free, but if you have a connection to a university you can probably get an electronic version for free through SpringerLink.
|
||||
If you want to learn more about the mechanics of ggplot2, I'd highly recommend grabbing a copy of the ggplot2 book: <https://amzn.com/331924275X>. It's been recently updated, so it includes dplyr and tidyr code, and has much more space to explore all the facets of visualisation. Unfortunately the book isn't generally available for free, but if you have a connection to a university you can probably get an electronic version for free through SpringerLink.
|
||||
|
||||
Another useful resource is the [_R Graphics Cookbook_](https://amzn.com/1449316956) by Winston Chang. Much of the contents are available online at <http://www.cookbook-r.com/Graphs/>.
|
||||
|
||||
I also recommend [_Graphical Data Analysis with R_](https://amzn.com/1498715230), by Antony Unwin. This is a book length treatment similar to the material covered in this chapter, but has the space to go into much greater depth.
|
||||
I also recommend [_Graphical Data Analysis with R_](https://amzn.com/1498715230), by Antony Unwin. This is a book-length treatment similar to the material covered in this chapter, but has the space to go into much greater depth.
|
||||
|
|
|
@ -8,9 +8,9 @@ The goal of the first part of this book is to get you up to speed with the basic
|
|||
knitr::include_graphics("diagrams/data-science-explore.png")
|
||||
```
|
||||
|
||||
You will get frustrated when you start programming in R, because it is such a stickler for mistakes. Even one character out of place will cause it to complain. However, that frustration is both typical and temporary. It happens to everyone, and the only way to get over it is to keep trying.
|
||||
You will get frustrated when you start programming in R, because it is such a stickler. Even one character out of place will cause it to complain. However, that frustration is both typical and temporary. It happens to everyone, and the only way to get over it is to keep trying.
|
||||
|
||||
The goal of this part of the book is to get you to some useful tools with an immediate payoff as quickly as possible:
|
||||
The goal of this part of the book is to get you some useful tools with an immediate payoff as quickly as possible:
|
||||
|
||||
* Visualisation is a great place to start with R programming, because the
|
||||
payoff is so clear: you get to make elegant and informative plots that help
|
||||
|
|
|
@ -89,7 +89,7 @@ Another option that commonly needs tweaking is `na`: this specifies the value (o
|
|||
read_csv("a,b,c\n1,2,.", na = ".")
|
||||
```
|
||||
|
||||
This is all you need to know to read ~75% of csv files that you'll encounter in practice. You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`. To read in more challenging files, you'll need to learn more about how readr parses each column, turning them in to R vectors.
|
||||
This is all you need to know to read ~75% of csv files that you'll encounter in practice. You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`. To read in more challenging files, you'll need to learn more about how readr parses each column, turning them into R vectors.
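To illustrate how little changes between these readers, here is a small sketch that uses inline strings in place of files. The data, column names, and widths are invented for the example.

```r
library(readr)

# Tab-separated: same interface as read_csv(), different delimiter.
tsv <- read_tsv("a\tb\tc\n1\t2\t.", na = ".")

# Fixed width: column widths and names are supplied explicitly.
fwf <- read_fwf(
  "abc12\ndef34\n",
  col_positions = fwf_widths(c(3, 2), col_names = c("name", "num"))
)
```

Arguments such as `na` are used the same way with `read_tsv()` as with `read_csv()`; the main extra work with `read_fwf()` is describing where each column starts and ends.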
|
||||
|
||||
### Compared to base R
|
||||
|
||||
|
|
|
@ -98,7 +98,7 @@ It's common to think about modelling as a tool for hypothesis confirmation, and
|
|||
|
||||
## Prerequisites
|
||||
|
||||
We've made few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and it's helpful if you have some programming experience already. If you've never programmed before, you might find [Hands on Programming with R](http://amzn.com/1449359019) by Garrett to be a useful adjunct to this book.
|
||||
We've made a few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and it's helpful if you have some programming experience already. If you've never programmed before, you might find [Hands on Programming with R](http://amzn.com/1449359019) by Garrett to be a useful adjunct to this book.
|
||||
|
||||
To run the code in this book, you will need to install both R and the RStudio IDE. Both are open source, free, and easy to install:
|
||||
|
||||
|
@ -196,7 +196,7 @@ There are three things you need to include to make your example reproducible: re
|
|||
|
||||
Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script in.
|
||||
|
||||
You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way to is follow what Hadley, Garrett, and everyone else at RStudio are doing on the [RStudio blog](https://blog.rstudio.org). This is where we post announcements about new packages, new IDE features, and in-person courses. You might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)) or Garrett ([\@statgarrett](https://twitter.com/statgarrett)) on Twitter, or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
|
||||
You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way is to follow what Hadley, Garrett, and everyone else at RStudio are doing on the [RStudio blog](https://blog.rstudio.org). This is where we post announcements about new packages, new IDE features, and in-person courses. You might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)) or Garrett ([\@statgarrett](https://twitter.com/statgarrett)) on Twitter, or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
|
||||
|
||||
To keep up with the R community more broadly, we recommend reading <http://www.r-bloggers.com>: it aggregates over 500 blogs about R from around the world. If you're an active Twitter user, follow the `#rstats` hashtag. Twitter is one of the key tools that Hadley uses to keep up with new developments in the community.
|
||||
|
||||
|
|
|
@ -8,7 +8,7 @@ In this part of the book, you'll improve your programming skills. Programming is
|
|||
knitr::include_graphics("diagrams/data-science-program.png")
|
||||
```
|
||||
|
||||
Programming produces code, and code is a tool of communication. Obviously code tells the computer what you want it to do. But it also communicates meaning to other humans. Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative. Even if you're not working with other people, you'll definitely be working with future-you! Writing clear code is important so that others (like future-you) can understand your why you tackled an analysis in the way you did. That means getting better at programming also involves getting better at communicating. Over time, you want your code to become not just easier to write, but easier for others to read.
|
||||
Programming produces code, and code is a tool of communication. Obviously code tells the computer what you want it to do. But it also communicates meaning to other humans. Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative. Even if you're not working with other people, you'll definitely be working with future-you! Writing clear code is important so that others (like future-you) can understand why you tackled an analysis in the way you did. That means getting better at programming also involves getting better at communicating. Over time, you want your code to become not just easier to write, but easier for others to read.
|
||||
|
||||
Writing code is similar in many ways to writing prose. One parallel which I find particularly useful is that in both cases rewriting is the key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn't mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions the more likely your first attempt will be clear.)
|
||||
|
||||
|
@ -35,7 +35,7 @@ In the following four chapters, you'll learn skills that will allow you to both
|
|||
|
||||
## Learning more
|
||||
|
||||
The goal of these chapters is to teach you the minimum about programming that you need to practice data science, which turns out to be a reasonable amount. Once you have mastered the material in this book, I strongly believe you should invest further in your programming skills. Learning more about programming is a long-term investment: it won't pay off immediately, but in the long-term it will allow you to solve new problems more quickly, and let you reuse your insights from previous problems in new scenarios.
|
||||
The goal of these chapters is to teach you the minimum about programming that you need to practice data science, which turns out to be a reasonable amount. Once you have mastered the material in this book, I strongly believe you should invest further in your programming skills. Learning more about programming is a long-term investment: it won't pay off immediately, but in the long term it will allow you to solve new problems more quickly, and let you reuse your insights from previous problems in new scenarios.
|
||||
|
||||
To learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:
|
||||
|
||||
|
@ -49,5 +49,5 @@ To learn more you need to study R as a programming language, not just an interac
|
|||
* [_Advanced R_](https://amzn.com/1466586966) by Hadley Wickham. This dives into the
|
||||
details of R the programming language. This is a great place to start if you
|
||||
have existing programming experience. It's also a great next step once you've
|
||||
internalised the ideas in these chapters. You can read it online at at
|
||||
internalised the ideas in these chapters. You can read it online at
|
||||
<http://adv-r.had.co.nz>.
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
## Introduction
|
||||
|
||||
Throughout this book we work with "tibbles" instead of R's traditional data.frame. Tibbles _are_ data frames, but they tweak some older behaviours to make life a littler easier. R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It's difficult to change base R without breaking existing code, so most innovation occurs in packages. Here we will describe the __tibble__ package, which provides opinionated data frames that make working in the tidyverse a little easier.
|
||||
Throughout this book we work with "tibbles" instead of R's traditional data.frame. Tibbles _are_ data frames, but they tweak some older behaviours to make life a little easier. R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It's difficult to change base R without breaking existing code, so most innovation occurs in packages. Here we will describe the __tibble__ package, which provides opinionated data frames that make working in the tidyverse a little easier.
|
||||
|
||||
If this chapter leaves you wanting to learn more about tibbles, you might enjoy `vignette("tibble")`.
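As a quick check of the claim that tibbles _are_ data frames, the following sketch converts the built-in `mtcars` data frame and inspects the result. It assumes only that the tibble package is installed.

```r
library(tibble)

tb <- as_tibble(mtcars)  # mtcars itself is a regular data frame

is.data.frame(tb)   # a tibble still counts as a data frame
is_tibble(tb)       # and is recognised as a tibble
is_tibble(mtcars)   # the original is not
class(tb)           # the extra classes sit in front of "data.frame"
```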
|
||||
|
||||
|
@ -146,7 +146,7 @@ The main reason that some older functions don't work with tibble is the `[` func
|
|||
|
||||
## Exercises
|
||||
|
||||
1. How can you tell if an object is a tibble? (Hint: trying print `mtcars`,
|
||||
1. How can you tell if an object is a tibble? (Hint: try printing `mtcars`,
|
||||
which is a regular data frame).
|
||||
|
||||
1. Practice referring to non-syntactic names by:
|
||||
|
|
20
tidy.Rmd
|
@ -8,7 +8,7 @@
|
|||
> "Tidy datasets are all alike, but every messy dataset is messy in its
|
||||
> own way." --– Hadley Wickham
|
||||
|
||||
In this chapter, you will learn a consistent way to organise your data in R, an organisation called __tidy data__. Getting your data into this format requires some upfront work, but that work pays off in the long-term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
|
||||
In this chapter, you will learn a consistent way to organise your data in R, an organisation called __tidy data__. Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
|
||||
|
||||
This chapter will give you a practical introduction to tidy data and the accompanying tools in the __tidyr__ package. If you'd like to learn more about the underlying theory, you might enjoy the *Tidy Data* paper published in the Journal of Statistical Software, <http://www.jstatsoft.org/v59/i10/paper>.
|
||||
|
||||
|
@ -23,7 +23,7 @@ library(dplyr)
|
|||
|
||||
## Tidy data
|
||||
|
||||
You can represent the same underlying data in multiple ways. The example below shows the same data organised in four different ways. Each dataset shows the same values of four variables *country*, *year*, *population*, and *cases*, but each dataset organises the values in different way.
|
||||
You can represent the same underlying data in multiple ways. The example below shows the same data organised in four different ways. Each dataset shows the same values of four variables *country*, *year*, *population*, and *cases*, but each dataset organises the values in a different way.
|
||||
|
||||
```{r}
|
||||
table1
|
||||
|
@ -68,7 +68,7 @@ Why ensure that your data is tidy? There are two main advantages:
|
|||
built-in R functions work with vectors of values. That makes transforming
|
||||
tidy data feel particularly natural.
|
||||
|
||||
dplyr, ggplot2, and all other the packages in the tidyverse are designed to work with tidy data. Here are a couple of small examples showing how you might work with `table1`.
|
||||
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a couple of small examples showing how you might work with `table1`.
|
||||
|
||||
```{r, out.width = "50%"}
|
||||
# Compute rate per 10,000
|
||||
|
@ -256,7 +256,7 @@ As you might have guessed from the common `key` and `value` arguments, `spread()
|
|||
|
||||
## Separating and uniting
|
||||
|
||||
So far you've learned how to tidy `table2` and `table4`, but not `table3`. `table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`). To fix this problem, we'll need the `separate()` function. You'll also learn about complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
|
||||
So far you've learned how to tidy `table2` and `table4`, but not `table3`. `table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`). To fix this problem, we'll need the `separate()` function. You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
|
||||
|
||||
### Separate
|
||||
|
||||
|
@ -286,7 +286,7 @@ table3 %>%
|
|||
|
||||
(Formally, `sep` is a regular expression, which you'll learn more about in [strings].)
|
||||
|
||||
Look carefully at the column types: you'll notice that `case` and `population` are character columns. This is the default behaviour in `separate()`: it leaves the type of the column as is. Here, however, it's not very useful those really are numbers. We can ask `separate()` to try and convert to better types using `convert = TRUE`:
|
||||
Look carefully at the column types: you'll notice that `case` and `population` are character columns. This is the default behaviour in `separate()`: it leaves the type of the column as is. Here, however, it's not very useful as those really are numbers. We can ask `separate()` to try and convert to better types using `convert = TRUE`:
|
||||
|
||||
```{r}
|
||||
table3 %>%
|
||||
|
@ -304,7 +304,7 @@ table3 %>%
|
|||
|
||||
### Unite
|
||||
|
||||
`unite()` is inverse of `separate()`: it combines multiple columns into a single column. You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
|
||||
`unite()` is the inverse of `separate()`: it combines multiple columns into a single column. You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
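A small round trip makes the inverse relationship concrete. The table below is hypothetical, invented for illustration, with a year split across `century` and `year` columns.

```r
library(tidyr)
library(tibble)

# Hypothetical table with a four-digit year stored in two pieces.
split_years <- tibble(
  country = c("A", "B"),
  century = c("19", "20"),
  year    = c("99", "00")
)

# Combine the two columns into one, with no separator between them.
united <- split_years %>%
  unite(full_year, century, year, sep = "")

# Split it again after the second character to recover the originals.
round_trip <- united %>%
  separate(full_year, into = c("century", "year"), sep = 2)
```

The explicit `sep = ""` is needed here because `unite()` otherwise places an underscore between the combined values.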
|
||||
|
||||
```{r tidy-unite, echo = FALSE, out.width = "75%", fig.cap = "Uniting `table5` makes it tidy"}
|
||||
knitr::include_graphics("images/tidy-18.png")
|
||||
|
@ -370,7 +370,7 @@ There are two missing values in this dataset:
|
|||
|
||||
One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
|
||||
|
||||
The way that a dataset is represented can make implicit values explicit. For example, we can make the implicit missing value explicit putting years in the columns:
|
||||
The way that a dataset is represented can make implicit values explicit. For example, we can make the implicit missing value explicit by putting years in the columns:
|
||||
|
||||
```{r}
|
||||
stocks %>%
|
||||
|
@ -431,7 +431,7 @@ who
|
|||
|
||||
This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, `who` is messy, and we'll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you'll usually need to string together multiple verbs into a pipeline.
|
||||
|
||||
The best place to start is almost always to gathering together the columns that are not variables. Let's have a look at what we've got:
|
||||
The best place to start is almost always to gather together the columns that are not variables. Let's have a look at what we've got:
|
||||
|
||||
* It looks like `country`, `iso2`, and `iso3` are three variables that
|
||||
redundantly specify the country.
|
||||
|
@ -486,7 +486,7 @@ You might be able to parse this out by yourself with a little thought and some e
|
|||
* `5564` = 55 -- 64 years old
|
||||
* `65` = 65 or older
|
||||
|
||||
We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of `new_rel_` we have `newrel` (it's hard to spot this here but if you don't fix it we'll get errors in subsequent steps). You'll learn about `str_replace()` in [strings], but the basic idea is pretty simple: replace the characters "newrel" with "new_rel". This makes all variable names consistent.
|
||||
We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of `new_rel` we have `newrel` (it's hard to spot this here but if you don't fix it we'll get errors in subsequent steps). You'll learn about `str_replace()` in [strings], but the basic idea is pretty simple: replace the characters "newrel" with "new_rel". This makes all variable names consistent.
|
||||
|
||||
```{r}
|
||||
who2 <- who1 %>%
|
||||
|
@ -550,7 +550,7 @@ who %>%
|
|||
|
||||
## Non-tidy data
|
||||
|
||||
Before we continue on to other topics, it's worth talking briefly about non-tidy data. Earlier in the chapter, I used the pejorative term "messy" to refer to non-tidy data. That's an oversimplification: there are lots of useful and well founded data structures that are not tidy data. There are two mains reasons to use other data structures:
|
||||
Before we continue on to other topics, it's worth talking briefly about non-tidy data. Earlier in the chapter, I used the pejorative term "messy" to refer to non-tidy data. That's an oversimplification: there are lots of useful and well-founded data structures that are not tidy data. There are two main reasons to use other data structures:
|
||||
|
||||
* Alternative representations may have substantial performance or space
|
||||
advantages.
|
||||
|
|
|
@ -106,7 +106,7 @@ near(1 / 49 * 49, 1)
|
|||
|
||||
Multiple arguments to `filter()` are combined with "and": every expression must be true in order for a row to be included in the output. For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not". Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
|
||||
|
||||
```{r bool-ops, echo = FALSE, fig.cap = "Complete set of boolean operations. `x` is the left-hand circle, `y` is the right hand circle, and the shaded region show which parts each operator selects."}
|
||||
```{r bool-ops, echo = FALSE, fig.cap = "Complete set of boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded region show which parts each operator selects."}
|
||||
knitr::include_graphics("diagrams/transform-logical.png")
|
||||
```
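The sketch below shows the three operators inside `filter()`. The data frame is a toy one made up for this example.

```r
library(dplyr)

# Toy data, invented for illustration.
df <- tibble(x = c(1, 2, 3, 4), y = c("a", "b", "a", "b"))

filter(df, x > 1, y == "a")    # comma-separated arguments mean "and"
filter(df, x > 1 & y == "a")   # the same rows, written with &
filter(df, x > 3 | y == "a")   # "or": either condition is enough
filter(df, !(x > 1))           # "not": rows where x > 1 is false
```

The first two calls return identical results, which is a handy way to convince yourself that the comma in `filter()` really is an "and".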
|
||||
|
||||
|
@ -192,7 +192,7 @@ filter(df, is.na(x) | x > 1)
|
|||
|
||||
1. Find all flights that
|
||||
|
||||
1. Had an arrival delay of two or more hours.
|
||||
1. Had an arrival delay of two or more hours
|
||||
1. Flew to Houston (`IAH` or `HOU`)
|
||||
1. Were operated by United, American, or Delta
|
||||
1. Departed in summer (July, August, and September)
|
||||
|
@ -675,7 +675,7 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
|
|||
* Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`. The root mean squared deviation,
|
||||
or standard deviation or sd for short, is the standard measure of spread.
|
||||
The interquartile range `IQR()` and median absolute deviation `mad(x)`
|
||||
are robust equivalents that maybe more useful if you have outliers.
|
||||
are robust equivalents that may be more useful if you have outliers.
|
||||
|
||||
```{r}
|
||||
# Why is distance to some destinations more variable than to others?
|
||||
|
@ -772,7 +772,7 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
|
|||
|
||||
### Grouping by multiple variables
|
||||
|
||||
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset:
|
||||
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset:
|
||||
|
||||
```{r}
|
||||
daily <- group_by(flights, year, month, day)
|
||||
|
|
|
@ -229,7 +229,7 @@ ggplot(data = mpg)
|
|||
|
||||
If you're still stuck, try the help. You can get help about any R function by running `?function_name` in the console, or selecting the function name and pressing F1 in RStudio. Don't worry if the help doesn't seem that helpful - instead skip down to the examples and look for code that matches what you're trying to do.
|
||||
|
||||
If that doesn't help, carefully read the error message. Sometimes the answer will be buried there! But when you're new to R, the answer might be in the error message but you don't yet know how to understand it. Another great tool is Google: trying googling the error message, as it's likely someone else has had the same problem, and have gotten help online.
|
||||
If that doesn't help, carefully read the error message. Sometimes the answer will be buried there! But when you're new to R, the answer might be in the error message but you don't yet know how to understand it. Another great tool is Google: try googling the error message, as it's likely someone else has had the same problem and has gotten help online.
|
||||
|
||||
## Facets
|
||||
|
||||
|
@ -536,7 +536,7 @@ knitr::include_graphics("images/visualization-stats.png")
|
|||
|
||||
### Exercises
|
||||
|
||||
1. In our proportion barchart, we need to set `group = 1`. Why? In other
|
||||
1. In our proportion bar chart, we need to set `group = 1`. Why? In other
|
||||
words, why is this graph not useful?
|
||||
|
||||
```{r, eval = FALSE}
|
||||
|
@ -548,7 +548,7 @@ knitr::include_graphics("images/visualization-stats.png")
|
|||
|
||||
## Position adjustments
|
||||
|
||||
There's one more piece of magic associated with bar charts. You can colour bar chart using either the `colour` aesthetic, or more usefully, `fill`:
|
||||
There's one more piece of magic associated with bar charts. You can colour a bar chart using either the `colour` aesthetic, or more usefully, `fill`:
|
||||
|
||||
```{r fig.width = 3, out.width = "50%", fig.align = "default"}
|
||||
ggplot(data = diamonds) +
|
||||
|
|
|
@ -53,9 +53,9 @@ Make another assignment:
|
|||
this_is_a_really_long_name <- 2.5
|
||||
```
|
||||
|
||||
To inspect this object, try out RStudio's completion facility: type the "this", press TAB, add characters until you have a unique prefix, then press return.
|
||||
To inspect this object, try out RStudio's completion facility: type "this", press TAB, add characters until you have a unique prefix, then press return.
|
||||
|
||||
Ooops, you made a mistake! `this_is_a_really_long_name` should have value 3.5 not 2.5. Use another keyboard short to help you fix it. Type "this" then press Cmd/Ctrl + ↑. That will list all the commands you've typed that start those letters. Use the arrow keys to navigate, then press enter to retype the command. Change 2.5 to 3.5 and rerun.
|
||||
Ooops, you made a mistake! `this_is_a_really_long_name` should have value 3.5 not 2.5. Use another keyboard shortcut to help you fix it. Type "this" then press Cmd/Ctrl + ↑. That will list all the commands you've typed that start with those letters. Use the arrow keys to navigate, then press enter to retype the command. Change 2.5 to 3.5 and rerun.
|
||||
|
||||
Make yet another assignment:
|
||||
|
||||
|
|
|
@ -9,7 +9,7 @@ One day you will need to quit R, go do something else and return to your analysi
|
|||
|
||||
## What is real?
|
||||
|
||||
As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) "real". However, in the long-run, you'll be much better off if you consider your R scripts as "real".
|
||||
As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) "real". However, in the long run, you'll be much better off if you consider your R scripts as "real".
|
||||
|
||||
With your R scripts (and your data files), you can recreate the environment. It's much harder to recreate your R scripts from your environment! You'll either have to retype a lot of code from memory (making mistakes all the way) or you'll have to carefully mine your R history.
|
||||
|
||||
|
@ -43,9 +43,9 @@ getwd()
|
|||
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
|
||||
```
|
||||
|
||||
As a beginning R user, it's OK let your home directory or any other weird directory on your computer be R's working directory. But you're six chapters into this book, and you're no longer a rank beginner. Very soon now you should evolve to organising your analytical projects into directories and, when working on project A, set R's working directory to the associated directory.
|
||||
As a beginning R user, it's OK to let your home directory or any other weird directory on your computer be R's working directory. But you're six chapters into this book, and you're no longer a rank beginner. Very soon now you should evolve to organising your analytical projects into directories and, when working on a project, setting R's working directory to the associated directory.
|
||||
|
||||
__I do not recommend it__, but you also set the working directory from within R:
|
||||
__I do not recommend it__, but you can also set the working directory from within R:
|
||||
|
||||
```{r eval = FALSE}
|
||||
setwd("~/myCoolProject")
|
||||
|
@ -91,12 +91,12 @@ write_csv(diamonds, "diamonds.csv")
|
|||
|
||||
Quit RStudio. Inspect the folder associated with your project --- notice the `.Rproj` file. Double-click that file to re-open the project. Notice you get back to where you left off: it's the same working directory and command history, and all the files you were working on are still open. Because you followed my instructions above, you will, however, have a completely fresh environment, guaranteeing that you're starting with a clean slate.
|
||||
|
||||
In your favorite OS-specific way, search your computer for `diamonds.pdf` and you will find the PDF (no surprise) but _also the script that created it_ (`diamonds.r`). This is huge win! One day you will want to remake a figure or just understand where it came from. If you rigorously save figures to file __with R code__ and never with the mouse or the clipboard, you will be able to reproduce old work with ease!
|
||||
In your favorite OS-specific way, search your computer for `diamonds.pdf` and you will find the PDF (no surprise) but _also the script that created it_ (`diamonds.r`). This is a huge win! One day you will want to remake a figure or just understand where it came from. If you rigorously save figures to files __with R code__ and never with the mouse or the clipboard, you will be able to reproduce old work with ease!
|
||||
|
||||
In summary, RStudio projects give you a solid workflow that will serve you well in the future:
|
||||
|
||||
* Create an RStudio project for each data analysis project.
|
||||
* Keep data files there; we'll talk about importing a bit later in [import].
|
||||
* Create an RStudio project for each data analysis project.
|
||||
* Keep data files there; we'll talk about importing a bit later in [data import].
|
||||
* Keep scripts there; edit them, run them in bits or as a whole.
|
||||
* Save your outputs there.
|
||||
|
||||
|
|
|
@ -1,12 +1,12 @@
|
|||
# Workflow: scripts
|
||||
|
||||
So far you've been using the console to run code. That's a great place to start, but you'll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and dplyr pipes. To give yourself more room to work, it's a great idea to use the script editor. Open it up either clicking the File menu, and selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N. Now you'll see four panes:
|
||||
So far you've been using the console to run code. That's a great place to start, but you'll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and dplyr pipes. To give yourself more room to work, it's a great idea to use the script editor. Open it up either by clicking the File menu, and selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N. Now you'll see four panes:
|
||||
|
||||
```{r echo = FALSE, out.width = "75%"}
|
||||
knitr::include_graphics("diagrams/rstudio-editor.png")
|
||||
```
|
||||
|
||||
The script editor is a great place to put code you care about. Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor. RStudio will automatically save the contents of the editor when you quit RStudio, and will automatically load it when you re-open. Nevertheless, it's a good idea to regularly save your scripts and to back them up.
|
||||
The script editor is a great place to put code you care about. Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor. RStudio will automatically save the contents of the editor when you quit RStudio, and will automatically load it when you re-open. Nevertheless, it's a good idea to save your scripts regularly and to back them up.
|
||||
|
||||
## Running code
|
||||
|
||||
|
@ -26,9 +26,9 @@ not_cancelled %>%
|
|||
|
||||
Instead of running expression-by-expression, you can also execute the complete script in one step: Cmd/Ctrl + Shift + S. Doing this regularly is a great way to check that you've captured all the important parts of your code in the script.
|
||||
|
||||
I recommend that you always start your script with the packages that you need. That way, if you share you code with others, they can easily see what packages they need to install.
|
||||
I recommend that you always start your script with the packages that you need. That way, if you share your code with others, they can easily see what packages they need to install.
|
||||
|
||||
When working through future chapters, I highly recommend starting in the editor and practicing your the keyboard shortcuts. Over time, sending code to the console in this way will become so natural that you won't even think about it.
|
||||
When working through future chapters, I highly recommend starting in the editor and practicing your keyboard shortcuts. Over time, sending code to the console in this way will become so natural that you won't even think about it.
|
||||
|
||||
## RStudio diagnostics
|
||||
|
||||
|
|
|
@ -32,7 +32,7 @@ Data wrangling also encompasses data transformation, which you've already learne
|
|||
manipulating strings.
|
||||
|
||||
* [Factors] are how R stores categorical data. They are used when a variable
|
||||
has a fixed set of possible values, or when you want to non-alphabetical
|
||||
has a fixed set of possible values, or when you want to use a non-alphabetical
|
||||
ordering of a string.
|
||||
|
||||
* [Dates and times] will give you the key tools for working with
|
||||
|
|