Merge branch 'master' of github.com:hadley/r4ds
commit
b2de993c5e
2
EDA.Rmd
@@ -494,7 +494,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
 1. Visualise the distribution of carat, partitioned by price.

 1. How does the price distribution of very large diamonds compare to small
-diamonds. Is it as you expect, or does it surprise you?
+diamonds? Is it as you expect, or does it surprise you?

 1. Combine two of the techniques you've learned to visualise the
 combined distribution of cut, carat, and price.
@@ -384,7 +384,7 @@ one_pm
 one_pm + ddays(1)
 ```

-Why is one day after 1pm on March 12, 2pm on March 13?! If you look carefully at the date you might also notice that the time zones have changed. Because of DST, March 12 only has 23 hours, so if add a full days worth of seconds we end up with a different time.
+Why is one day after 1pm on March 12, 2pm on March 13?! If you look carefully at the date you might also notice that the time zones have changed. Because of DST, March 12 only has 23 hours, so if we add a full days worth of seconds we end up with a different time.

 ### Periods

@@ -538,7 +538,7 @@ x1 - x2
 x1 - x3
 ```

-Unless other specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and roughly equivalent to its predecessor GMT (Greenwich Mean Time). It does not have DST, which makes a convenient representation for computation. Operations that combine date-times, like `c()`, will often drop the time zone. In that case, the date-times will display in your local time zone:
+Unless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and roughly equivalent to its predecessor GMT (Greenwich Mean Time). It does not have DST, which makes a convenient representation for computation. Operations that combine date-times, like `c()`, will often drop the time zone. In that case, the date-times will display in your local time zone:

 ```{r}
 x4 <- c(x1, x2, x3)
@@ -209,8 +209,8 @@ Another type of reordering is useful when you are colouring the lines on a plot.
 ```{r, fig.align = "default", out.width = "50%", fig.width = 4}
 by_age <- gss_cat %>%
 filter(!is.na(age)) %>%
-group_by(age, marital) %>%
-count() %>%
+count(age, marital) %>%
+group_by(age) %>%
 mutate(prop = n / sum(n))

 ggplot(by_age, aes(age, prop, colour = marital)) +
@@ -280,7 +280,7 @@ if (condition) {

 To get help on `if` you need to surround it in backticks: `` ?`if` ``. The help isn't particularly helpful if you're not already an experienced programmer, but at least you know how to get to it!

-Here's a simple function that uses an if statement. The goal of this function is to return a logical vector describing whether or not each element of a vector is named.
+Here's a simple function that uses an `if` statement. The goal of this function is to return a logical vector describing whether or not each element of a vector is named.

 ```{r}
 has_name <- function(x) {
@@ -158,7 +158,7 @@ That's all there is to the for loop! Now is a good time to practice creating som

 ## For loop variations

-Once you have the basic for loop under your belt, there are some variations that you should be aware of. These variations are important regardless of how you do iteration, so don't forget about them once you've master the FP techniques you'll learn about in the next section.
+Once you have the basic for loop under your belt, there are some variations that you should be aware of. These variations are important regardless of how you do iteration, so don't forget about them once you've mastered the FP techniques you'll learn about in the next section.

 There are four variations on the basic theme of the for loop:

@@ -51,7 +51,7 @@ There are lots of high-level helpers to do these resampling methods in R. We're

 <http://topepo.github.io/caret>. [Applied Predictive Modeling](https://amzn.com/1461468485), by Max Kuhn and Kjell Johnson.

-If you're competing in competitions, like Kaggle, that are predominantly about creating good predicitons, developing a good strategy for avoiding overfitting is very important. Otherwise you risk tricking yourself into thinking that you have a good model, when in reality you just have a model that does a good job of fitting your data.
+If you're competing in competitions, like Kaggle, that are predominantly about creating good predictions, developing a good strategy for avoiding overfitting is very important. Otherwise you risk tricking yourself into thinking that you have a good model, when in reality you just have a model that does a good job of fitting your data.

 There is a closely related family that uses a similar idea: model ensembles. However, instead of trying to find the best models, ensembles make use of all the models, acknowledging that even models that don't fit all the data particularly well can still model some subsets well. In general, you can think of model ensemble techniques as functions that take a list of models, and a return a single model that attempts to take the best part of each.

@@ -155,7 +155,7 @@ models %>%

 But do you think this model will do well if we apply it to new data from the same population?

-In real-life you can't easily go out and recollect your data. There are two approach to help you get around this problem. I'll introduce them briefly here, and then we'll go into more depth in the following sections.
+In real-life you can't easily go out and recollect your data. There are two approaches to help you get around this problem. I'll introduce them briefly here, and then we'll go into more depth in the following sections.

 ```{r}
 boot <- bootstrap(df, 100) %>%
@@ -181,7 +181,7 @@ last_plot() +

 Bootstrapping is a useful tool to help us understand how the model might vary if we'd collected a different sample from the population. A related technique is cross-validation which allows us to explore the quality of the model. It works by repeatedly splitting the data into two pieces. One piece, the training set, is used to fit, and the other piece, the test set, is used to measure the model quality.

-The following code generates 100 test-training splits, holding out 20% of the data for testing each time. We then fit a model to the training set, and evalute the error on the test set:
+The following code generates 100 test-training splits, holding out 20% of the data for testing each time. We then fit a model to the training set, and evaluate the error on the test set:

 ```{r}
 cv <- crossv_mc(df, 100) %>%
@@ -192,7 +192,7 @@ cv <- crossv_mc(df, 100) %>%
 cv
 ```

-Obviously, a plot is going to help us see distribution more easily. I've added our original estimate of the model error as a white vertical line (where the same dataset is used for both training and teseting), and you can see it's very optimistic.
+Obviously, a plot is going to help us see distribution more easily. I've added our original estimate of the model error as a white vertical line (where the same dataset is used for both training and testing), and you can see it's very optimistic.

 ```{r}
 cv %>%
@@ -202,7 +202,7 @@ cv %>%
 geom_rug()
 ```

-The distribution of errors is highly skewed: there are a few cases which have very high errors. These respresent samples where we ended up with a few cases on all with low values or high values of x. Let's take a look:
+The distribution of errors is highly skewed: there are a few cases which have very high errors. These represent samples where we ended up with a few cases on all with low values or high values of x. Let's take a look:

 ```{r}
 filter(cv, rmse > 1.5) %>%
@@ -214,13 +214,13 @@ filter(cv, rmse > 1.5) %>%

 All of the models that fit particularly poorly were fit to samples that either missed the first one or two or the last one or two observation. Because polynomials shoot off to positive and negative, they give very bad predictions for those values.

-Now that we've given you a quick overview and intuition for these techniques, lets dive in more more detail.
+Now that we've given you a quick overview and intuition for these techniques, let's dive in more detail.

 ## Resamples

 ### Building blocks

-Both the boostrap and cross-validation are build on top of a "resample" object. In modelr, you can access these low-level tools directly with the `resample_*` functions.
+Both the boostrap and cross-validation are built on top of a "resample" object. In modelr, you can access these low-level tools directly with the `resample_*` functions.

 These functions return an object of class "resample", which represents the resample in a memory efficient way. Instead of storing the resampled dataset itself, it instead stores the integer indices, and a "pointer" to the original dataset. This makes resamples take up much less memory.

@@ -250,7 +250,7 @@ If you get a strange error, it's probably because the modelling function doesn't
 ```

 `strap` gives the bootstrap sample dataset, and `.id` assigns a
-unique identifer to each model (this is often useful for plotting)
+unique identifier to each model (this is often useful for plotting)

 * `crossv_mc()` return a data frame with three columns:

@@ -290,7 +290,7 @@ It's called the $R^2$ because for simple models like this, it's just the square
 cor(heights$income, heights$height) ^ 2
 ```

-The $R^2$ is an ok single number summary, but I prefer to think about the unscaled residuals because it's easier to interpret in the context of the original data. As you'll also learn later, it's also a rather optimistic interpretation of the model. Because you're asssessing the model using the same data that was used to fit it, it really gives more of an upper bound on the quality of the model, not a fair assessment.
+The $R^2$ is an ok single number summary, but I prefer to think about the unscaled residuals because it's easier to interpret in the context of the original data. As you'll also learn later, it's also a rather optimistic interpretation of the model. Because you're assessing the model using the same data that was used to fit it, it really gives more of an upper bound on the quality of the model, not a fair assessment.



@@ -10,7 +10,7 @@ There are two parts to a model:

 1. First, you define a __family of models__ that express a precise, but
 generic, pattern that you want to capture. For example, the pattern
-might be a straight line, or a quadatric curve. You will express
+might be a straight line, or a quadratic curve. You will express
 the model family as an equation like `y = a_1 * x + a_2` or
 `y = a_1 * x ^ a_2`. Here, `x` and `y` are known variables from your
 data, and `a_1` and `a_2` are parameters that can vary to capture
@@ -185,7 +185,7 @@ ggplot(sim1, aes(x, y)) +

 Don't worry too much about the details of how `optim()` works. It's the intuition that's important here. If you have a function that defines the distance between a model and a dataset, an algorithm that can minimise that distance by modifying the parameters of the model, you can find the best model. The neat thing about this approach is that it will work for any family of models that you can write an equation for.

-There's one more approach that we can use for this model, because it's is a special case of a broader family: linear models. A linear model has the general form `y = a_1 + a_2 * x_1 + a_3 * x_2 + ... + a_n * x_(n - 1)`. So this simple model is equivalent to a general linear model where n is 2 and `x_1` is `x`. R has a tool specifically designed for fitting linear models called `lm()`. `lm()` has a special way to specify the model family: formulas. Formulas look like `y ~ x`, which `lm()` will translate to a function like `y = a_1 + a_2 * x`. We can fit the model and look at the output:
+There's one more approach that we can use for this model, because it's a special case of a broader family: linear models. A linear model has the general form `y = a_1 + a_2 * x_1 + a_3 * x_2 + ... + a_n * x_(n - 1)`. So this simple model is equivalent to a general linear model where n is 2 and `x_1` is `x`. R has a tool specifically designed for fitting linear models called `lm()`. `lm()` has a special way to specify the model family: formulas. Formulas look like `y ~ x`, which `lm()` will translate to a function like `y = a_1 + a_2 * x`. We can fit the model and look at the output:

 ```{r}
 sim1_mod <- lm(y ~ x, data = sim1)
@@ -214,7 +214,7 @@ These are exactly the same values we got with `optim()`! Behind the scenes `lm()

 ```{r}
 measure_distance <- function(mod, data) {
-diff <- data$y - make_prediction(mod, data)
+diff <- data$y - model1(mod, data)
 mean(abs(diff))
 }
 ```
@@ -165,10 +165,10 @@ Nothing really jumps out at me here, but it's probably worth spending time consi
 the relationship between `price` and `carat`?

 1. Extract the diamonds that have very high and very low residuals.
-Is there anything unusual about these diamonds? Are the particularly bad
+Is there anything unusual about these diamonds? Are they particularly bad
 or good, or do you think these are pricing errors?

-1. Does the final model, `mod_diamonds2`, do a good job of predicting
+1. Does the final model, `mod_diamond2`, do a good job of predicting
 diamond prices? Would you trust it to tell you how much to spend
 if you were buying a diamond?

@@ -15,7 +15,7 @@ In this chapter you're going to learn three powerful ideas that help you to work
 because once you have tidy data, you can apply all of the techniques that
 you've learned about earlier in the book.

-We'll start by diving into a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signal so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends.
+We'll start by diving into a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signals so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends.

 The following sections will dive into more detail about the individual techniques:

@@ -107,7 +107,7 @@ by_country

 (I'm cheating a little by grouping on both `continent` and `country`. Given `country`, `continent` is fixed, so this doesn't add any more groups, but it's an easy way to carry an extra variable along for the ride.)

-This creates an data frame that has one row per group (per country), and a rather unusual column: `data`. `data` is a list of data frames (or tibbles, to be precise). This seems like a crazy idea: we have a data frame with a column that is a list of other data frames! I'll explain shortly why I think this is a good idea.
+This creates a data frame that has one row per group (per country), and a rather unusual column: `data`. `data` is a list of data frames (or tibbles, to be precise). This seems like a crazy idea: we have a data frame with a column that is a list of other data frames! I'll explain shortly why I think this is a good idea.

 The `data` column is a little tricky to look at because it's a moderately complicated list, and we're still working on good tools to explore these objects. Unfortunately using `str()` is not recommended as it will often produce very long output. But if you pluck out a single element from the `data` column you'll see that it contains all the data for that country (in this case, Afghanistan).

@@ -266,7 +266,7 @@ We see two main effects here: the tragedies of the HIV/AIDS epidemic and the Rwa
 1. To create the last plot (showing the data for the countries with the
 worst model fits), we needed two steps: we created a data frame with
 one row per country and then semi-joined it to the original dataset.
-It's possible avoid this join if we use `unnest()` instead of
+It's possible to avoid this join if we use `unnest()` instead of
 `unnest(.drop = TRUE)`. How?

 ## List-columns
@@ -377,14 +377,14 @@ df %>%
 unnest()
 ```

-(If you find yourself using this pattern a lot, make sure to check out `tidyr:separate_rows()` which is a wrapper around this common pattern).
+(If you find yourself using this pattern a lot, make sure to check out `tidyr::separate_rows()` which is a wrapper around this common pattern).

 Another example of this pattern is using the `map()`, `map2()`, `pmap()` from purrr. For example, we could take the final example from [Invoking different functions] and rewrite it to use `mutate()`:

 ```{r}
 sim <- tribble(
 ~f, ~params,
-"runif", list(min = -1, max = -1),
+"runif", list(min = -1, max = 1),
 "rnorm", list(sd = 5),
 "rpois", list(lambda = 10)
 )
@@ -14,7 +14,7 @@ library(magrittr)

 ## Piping alternatives

-The point of the pipe is to help you write code in a way that easier to read and understand. To see why the pipe is so useful, we're going to explore a number of ways of writing the same code. Let's use code to tell a story about a little bunny named Foo Foo:
+The point of the pipe is to help you write code in a way that is easier to read and understand. To see why the pipe is so useful, we're going to explore a number of ways of writing the same code. Let's use code to tell a story about a little bunny named Foo Foo:

 > Little bunny Foo Foo
 > Went hopping through the forest
@@ -127,11 +127,11 @@ Finally, we can use the pipe:
 ```{r, eval = FALSE}
 foo_foo %>%
 hop(through = forest) %>%
-scoop(up = field_mouse) %>%
+scoop(up = field_mice) %>%
 bop(on = head)
 ```

-This is my favourite form, because it focusses on verbs, not nouns. You can read this series of function compositions like it's a set of imperative actions. Foo Foo hops, then scoops, then bops. The downside, of course, is that you need to be familiar with the pipe. If you've never seen `%>%` before, you'll have no idea what this code does. Fortunately, most people pick up the idea very quickly, so when you share you code with others who aren't familiar with the pipe, you can easily teach them.
+This is my favourite form, because it focusses on verbs, not nouns. You can read this series of function compositions like it's a set of imperative actions. Foo Foo hops, then scoops, then bops. The downside, of course, is that you need to be familiar with the pipe. If you've never seen `%>%` before, you'll have no idea what this code does. Fortunately, most people pick up the idea very quickly, so when you share your code with others who aren't familiar with the pipe, you can easily teach them.

 The pipe works by performing a "lexical transformation": behind the scenes, magrittr reassembles the code in the pipe to a form that works by overwriting an intermediate object. When you run a pipe like the one above, magrittr does something like this:

@@ -368,7 +368,7 @@ So far, the pairs of tables have always been joined by a single variable, and th
 variables from `x` will be used in the output.

 For example, if we want to draw a map we need to combine the flights data
-with the airports data which contains the location (`lat` and `long`) of
+with the airports data which contains the location (`lat` and `lon`) of
 each airport. Each flight has an origin and destination `airport`, so we
 need to specify which one we want to join to:

@@ -336,7 +336,7 @@ You can control many other "whole document" settings by tweaking the parameters

 R Markdown documents can include one or more parameters whose values can be set when you render the report. Parameters are useful when you want to re-render the same report with distinct values for various key inputs. For example, you might be producing sales reports per branch, exam results by student, or demographic summaries by country. To declare one or more parameters, use the `params` field.

-This example use a `my_class` parameter to determines which class of cars to display:
+This example uses a `my_class` parameter to determine which class of cars to display:

 ```{r, echo = FALSE, out.width = "100%", comment = ""}
 cat(readr::read_file("rmarkdown/fuel-economy.Rmd"))
@@ -428,5 +428,5 @@ There are two important topics that we haven't covered here: collaboration, and
 1. The "Git and GitHub" chapter of _R Packages_, by Hadley. You can also
 read it for free online: <http://r-pkgs.had.co.nz/git.html>.

-I have also not touched on what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, I highly recommend reading either [_Style: Lessons in Clarity and Grace_](https://amzn.com/0134080416) by Joseph M. Williams & Joseph Bizup, or [_The Sense of Structure: Writing from the Reader's Perspective_](https://amzn.com/0205296327) by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear. (These books are rather expensive if purchased new, but they're used by many English classes so there are plenty of cheap second-hand copies). George Gopen also has a number of short articles on writing at <http://georgegopen.com/articles/litigation/>. They are aimed at lawyers, but almost everything applies to data scientists too.
+I have also not touched on what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, I highly recommend reading either [_Style: Lessons in Clarity and Grace_](https://amzn.com/0134080416) by Joseph M. Williams & Joseph Bizup, or [_The Sense of Structure: Writing from the Reader's Perspective_](https://amzn.com/0205296327) by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear. (These books are rather expensive if purchased new, but they're used by many English classes so there are plenty of cheap second-hand copies). George Gopen also has a number of short articles on writing at <https://www.georgegopen.com/the-litigation-articles.html>. They are aimed at lawyers, but almost everything applies to data scientists too.

17
strings.Rmd
@@ -299,6 +299,17 @@ There are a number of special patterns that match more than one character. You'v

 Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.

+A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex. Many people find this more readable.
+
+```{r}
+# Look for a literal character that normally has special meaning in a regex
+str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
+str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
+str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
+```
+
+This works for most (but not all) regex metacharacters: `$` `.` `|` `?` `*` `+` `(` `)` `[` `{`. Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: `]` `\` `^` and `-`.
+
 You can use _alternation_ to pick between one or more alternative patterns. For example, `abc|d..f` will match either '"abc"', or `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz` not `abcyz` or `abxyz`. Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:

 ```{r}
@@ -389,7 +400,7 @@ str_view(x, 'C[LX]+?')

 ### Grouping and backreferences

-Earlier, you learned about parentheses as a way to disambiguate complex expressions. They also define "groups" that you can refer to with _backreferences_, like `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a repeated pair of letters.
+Earlier, you learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a _numbered_ capturing group (number 1, 2 etc.). A capturing group stores _the part of the string_ matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with _backreferences_, like `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a repeated pair of letters.

 ```{r}
 str_view(fruit, "(..)\\1", match = TRUE)
@@ -519,7 +530,7 @@ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\]
 ?:\r\n)?[ \t])*))*)?;\s*)
 ```

-This is a somewhat pathological example (because email addresses are actually suprisingly complex), but is used in real code. See the stackoverflow discussion at <http://stackoverflow.com/a/201378> for more details.
+This is a somewhat pathological example (because email addresses are actually surprisingly complex), but is used in real code. See the stackoverflow discussion at <http://stackoverflow.com/a/201378> for more details.

 Don't forget that you're in a programming language and you have other tools at your disposal. Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps. If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.

@@ -853,7 +864,7 @@ You can use the other arguments of `regex()` to control details of the match:
 phone <- regex("
 \\(? # optional opening parens
 (\\d{3}) # area code
-[)- ]? # optional closing parens, dash, or space
+[) -]? # optional closing parens, space, or dash
 (\\d{3}) # another three numbers
 [ -]? # optional space or dash
 (\\d{3}) # three more numbers
13
tidy.Rmd
@@ -184,13 +184,14 @@ To tidy this up, we first analyse the representation in similar way to `gather()
 * The column that contains variable names, the `key` column. Here, it's
 `type`.

-* The column that contains values forms multiple variables, the `value`
+* The column that contains values from multiple variables, the `value`
 column. Here it's `count`.

 Once we've figured that out, we can use `spread()`, as shown programmatically below, and visually in Figure \@ref(fig:tidy-spread).

 ```{r}
-spread(table2, key = type, value = count)
+table2 %>%
+spread(key = type, value = count)
 ```

 ```{r tidy-spread, echo = FALSE, out.width = "100%", fig.cap = "Spreading `table2` makes it tidy"}
@@ -285,7 +286,7 @@ table3 %>%

 (Formally, `sep` is a regular expression, which you'll learn more about in [strings].)

-Look carefully at the column types: you'll notice that `case` and `population` are character columns. This is the default behaviour in `separate()`: it leaves the type of the column as is. Here, however, it's not very useful as those really are numbers. We can ask `separate()` to try and convert to better types using `convert = TRUE`:
+Look carefully at the column types: you'll notice that `cases` and `population` are character columns. This is the default behaviour in `separate()`: it leaves the type of the column as is. Here, however, it's not very useful as those really are numbers. We can ask `separate()` to try and convert to better types using `convert = TRUE`:

 ```{r}
 table3 %>%
@@ -525,9 +526,9 @@ I've shown you the code a piece at a time, assigning each interim result to a ne

 ```{r, results = "hide"}
 who %>%
-gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
-mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
-separate(code, c("new", "var", "sexage")) %>%
+gather(key, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
+mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
+separate(key, c("new", "var", "sexage")) %>%
 select(-new, -iso2, -iso3) %>%
 separate(sexage, c("sex", "age"), sep = 1)
 ```
@@ -101,7 +101,7 @@ There's another common problem you might encounter when using `==`: floating poi

 ```{r}
 sqrt(2) ^ 2 == 2
-1/49 * 49 == 1
+1 / 49 * 49 == 1
 ```

 Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on `==`, use `near()`:
@@ -273,7 +273,7 @@ There are a number of helper functions you can use within `select()`:
 This one matches any variables that contain repeated characters. You'll
 learn more about regular expressions in [strings].

-* `num_range("x", 1:3)` matches `x1`, `x2` and `x3`.
+* `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.

 See `?select` for more details.

@@ -389,7 +389,7 @@ There are many functions for creating new variables that you can use with `mutat

 * Offsets: `lead()` and `lag()` allow you to refer to leading or lagging
 values. This allows you to compute running differences (e.g. `x - lag(x)`)
-or find when values change (`x != lag(x))`. They are most useful in
+or find when values change (`x != lag(x)`). They are most useful in
 conjunction with `group_by()`, which you'll learn about shortly.

 ```{r}
@ -676,7 +676,7 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
|
|||
)
|
||||
```
|
||||
|
||||
* Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`. The mean squared deviation,
|
||||
* Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`. The root mean squared deviation,
|
||||
or standard deviation or sd for short, is the standard measure of spread.
|
||||
The interquartile range `IQR()` and median absolute deviation `mad(x)`
|
||||
are robust equivalents that may be more useful if you have outliers.
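
To see why the robust measures can be more useful, here is a small sketch with a made-up vector containing one outlier (the data are purely illustrative). Note that `sd()` divides by `n - 1` rather than `n`:

```r
# Four small values and one extreme outlier
x <- c(1, 2, 3, 4, 100)

spread <- c(sd = sd(x), IQR = IQR(x), mad = mad(x))
spread
```

The standard deviation is pulled up sharply by the single value of 100, while `IQR()` and `mad()` stay close to the spread of the other four values.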
|
||||
|
@ -877,12 +877,12 @@ Functions that work most naturally in grouped mutates and filters are known as
|
|||
1. What time of day should you fly if you want to avoid delays as much
|
||||
as possible?
|
||||
|
||||
1. For each destination, compute the total minutes of delay. For each,
|
||||
1. For each destination, compute the total minutes of delay. For each
|
||||
flight, compute the proportion of the total delay for its destination.
|
||||
|
||||
1. Delays are typically temporally correlated: even once the problem that
|
||||
caused the initial delay has been resolved, later flights are delayed
|
||||
to allow earlier flights to leave. Using `lag()` explore how the delay
|
||||
to allow earlier flights to leave. Using `lag()`, explore how the delay
|
||||
of a flight is related to the delay of the immediately preceding flight.
|
||||
|
||||
1. Look at each destination. Can you find flights that are suspiciously
|
||||
|
|
|
@ -4,7 +4,7 @@
|
|||
|
||||
So far this book has focussed on tibbles and packages that work with them. But as you start to write your own functions, and dig deeper into R, you need to learn about vectors, the objects that underlie tibbles. If you've learned R in a more traditional way, you're probably already familiar with vectors, as most R resources start with vectors and work their way up to tibbles. I think it's better to start with tibbles because they're immediately useful, and then work your way down to the underlying components.
|
||||
|
||||
Vectors are particularly important as most of the functions you will write will work with vectors. It is possible to write functions that work with tibbles (like ggplot2, dplyr, and tidyr), but the tools you need to write such functions are currently idiosyncratic and immature. I am working on a better approach, <https://github.com/hadley/lazyeval>, but it will not be ready in time for the publication of the book. Even when complete, you'll still need you understand vectors, it'll just make it easier to write a user-friendly layer on top.
|
||||
Vectors are particularly important as most of the functions you will write will work with vectors. It is possible to write functions that work with tibbles (like ggplot2, dplyr, and tidyr), but the tools you need to write such functions are currently idiosyncratic and immature. I am working on a better approach, <https://github.com/hadley/lazyeval>, but it will not be ready in time for the publication of the book. Even when complete, you'll still need to understand vectors, it'll just make it easier to write a user-friendly layer on top.
|
||||
|
||||
### Prerequisites
|
||||
|
||||
|
@ -48,7 +48,7 @@ Every vector has two key properties:
|
|||
length(x)
|
||||
```
|
||||
|
||||
Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create __augmented vectors__ which build on additional behaviour. There are four important types of augmented vector:
|
||||
Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create __augmented vectors__ which build on additional behaviour. There are three important types of augmented vector:
|
||||
|
||||
* Factors are built on top of integer vectors.
|
||||
* Dates and date-times are built on top of numeric vectors.
|
||||
|
@ -194,7 +194,7 @@ There are two ways to convert, or coerce, one type of vector to another:
|
|||
|
||||
Because explicit coercion is used relatively rarely, and is largely easy to understand, I'll focus on implicit coercion here.
|
||||
|
||||
You've already seen the most important type of implicit coercion: using a logical vector in a numeric context. In this case `TRUE` is converted to `1` and `FALSE` converted to 0. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues:
|
||||
You've already seen the most important type of implicit coercion: using a logical vector in a numeric context. In this case `TRUE` is converted to `1` and `FALSE` converted to `0`. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues:
|
||||
|
||||
```{r}
|
||||
x <- sample(20, 100, replace = TRUE)
|
||||
|
|
|
@ -167,7 +167,7 @@ ggplot(data = mpg) +
|
|||
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
|
||||
```
|
||||
|
||||
Here, the color doesn't convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes _outside_ of `aes()`. You'll need to pick a value that makes sense for that aesthetic:
|
||||
Here, the color doesn't convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes _outside_ of `aes()`. You'll need to pick a level that makes sense for that aesthetic:
|
||||
|
||||
* The name of a color as a character string.
|
||||
|
||||
|
@ -294,7 +294,7 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o
|
|||
|
||||
1. Read `?facet_wrap`. What does `nrow` do? What does `ncol` do? What other
|
||||
options control the layout of the individual panels? Why doesn't
|
||||
`facet_grid()` have `nrow` and `ncol` argument?
|
||||
`facet_grid()` have `nrow` and `ncol` arguments?
|
||||
|
||||
1. When using `facet_grid()` you should usually put the variable with more
|
||||
unique levels in the columns. Why?
|
||||
|
|
|
@ -146,7 +146,7 @@ Here you can see all of the objects that you've created.
|
|||
```{r, eval = FALSE}
|
||||
library(tidyverse)
|
||||
|
||||
ggplot(dota = mpg) +
|
||||
ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy))
|
||||
|
||||
fliter(mpg, cyl = 8)
|
||||
|
|