fix list of typos (#488)
parent
4bb10b9ae4
commit
c81d1e0087
|
@ -338,7 +338,7 @@ ggplot(mpg, aes(displ, hwy)) +
|
|||
|
||||
Instead of just tweaking the details a little, you can instead replace the scale altogether. There are two types of scales you're mostly likely to want to switch out: continuous position scales and colour scales. Fortunately, the same principles apply to all the other aesthetics, so once you've mastered position and colour, you'll be able to quickly pick up other scale replacements.
|
||||
|
||||
-It's very useful to plot transformations of your variable. For example, as we've seen in [diamond prices][diamond-prices] it's easier to see the precise relationship between `carat` and `price` if we log transform them:
+It's very useful to plot transformations of your variable. For example, as we've seen in [diamond prices](diamond-prices) it's easier to see the precise relationship between `carat` and `price` if we log transform them:
|
||||
|
||||
```{r, fig.align = "default", out.width = "50%"}
|
||||
ggplot(diamonds, aes(carat, price)) +
|
||||
|
|
|
@ -182,7 +182,7 @@ Now that you know how to get date-time data into R's date-time data structures,
|
|||
|
||||
### Getting components
|
||||
|
||||
-You can pull out individual parts of the date with the acccessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, and `second()`.
+You can pull out individual parts of the date with the accessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, and `second()`.
|
||||
|
||||
```{r}
|
||||
datetime <- ymd_hms("2016-07-08 12:34:56")
|
||||
|
@ -477,7 +477,7 @@ To find out how many periods fall into an interval, you need to use integer divi
|
|||
|
||||
How do you pick between duration, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
|
||||
|
||||
-Figure \@(ref:dt-algebra) summarises permitted arithmetic operations between the different data types.
+Figure \@ref(fig:dt-algebra) summarises permitted arithmetic operations between the different data types.
|
||||
|
||||
```{r dt-algebra, echo = FALSE, fig.cap = "The allowed arithmetic operations between pairs of date/time classes."}
|
||||
knitr::include_graphics("diagrams/datetimes-arithmetic.png")
|
||||
|
|
|
@ -914,7 +914,7 @@ x %>%
|
|||
|
||||
### Reduce and accumulate
|
||||
|
||||
-Sometimes you have a complex list that you want to reduce to a simple list by repeatedly applying a function that reduces a pair to a singleton. This useful if you want to apply a two-table dplyr verb to multiple tables. For example, you might have a list of data frames, and you want to reduce to a single data frame by joining the elements together:
+Sometimes you have a complex list that you want to reduce to a simple list by repeatedly applying a function that reduces a pair to a singleton. This is useful if you want to apply a two-table dplyr verb to multiple tables. For example, you might have a list of data frames, and you want to reduce to a single data frame by joining the elements together:
|
||||
|
||||
```{r}
|
||||
dfs <- list(
|
||||
|
|
|
@ -192,7 +192,7 @@ sim1_mod <- lm(y ~ x, data = sim1)
|
|||
coef(sim1_mod)
|
||||
```
|
||||
|
||||
-These are exactly the same values we got with `optim()`! Behind the scenes `lm()` doesn't use `optim()` but instead takes advantage of the mathematical structure of linear models. Using some connections between geometry, calculus, and linear algebra, `lm()` actually finds the closest model by in a single step, using a sophisticated algorithm. This approach is both faster, and guarantees that there is a global minimum.
+These are exactly the same values we got with `optim()`! Behind the scenes `lm()` doesn't use `optim()` but instead takes advantage of the mathematical structure of linear models. Using some connections between geometry, calculus, and linear algebra, `lm()` actually finds the closest model in a single step, using a sophisticated algorithm. This approach is both faster, and guarantees that there is a global minimum.
|
||||
|
||||
### Exercises
|
||||
|
||||
|
@ -488,7 +488,7 @@ Note my use of `seq_range()` inside `data_grid()`. Instead of using every unique
|
|||
```
|
||||
|
||||
* `trim = 0.1` will trim off 10% of the tail values. This is useful if the
|
||||
-variables has an long tailed distribution and you want to focus on generating
+variables have a long tailed distribution and you want to focus on generating
|
||||
values near the center:
|
||||
|
||||
```{r}
|
||||
|
@ -552,7 +552,7 @@ model_matrix(df, y ~ x^2 + x)
|
|||
model_matrix(df, y ~ I(x^2) + x)
|
||||
```
|
||||
|
||||
-Transformations are useful because you can use them to approximate non-linear functions. If you've taken a calculus class, you may have heard of Taylor's theorem which says you can approximate any smooth function with an infinite sum of polynomials. That means you can use a linear to get arbitrary close to a smooth function by fitting an equation like `y = a_1 + a_2 * x + a_3 * x^2 + a_4 * x ^ 3`. Typing that sequence by hand is tedious, so R provides a helper function: `poly()`:
+Transformations are useful because you can use them to approximate non-linear functions. If you've taken a calculus class, you may have heard of Taylor's theorem which says you can approximate any smooth function with an infinite sum of polynomials. That means you can use a polynomial function to get arbitrarily close to a smooth function by fitting an equation like `y = a_1 + a_2 * x + a_3 * x^2 + a_4 * x ^ 3`. Typing that sequence by hand is tedious, so R provides a helper function: `poly()`:
|
||||
|
||||
```{r}
|
||||
model_matrix(df, y ~ poly(x, 2))
|
||||
|
|
|
@ -154,7 +154,7 @@ diamonds2 %>%
|
|||
arrange(price)
|
||||
```
|
||||
|
||||
-Nothing really jumps out at me here, but it's probably worth spending time considering if this indicates a problem with our model, or if there are a errors in the data. If there are mistakes in the data, this could be an opportunity to buy diamonds that have been priced low incorrectly.
+Nothing really jumps out at me here, but it's probably worth spending time considering if this indicates a problem with our model, or if there are errors in the data. If there are mistakes in the data, this could be an opportunity to buy diamonds that have been priced low incorrectly.
|
||||
|
||||
### Exercises
|
||||
|
||||
|
@ -385,7 +385,7 @@ Either approach is reasonable. Making the transformed variable explicit is usefu
|
|||
|
||||
### Time of year: an alternative approach
|
||||
|
||||
-In the previous section we used our domain knowledge (how the US school term affects travel) to improve the model. An alternative to using making our knowledge explicit in the model is to give the data more room to speak. We could use a more flexible model and allow that to capture the pattern we're interested in. A simple linear trend isn't adequate, so we could try using a natural spline to fit a smooth curve across the year:
+In the previous section we used our domain knowledge (how the US school term affects travel) to improve the model. An alternative to using our knowledge explicitly in the model is to give the data more room to speak. We could use a more flexible model and allow that to capture the pattern we're interested in. A simple linear trend isn't adequate, so we could try using a natural spline to fit a smooth curve across the year:
|
||||
|
||||
```{r}
|
||||
library(splines)
|
||||
|
|
|
@ -13,7 +13,7 @@ In this chapter you're going to learn three powerful ideas that help you to work
|
|||
1. Using the __broom__ package, by David Robinson, to turn models into tidy
|
||||
data. This is a powerful technique for working with large numbers of models
|
||||
because once you have tidy data, you can apply all of the techniques that
|
||||
-you've learned about in earlier in the book.
+you've learned about earlier in the book.
|
||||
|
||||
We'll start by diving into a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signal so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends.
|
||||
|
||||
|
@ -133,7 +133,7 @@ And we want to apply it to every data frame. The data frames are in a list, so w
|
|||
models <- map(by_country$data, country_model)
|
||||
```
|
||||
|
||||
-However, rather than leaving leaving the list of models as a free-floating object, I think it's better to store it as a column in the `by_country` data frame. Storing related objects in columns is a key part of the value of data frames, and why I think list-columns are such a good idea. In the course of working with these countries, we are going to have lots of lists where we have one element per country. So why not store them all together in one data frame?
+However, rather than leaving the list of models as a free-floating object, I think it's better to store it as a column in the `by_country` data frame. Storing related objects in columns is a key part of the value of data frames, and why I think list-columns are such a good idea. In the course of working with these countries, we are going to have lots of lists where we have one element per country. So why not store them all together in one data frame?
|
||||
|
||||
In other words, instead of creating a new object in the global environment, we're going to create a new variable in the `by_country` data frame. That's a job for `dplyr::mutate()`:
|
||||
|
||||
|
@ -194,7 +194,7 @@ resids %>%
|
|||
facet_wrap(~continent)
|
||||
```
|
||||
|
||||
-It looks like we've missed some mild pattern. There's also something interesting going on in Africa: we see some very large residuals which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.
+It looks like we've missed some mild patterns. There's also something interesting going on in Africa: we see some very large residuals which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.
|
||||
|
||||
### Model quality
|
||||
|
||||
|
|
|
@ -216,7 +216,7 @@ y <- tribble(
|
|||
)
|
||||
```
|
||||
|
||||
-The coloured column represents the "key" variable: these are used to match the rows between the tables. The grey column represents the "value" column that is carried along for the ride. In these examples I'll show a single key variable and single value variable, but idea generalises in a straightforward way to multiple keys and multiple values.
+The coloured column represents the "key" variable: these are used to match the rows between the tables. The grey column represents the "value" column that is carried along for the ride. In these examples I'll show a single key variable, but the idea generalises in a straightforward way to multiple keys and multiple values.
|
||||
|
||||
A join is a way of connecting each row in `x` to zero, one, or more rows in `y`. The following diagram shows each potential match as an intersection of a pair of lines.
|
||||
|
||||
|
|
|
@ -4,7 +4,7 @@
|
|||
|
||||
So far you've seen R Markdown used to produce HTML documents. This chapter gives a brief overview of some of the many other types of output you can produce with R Markdown. There are two ways to set the output of a document:
|
||||
|
||||
-1. Permanently, by modifying the the YAML header:
+1. Permanently, by modifying the YAML header:
|
||||
|
||||
```yaml
|
||||
title: "Viridis Demo"
|
||||
|
@ -88,7 +88,7 @@ output:
|
|||
|
||||
## Notebooks
|
||||
|
||||
-A notebook, `html_notebook`, is a variation on a `html_document`. The rendered outputs are very similar, but the purpose is different. A `html_document` is focussed on communicating with decisions makers, while a notebook is focussed on collaborating with other data scientists. These different purposes lead to using the HTML output in different ways. Both HTML outputs will contain the fully rendered output, but the notebook also contains the full source code. That means you can use the `.nb.html` generated by the notebook in two ways:
+A notebook, `html_notebook`, is a variation on a `html_document`. The rendered outputs are very similar, but the purpose is different. A `html_document` is focussed on communicating with decision makers, while a notebook is focussed on collaborating with other data scientists. These different purposes lead to using the HTML output in different ways. Both HTML outputs will contain the fully rendered output, but the notebook also contains the full source code. That means you can use the `.nb.html` generated by the notebook in two ways:
|
||||
|
||||
1. You can view it in a web browser, and see the rendered output. Unlike
|
||||
`html_document`, this rendering always includes an embedded copy of
|
||||
|
@ -238,7 +238,7 @@ Other packages provide even more output formats:
|
|||
* The __bookdown__ package, <https://github.com/rstudio/bookdown>,
|
||||
makes it easy to write books, like this one. To learn more, read
|
||||
[_Authoring Books with R Markdown_](https://bookdown.org/yihui/bookdown/),
|
||||
-by Yihui Xie, which is, of course, written in bookdown, Visit
+by Yihui Xie, which is, of course, written in bookdown. Visit
|
||||
<http://www.bookdown.org> to see other bookdown books written by the
|
||||
wider R community.
|
||||
|
||||
|
|
|
@ -70,7 +70,7 @@ knitr::include_graphics("images/RMarkdownFlow.png")
|
|||
|
||||
To get started with your own `.Rmd` file, select *File > New File > R Markdown...* in the menubar. RStudio will launch a wizard that you can use to pre-populate your file with useful content that reminds you how the key features of R Markdown work.
|
||||
|
||||
-The following sections dives into the three components of an R Markdown document in more details: the markdown text, the code chunks, and the YAML header.
+The following sections dive into the three components of an R Markdown document in more details: the markdown text, the code chunks, and the YAML header.
|
||||
|
||||
### Exercises
|
||||
|
||||
|
@ -187,7 +187,7 @@ The most important set of options controls if your code block is executed and wh
|
|||
of your report, but can be very useful if you need to debug exactly
|
||||
what is going on inside your `.Rmd`. It's also useful if you're teaching R
|
||||
and want to deliberately include an error. The default, `error = FALSE` causes
|
||||
-knitting to failure if there is a single error in the document.
+knitting to fail if there is a single error in the document.
|
||||
|
||||
The following table summarises which types of output each option supressess:
|
||||
|
||||
|
@ -220,7 +220,7 @@ knitr::kable(
|
|||
|
||||
Read the documentation for `?knitr::kable` to see the other ways in which you can customise the table. For even deeper customisation, consider the __xtable__, __stargazer__, __pander__, __tables__, and __ascii__ packages. Each provides a set of tools for returning formatted tables from R code.
|
||||
|
||||
-There are also a rich set of options for controlling how figures embedded. You'll learn about these in [saving your plots].
+There is also a rich set of options for controlling how figures are embedded. You'll learn about these in [saving your plots].
|
||||
|
||||
### Caching
|
||||
|
||||
|
@ -260,7 +260,7 @@ I've used the advice of [David Robinson](https://twitter.com/drob/status/7387866
|
|||
|
||||
### Global options
|
||||
|
||||
-As you work more with knitr, you will discover that some of the default chunk options don't fit your needs and you want to change them. You can do by calling `knitr::opts_chunk$set()` in a code chunk. For example, when writing books and tutorials I set:
+As you work more with knitr, you will discover that some of the default chunk options don't fit your needs and you want to change them. You can do this by calling `knitr::opts_chunk$set()` in a code chunk. For example, when writing books and tutorials I set:
|
||||
|
||||
```{r, eval = FALSE}
|
||||
knitr::opts_chunk$set(
|
||||
|
@ -360,7 +360,7 @@ Alternatively, if you need to produce many such paramterised reports, you can ca
|
|||
rmarkdown::render("fuel-economy.Rmd", params = list(my_class = "suv"))
|
||||
```
|
||||
|
||||
-This is particularly powerful in conjunction with `purrr:pwalk()`. The following example creates a report for each value of `class` found in `mpg`. First we create a data frame that has one row for each class, giving the `filename` of report and the `params` it should be given:
+This is particularly powerful in conjunction with `purrr:pwalk()`. The following example creates a report for each value of `class` found in `mpg`. First we create a data frame that has one row for each class, giving the `filename` of the report and the `params`:
|
||||
|
||||
```{r}
|
||||
reports <- tibble(
|
||||
|
@ -371,7 +371,7 @@ reports <- tibble(
|
|||
reports
|
||||
```
|
||||
|
||||
-Then we match the column names to the argument names of `render()`, and use purrr's **parallel* walk to call `render()` once for each row:
+Then we match the column names to the argument names of `render()`, and use purrr's **parallel** walk to call `render()` once for each row:
|
||||
|
||||
```{r, eval = FALSE}
|
||||
reports %>%
|
||||
|
@ -406,7 +406,7 @@ Smith says blah [-@smith04].
|
|||
|
||||
When R Markdown renders your file, it will build and append a bibliography to the end of your document. The bibliography will contain each of the cited references from your bibliography file, but it will not contain a section heading. As a result it is common practice to end your file with a section header for the bibliography, such as `# References` or `# Bibliography`.
|
||||
|
||||
-You can change the style of your citations and bibliography by reference a CSL (citation style language) file to the `csl` field:
+You can change the style of your citations and bibliography by referencing a CSL (citation style language) file in the `csl` field:
|
||||
|
||||
```yaml
|
||||
bibliography: rmarkdown.bib
|
||||
|
@ -428,5 +428,5 @@ There are two important topics that we haven't covered here: collaboration, and
|
|||
1. The "Git and GitHub" chapter of _R Packages_, by Hadley. You can also
|
||||
read it for free online: <http://r-pkgs.had.co.nz/git.html>.
|
||||
|
||||
-I have also not touched about what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, I highly recommend reading either [_Style: Lessons in Clarity and Grace_](https://amzn.com/0134080416) by Joseph M. Williams & Joseph Bizup, or [_The Sense of Structure: Writing from the Reader's Perspective_](https://amzn.com/0205296327) by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear. (These books are rather expensive if purchased new, but they're used by many English classes so there are plenty of cheap second-hand copies). George Gopen also has a number of short articles on writing at <http://georgegopen.com/articles/litigation/>. They are aimed at lawyers, but almost everything applies to data scientists too.
+I have also not touched on what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, I highly recommend reading either [_Style: Lessons in Clarity and Grace_](https://amzn.com/0134080416) by Joseph M. Williams & Joseph Bizup, or [_The Sense of Structure: Writing from the Reader's Perspective_](https://amzn.com/0205296327) by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear. (These books are rather expensive if purchased new, but they're used by many English classes so there are plenty of cheap second-hand copies). George Gopen also has a number of short articles on writing at <http://georgegopen.com/articles/litigation/>. They are aimed at lawyers, but almost everything applies to data scientists too.
|
||||
|
||||
|
|
|
@ -521,7 +521,7 @@ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\]
|
|||
|
||||
This is a somewhat pathological example (because email addresses are actually suprisingly complex), but is used in real code. See the stackoverflow discussion at <http://stackoverflow.com/a/201378> for more details.
|
||||
|
||||
-Don't forget that you're in a programming language and you have other tools at your disposal. Instead of creating one complex regular expression, it's often easier to a series of simpler regexps. If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
+Don't forget that you're in a programming language and you have other tools at your disposal. Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps. If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
|
||||
|
||||
### Detect matches
|
||||
|
||||
|
@ -618,7 +618,7 @@ Note the use of `str_view_all()`. As you'll shortly learn, many stringr function
|
|||
|
||||
### Extract matches
|
||||
|
||||
-To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to test VOIP systems, but are also useful for practicing regexes. These are provided in `stringr::sentences`:
+To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to test VOIP systems, but are also useful for practicing regexps. These are provided in `stringr::sentences`:
|
||||
|
||||
```{r}
|
||||
length(sentences)
|
||||
|
|
|
@ -175,7 +175,7 @@ Now that you understand the different types of atomic vector, it's useful to rev
|
|||
|
||||
1. How to name the elements of a vector.
|
||||
|
||||
-1. How pull out elements of interest.
+1. How to pull out elements of interest.
|
||||
|
||||
### Coercion
|
||||
|
||||
|
@ -294,7 +294,7 @@ Named vectors are most useful for subsetting, described next.
|
|||
|
||||
### Subsetting {#vector-subsetting}
|
||||
|
||||
-So far we've used `dplyr::filter()` to filter the rows in a tibble. `filter()` only works with tibble, so we'll need new tool for vectors: `[`. `[` is the subsetting function, and is called like `x[a]`. There are four types of thing that you can subset a vector with:
+So far we've used `dplyr::filter()` to filter the rows in a tibble. `filter()` only works with tibble, so we'll need new tool for vectors: `[`. `[` is the subsetting function, and is called like `x[a]`. There are four types of things that you can subset a vector with:
|
||||
|
||||
1. A numeric vector containing only integers. The integers must either be all
|
||||
positive, all negative, or zero.
|
||||
|
@ -553,7 +553,7 @@ There are three very important attributes that are used to implement fundamental
|
|||
1. __Dimensions__ (dims, for short) make a vector behave like a matrix or array.
|
||||
1. __Class__ is used to implement the S3 object oriented system.
|
||||
|
||||
-You've seen names above, and we won't cover dimensions because we don't use matrices in this book. It remains to describe the class, which controls how __generic functions__ work. Generic functions are key to object oriented programming in R, because they make functions behave differently for different classes of input. A detailed discussion of object oriented programming is beyond the scope of this book, but you can read more about it _Advanced R_ at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
+You've seen names above, and we won't cover dimensions because we don't use matrices in this book. It remains to describe the class, which controls how __generic functions__ work. Generic functions are key to object oriented programming in R, because they make functions behave differently for different classes of input. A detailed discussion of object oriented programming is beyond the scope of this book, but you can read more about it in _Advanced R_ at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
|
||||
|
||||
Here's what a typical generic function looks like:
|
||||
|
||||
|
@ -580,7 +580,7 @@ The most important S3 generic is `print()`: it controls how the object is printe
|
|||
|
||||
## Augmented vectors
|
||||
|
||||
-Atomic vectors and lists are the building blocks for other important vector types like factors and dates. I call these __augmented vectors__, because they are vectors with additional __attributes__, including class. Because augmented vectors has a class, they behave differently to the atomic vector on which they are built. In this book, we make use of four important augmented vectors:
+Atomic vectors and lists are the building blocks for other important vector types like factors and dates. I call these __augmented vectors__, because they are vectors with additional __attributes__, including class. Because augmented vectors have a class, they behave differently to the atomic vector on which they are built. In this book, we make use of four important augmented vectors:
|
||||
|
||||
* Factors.
|
||||
* Date-times and times.
|
||||
|
|
|
@ -473,7 +473,7 @@ On the x-axis, the chart displays `cut`, a variable from `diamonds`. On the y-ax
|
|||
* boxplots compute a robust summary of the distribution and display as
|
||||
specially formatted box.
|
||||
|
||||
-The algorithm used calculate new values for a graph is called a __stat__, short for statistical transformation. The figure below describes how this process works with `geom_bar()`.
+The algorithm used to calculate new values for a graph is called a __stat__, short for statistical transformation. The figure below describes how this process works with `geom_bar()`.
|
||||
|
||||
```{r, echo = FALSE, out.width = "100%"}
|
||||
knitr::include_graphics("images/visualization-stat-bar.png")
|
||||
|
|