Local bookdown working
parent bad4c9d975
commit 8e40393cf5
|
@ -8,3 +8,5 @@ temp.Rmd
|
|||
*_files
|
||||
figures
|
||||
.Rapp.history
|
||||
_main.Rmd
|
||||
book_assets
|
||||
|
|
|
@ -31,7 +31,8 @@ Imports:
|
|||
Remotes:
|
||||
gaborcsardi/rcorpora,
|
||||
garrettgman/DSR,
|
||||
hadley/bookdown,
|
||||
hadley/purrr,
|
||||
hadley/stringr,
|
||||
hadley/ggplot2
|
||||
hadley/ggplot2,
|
||||
rstudio/bookdown,
|
||||
yihui/knitr
|
||||
|
|
|
@ -3,7 +3,14 @@
|
|||
This is code and text behind the [R for data science](http://r4ds.had.co.nz)
|
||||
book.
|
||||
|
||||
The site is built using jekyll, with a custom plugin to render `.rmd` files with
|
||||
The site is built using [bookdown]
|
||||
|
||||
```{r}
|
||||
devtools::install_github("yihui/knitr")
|
||||
devtools::install_github("rstudio/bookdown")
|
||||
```
|
||||
|
||||
jekyll, with a custom plugin to render `.rmd` files with
|
||||
knitr and pandoc. To create the site, you need:
|
||||
|
||||
* jekyll gem: `gem install jekyll`
|
||||
|
|
23
_config.yml
|
@ -1,5 +1,20 @@
|
|||
name: R for data science
|
||||
markdown: redcarpet
|
||||
highlighter: pygments
|
||||
rmd_files: [
|
||||
"index.Rmd",
|
||||
"intro.Rmd",
|
||||
"visualize.Rmd",
|
||||
"transform.Rmd",
|
||||
"tidy.Rmd",
|
||||
"model.Rmd",
|
||||
"import.Rmd",
|
||||
"eda.Rmd",
|
||||
"rmarkdown.Rmd",
|
||||
"shiny.Rmd",
|
||||
"data-structures.Rmd",
|
||||
"functions.Rmd",
|
||||
"strings.Rmd",
|
||||
"datetimes.Rmd",
|
||||
"lists.Rmd",
|
||||
"model-vis.Rmd",
|
||||
"model-assess.Rmd",
|
||||
]
|
||||
|
||||
exclude: ["CONTRIBUTING.md", "README.md", "book", "vendor"]
|
||||
|
|
|
@ -1,9 +1,3 @@
|
|||
---
|
||||
layout: default
|
||||
title: Data structures
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
# Data structures
|
||||
|
||||
Might be quite brief.
|
||||
|
|
|
@ -1,7 +1 @@
|
|||
---
|
||||
layout: default
|
||||
title: Dates and times
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
# Dates and times
|
||||
|
|
23
eda.Rmd
|
@ -1,9 +1,3 @@
|
|||
---
|
||||
layout: default
|
||||
title: Exploratory data analysis
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
# Exploratory data analysis
|
||||
|
||||
```{r, include = FALSE}
|
||||
|
@ -82,6 +76,7 @@ ggplot(data = diamonds) +
|
|||
***
|
||||
|
||||
*Tip*: You can compute the counts of a discrete variable quickly with R's `table()` function. These are the numbers that `geom_bar()` visualizes.
|
||||
|
||||
```{r}
|
||||
table(diamonds$cut)
|
||||
```
|
||||
|
@ -94,19 +89,27 @@ The strategy of counting the number of observations at each value breaks down fo
|
|||
|
||||
To get around this, data scientists divide the range of a continuous variable into equally spaced intervals, a process called _binning_.
|
||||
|
||||
`r bookdown::embed_png("images/visualization-17.png", dpi = 300)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/visualization-17.png")
|
||||
```
|
||||
|
||||
They then count how many observations fall into each bin.
|
||||
|
||||
`r bookdown::embed_png("images/visualization-18.png", dpi = 300)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/visualization-18.png")
|
||||
```
|
||||
|
||||
And display the count as a bar, or some other object.
|
||||
|
||||
`r bookdown::embed_png("images/visualization-19.png", dpi = 300)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/visualization-19.png")
|
||||
```
|
||||
|
||||
This method is temperamental because the appearance of the distribution can change dramatically if the bin size changes. As no bin size is "correct," you should explore several bin sizes when examining data.
|
||||
|
||||
`r bookdown::embed_png("images/visualization-20.png", dpi = 300)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/visualization-20.png")
|
||||
```
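
The binning strategy described above can be sketched in base R. This is only an illustration, not the book's own code: the vector `x` and the bin edges are made-up values, and `cut()` with `table()` stands in for the "bin" stat used by the geoms.

```r
# Hypothetical data: ten measurements of a continuous variable.
x <- c(0.2, 0.4, 0.9, 1.1, 1.3, 1.4, 2.6, 2.7, 3.5, 3.9)

# Divide the range into equally spaced intervals (bins) of width 1
# and of width 2, then count the observations that fall into each bin.
bins_width_1 <- table(cut(x, breaks = seq(0, 4, by = 1)))
bins_width_2 <- table(cut(x, breaks = seq(0, 4, by = 2)))

bins_width_1
bins_width_2
```

The same ten values produce four bins at width 1 and only two at width 2, which is why the apparent shape of a distribution depends on the bin size you choose.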
|
||||
|
||||
Several geoms exist to help you visualize continuous distributions. They almost all use the "bin" stat to implement the above strategy. For each of these geoms, you can set the following arguments for "bin" to use:
|
||||
|
||||
|
|
|
@ -1,9 +1,3 @@
|
|||
---
|
||||
layout: default
|
||||
title: Expressing yourself
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
# Expressing yourself in code
|
||||
|
||||
```{r, include = FALSE}
|
||||
|
|
|
@ -1,9 +1,3 @@
|
|||
---
|
||||
layout: default
|
||||
title: Data import
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
# Data import
|
||||
|
||||
```{r, include = FALSE}
|
||||
|
|
13
index.rmd
|
@ -1,7 +1,11 @@
|
|||
---
|
||||
layout: default
|
||||
title: Welcome
|
||||
output: bookdown::html_chapter
|
||||
|
||||
knit: "bookdown::render_book"
|
||||
output:
|
||||
bookdown::html_chapters:
|
||||
lib_dir: "book_assets"
|
||||
---
|
||||
|
||||
# R for Data Science
|
||||
|
@ -11,10 +15,3 @@ This is the book site for __"R for data science"__. This book will teach you how
|
|||
To be published by O'Reilly in July 2016.
|
||||
|
||||
<img src="cover.png" width="250" height="328" alt="Cover image" />
|
||||
|
||||
## Table of contents {#toc}
|
||||
|
||||
<ul class="toc">
|
||||
{% include package-nav.html %}
|
||||
</ul>
|
||||
|
||||
|
|
22
intro.Rmd
|
@ -1,12 +1,6 @@
|
|||
---
|
||||
layout: default
|
||||
title: Welcome
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
# Introduction
|
||||
|
||||
# Welcome
|
||||
|
||||
```{r setup, include = FALSE}
|
||||
```{r setup-intro, include = FALSE}
|
||||
source("common.R")
|
||||
install.packages <- function(...) invisible()
|
||||
```
|
||||
|
@ -17,7 +11,9 @@ Data science is an exciting discipline that allows you to turn raw data into und
|
|||
|
||||
Data science is a huge field, and there's no way you can master it by reading a single book. The goal of this book is to give you a solid foundation with the most important tools. Our model of the tools needed in a typical data science project looks something like this:
|
||||
|
||||
`r bookdown::embed_png("diagrams/data-science.png")`
|
||||
```{r}
|
||||
knitr::include_graphics("diagrams/data-science.png")
|
||||
```
|
||||
|
||||
First you must __import__ your data into R. This typically means that you take data stored in a file, in a database, or in a web API, and load it into a data frame in R. If you can't get your data into R, you can't do data science on it!
|
||||
|
||||
|
@ -108,7 +104,9 @@ To run the code in this book, you will need to install both R and the RStudio ID
|
|||
|
||||
RStudio is an integrated development environment, or IDE, for R programming. There are three key regions:
|
||||
|
||||
`r bookdown::embed_png("screenshots/rstudio-layout.png", dpi = 220)`
|
||||
```{r}
|
||||
knitr::include_graphics("screenshots/rstudio-layout.png")
|
||||
```
|
||||
|
||||
You run R code in the __console__ pane. Textual output appears inline, and graphical output appears in the __output__ pane. You write more complex R scripts in the __editor__ pane.
|
||||
|
||||
|
@ -126,7 +124,9 @@ If you want to see a list of all keyboard shortcuts, use the meta keyboard short
|
|||
|
||||
We strongly recommend making two changes to the default RStudio options:
|
||||
|
||||
`r bookdown::embed_png("screenshots/rstudio-workspace.png", dpi = 220)`
|
||||
```{r}
|
||||
knitr::include_graphics("screenshots/rstudio-workspace.png")
|
||||
```
|
||||
|
||||
This ensures that every time you restart RStudio you get a completely clean slate. This is good practice because it encourages you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code. During a project, it's good practice to regularly restart R either using the menu Session | Restart R or the keyboard shortcut Cmd + Shift + F10.
|
||||
|
||||
|
|
49
lists.Rmd
|
@ -1,15 +1,8 @@
|
|||
---
|
||||
layout: default
|
||||
title: Working with lists
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
# Lists
|
||||
|
||||
```{r setup, include=FALSE}
|
||||
```{r setup-lists, include=FALSE}
|
||||
library(purrr)
|
||||
source("common.R")
|
||||
source("images/embed_jpg.R")
|
||||
```
|
||||
|
||||
In this chapter, you'll learn how to handle lists, the data structure R uses for complex, hierarchical objects. You're already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. You can create a hierarchical structure with a list because, unlike vectors, a list can contain other lists.
|
||||
|
@ -82,7 +75,9 @@ x3 <- list(1, list(2, list(3)))
|
|||
|
||||
I draw them as follows:
|
||||
|
||||
`r bookdown::embed_png("diagrams/lists-structure.png", dpi = 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("diagrams/lists-structure.png")
|
||||
```
|
||||
|
||||
* Lists are rounded rectangles that contain their children.
|
||||
|
||||
|
@ -129,20 +124,22 @@ a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
|
|||
|
||||
Or visually:
|
||||
|
||||
`r bookdown::embed_png("diagrams/lists-subsetting.png", dpi = 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("diagrams/lists-subsetting.png")
|
||||
```
|
||||
|
||||
### Lists of condiments
|
||||
|
||||
It's easy to get confused between `[` and `[[`, but it's important to understand the difference. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember these differences:
|
||||
|
||||
```{r, echo = FALSE}
|
||||
embed_jpg("images/pepper.jpg", 300)
|
||||
knitr::include_graphics("images/pepper.jpg")
|
||||
```
|
||||
|
||||
If this pepper shaker is your list `x`, then, `x[1]` is a pepper shaker containing a single pepper packet:
|
||||
|
||||
```{r, echo = FALSE}
|
||||
embed_jpg("images/pepper-1.jpg", 300)
|
||||
knitr::include_graphics("images/pepper-1.jpg")
|
||||
```
|
||||
|
||||
`x[2]` would look the same, but would contain the second packet. `x[1:2]` would be a pepper shaker containing two pepper packets.
|
||||
|
@ -150,13 +147,13 @@ embed_jpg("images/pepper-1.jpg", 300)
|
|||
`x[[1]]` is:
|
||||
|
||||
```{r, echo = FALSE}
|
||||
embed_jpg("images/pepper-2.jpg", 300)
|
||||
knitr::include_graphics("images/pepper-2.jpg")
|
||||
```
|
||||
|
||||
If you wanted to get the content of the pepper packet, you'd need `x[[1]][[1]]`:
|
||||
|
||||
```{r, echo = FALSE}
|
||||
embed_jpg("images/pepper-3.jpg", 300)
|
||||
knitr::include_graphics("images/pepper-3.jpg")
|
||||
```
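
The pepper shaker analogy can be checked directly in R. The sketch below uses a hypothetical list `x` (the names and contents are invented for illustration) in which each packet is itself a list holding its contents.

```r
# Hypothetical stand-in for the pepper shaker: a list of three packets,
# where each packet is itself a list holding its contents.
x <- list(list("pepper 1"), list("pepper 2"), list("pepper 3"))

shaker_with_one_packet <- x[1]         # `[` returns a smaller list
first_packet           <- x[[1]]       # `[[` returns the element itself
packet_contents        <- x[[1]][[1]]  # drill down one more level

str(shaker_with_one_packet)
str(first_packet)
str(packet_contents)
```

Comparing the three `str()` outputs shows how each extra `[[` removes one level of the hierarchy, while `[` always keeps the outer list.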
|
||||
|
||||
### Exercises
|
||||
|
@ -508,7 +505,9 @@ flatten_dbl(y)
|
|||
|
||||
Graphically, that sequence of operations looks like:
|
||||
|
||||
`r bookdown::embed_png("diagrams/lists-flatten.png", dpi = 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("diagrams/lists-flatten.png")
|
||||
````
|
||||
|
||||
Whenever I get confused about a sequence of flattening operations, I'll often draw a diagram like this to help me understand what's going on.
|
||||
|
||||
|
@ -529,7 +528,9 @@ x %>% transpose() %>% str()
|
|||
|
||||
Graphically, this looks like:
|
||||
|
||||
`r bookdown::embed_png("diagrams/lists-transpose.png", dpi = 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("diagrams/lists-transpose.png")
|
||||
```
|
||||
|
||||
You'll see an example of this in the next section, as `transpose()` is particularly useful in conjunction with adverbs like `safely()` and `quietly()`.
|
||||
|
||||
|
@ -638,7 +639,9 @@ map2(mu, sigma, rnorm, n = 10)
|
|||
|
||||
`map2()` generates this series of function calls:
|
||||
|
||||
`r bookdown::embed_png("diagrams/lists-map2.png", dpi = 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("diagrams/lists-map2.png")
|
||||
```
|
||||
|
||||
The arguments that vary for each call come before the function name, and arguments that are the same for every function call come afterwards.
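
To make that series of calls concrete, here is a small sketch that writes them out by hand. It assumes illustrative values for `mu` and `sigma`, and it uses base R's `Map()` as a stand-in for `map2()` so that it runs without purrr installed; it is not the chapter's own code.

```r
# Hypothetical inputs; the values of mu and sigma are assumptions.
mu <- c(5, 10, -3)
sigma <- c(1, 5, 10)

# The calls written out by hand: the varying arguments come first,
# and the constant argument (n = 10) comes last.
set.seed(1014)
by_hand <- list(
  rnorm(mu[1], sigma[1], n = 10),
  rnorm(mu[2], sigma[2], n = 10),
  rnorm(mu[3], sigma[3], n = 10)
)

# Base R's Map() used as a stand-in for purrr's map2(): it walks over
# mu and sigma in parallel and passes n = 10 to every call.
set.seed(1014)
mapped <- Map(function(m, s) rnorm(m, s, n = 10), mu, sigma)

all.equal(by_hand, unname(mapped))
```

Because `n` is matched by name, the two positional arguments fall through to `mean` and `sd`, which is what lets the varying arguments be supplied first.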
|
||||
|
||||
|
@ -664,7 +667,9 @@ args1 %>% pmap(rnorm) %>% str()
|
|||
|
||||
That looks like:
|
||||
|
||||
`r bookdown::embed_png("diagrams/lists-pmap-unnamed.png", dpi = 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("diagrams/lists-pmap-unnamed.png")
|
||||
```
|
||||
|
||||
However, instead of relying on position matching, it's better to name the arguments. This is more verbose, but it makes the code clearer.
|
||||
|
||||
|
@ -675,7 +680,9 @@ args2 %>% pmap(rnorm) %>% str()
|
|||
|
||||
That generates longer, but safer, calls:
|
||||
|
||||
`r bookdown::embed_png("diagrams/lists-pmap-named.png", dpi = 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("diagrams/lists-pmap-named.png")
|
||||
```
|
||||
|
||||
Since the arguments are all the same length, it makes sense to store them in a data frame:
|
||||
|
||||
|
@ -706,7 +713,9 @@ To handle this case, you can use `invoke_map()`:
|
|||
invoke_map(f, param, n = 5) %>% str()
|
||||
```
|
||||
|
||||
`r bookdown::embed_png("diagrams/lists-invoke.png")`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("diagrams/lists-invoke.png")
|
||||
```
|
||||
|
||||
The first argument is a list of functions or character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.
|
||||
|
||||
|
|
|
@ -1,12 +1,6 @@
|
|||
---
|
||||
layout: default
|
||||
title: Model assessment
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
# Model assessment
|
||||
|
||||
```{r setup, include=FALSE}
|
||||
```{r setup-model, include=FALSE}
|
||||
library(purrr)
|
||||
set.seed(1014)
|
||||
options(digits = 3)
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
---
|
||||
layout: default
|
||||
title: Models and visualisation
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
# Model visualisation
|
||||
|
||||
Gap minder
|
||||
|
|
|
@ -1,9 +1,3 @@
|
|||
---
|
||||
layout: default
|
||||
title: Model
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
# Model
|
||||
|
||||
After reading this chapter, what can you do that you couldn't before?
|
||||
|
|
|
@ -1,11 +1,4 @@
|
|||
---
|
||||
layout: default
|
||||
title: R Markdown
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
# RMarkdown
|
||||
|
||||
# R Markdown
|
||||
|
||||
Recommendations for learning more about communication:
|
||||
|
||||
|
|
|
@ -1,7 +1 @@
|
|||
---
|
||||
layout: default
|
||||
title: Shiny
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
# Shiny
|
||||
|
|
43
strings.Rmd
|
@ -1,13 +1,6 @@
|
|||
---
|
||||
layout: default
|
||||
title: String manipulation
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
# String manipulation
|
||||
|
||||
```{r setup, include=FALSE}
|
||||
knitr::opts_chunk$set(echo = TRUE)
|
||||
```{r setup-strings, include = FALSE}
|
||||
library(stringr)
|
||||
|
||||
common <- rcorpora::corpora("words/common")$commonWords
|
||||
|
@ -71,8 +64,8 @@ str_length(NA)
|
|||
|
||||
The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr functions:
|
||||
|
||||
```{r}
|
||||
bookdown::embed_png("screenshots/stringr-autocomplete.png", dpi = 220)
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("screenshots/stringr-autocomplete.png")
|
||||
```
|
||||
|
||||
### Combining strings
|
||||
|
@ -199,20 +192,20 @@ To learn regular expressions, we'll use `str_show()` and `str_show_all()`. These
|
|||
|
||||
The simplest patterns match exact strings:
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
x <- c("apple", "banana", "pear")
|
||||
str_view(x, "an")
|
||||
```
|
||||
|
||||
The next step up in complexity is `.`, which matches any character (except a new line):
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
str_view(x, ".a.")
|
||||
```
|
||||
|
||||
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need to use `\\.`.
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
# To create the regular expression, we need \\
|
||||
dot <- "\\."
|
||||
|
||||
|
@ -225,7 +218,7 @@ str_view(c("abc", "a.c", "bef"), "a\\.c")
|
|||
|
||||
If `\` is used as an escape character, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"` - you need four backslashes to match one!
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
x <- "a\\b"
|
||||
writeLines(x)
|
||||
|
||||
|
@ -250,7 +243,7 @@ By default, regular expressions will match any part of a string. It's often usef
|
|||
* `^` to match the start of the string.
|
||||
* `$` to match the end of the string.
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
x <- c("apple", "banana", "pear")
|
||||
str_view(x, "^a")
|
||||
str_view(x, "a$")
|
||||
|
@ -260,7 +253,7 @@ To remember which is which, try this mneomic which I learned from [Evan Misshula
|
|||
|
||||
To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
x <- c("apple pie", "apple", "apple cake")
|
||||
str_view(x, "apple")
|
||||
str_view(x, "^apple$")
|
||||
|
@ -301,13 +294,13 @@ Remember, to create a regular expression containing `\d` or `\s`, you'll need to
|
|||
|
||||
You can use _alternation_ to pick between one or more alternative patterns. For example, `abc|d..f` will match either `"abc"` or `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches either `abc` or `xyz`, not `abcyz` or `abxyz`:
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
str_view(c("abc", "xyz"), "abc|xyz")
|
||||
```
|
||||
|
||||
As with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
str_view(c("grey", "gray"), "gr(e|a)y")
|
||||
```
|
||||
|
||||
|
@ -373,7 +366,7 @@ Note that the precedence of these operators are high, so you can write: `colou?r
|
|||
|
||||
You learned about parentheses earlier as a way to disambiguate complex expressions. They do one other special thing: they also define numeric groups that you can refer to with _backreferences_, `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a repeated pair of letters.
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
str_view(fruit, "(..)\\1", match = TRUE)
|
||||
```
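
The same backreference can be tried without stringr. The sketch below is an assumption-laden stand-in: it uses base R's `grepl()` with `perl = TRUE` and a short, made-up vector of fruit names rather than stringr's `fruit` data.

```r
# Hypothetical fruit names; the chapter itself uses stringr's `fruit` vector.
fruits <- c("banana", "coconut", "apple", "pear")

# "(..)" captures any two characters; "\\1" requires the same two
# characters to appear again immediately afterwards.
has_repeated_pair <- grepl("(..)\\1", fruits, perl = TRUE)

fruits[has_repeated_pair]
```

Here "banana" qualifies through "anan" and "coconut" through "coco", while the other two names contain no immediately repeated pair.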
|
||||
|
||||
|
@ -461,7 +454,7 @@ mean(str_count(common, "[aeiou]"))
|
|||
|
||||
Note that matches never overlap. For example, in `"abababa"`, how many times will the pattern `"aba"` match? Regular expressions say two, not three:
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
str_count("abababa", "aba")
|
||||
str_view_all("abababa", "aba")
|
||||
```
|
||||
|
@ -510,7 +503,7 @@ head(matches)
|
|||
|
||||
Note that `str_extract()` only extracts the first match. We can see that most easily by first selecting all the sentences that have more than 1 match:
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
more <- sentences[str_count(sentences, colour_match) > 1]
|
||||
str_view_all(more, colour_match)
|
||||
|
||||
|
@ -646,7 +639,7 @@ fields %>% str_split(": ", n = 2, simplify = TRUE)
|
|||
|
||||
Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
x <- "This is a sentence. This is another sentence."
|
||||
str_view_all(x, boundary("word"))
|
||||
|
||||
|
@ -683,7 +676,7 @@ You can use the other arguments of `regex()` to control details of the match:
|
|||
* `ignore_case = TRUE` allows characters to match either their uppercase or
|
||||
lowercase forms. This always uses the current locale.
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
bananas <- c("banana", "Banana", "BANANA")
|
||||
str_view(bananas, "banana")
|
||||
str_view(bananas, regex("banana", ignore_case = TRUE))
|
||||
|
@ -692,7 +685,7 @@ You can use the other arguments of `regex()` to control details of the match:
|
|||
* `multiline = TRUE` allows `^` and `$` to match the start and end of each
|
||||
line rather than the start and end of the complete string.
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
x <- "Line 1\nLine 2\nLine 3"
|
||||
str_view_all(x, "^Line")
|
||||
str_view_all(x, regex("^Line", multiline = TRUE))
|
||||
|
@ -773,7 +766,7 @@ There are three other functions you can use instead of `regex()`:
|
|||
* As you saw with `str_split()` you can use `boundary()` to match boundaries.
|
||||
You can also use it with the other functions, all though
|
||||
|
||||
```{r}
|
||||
```{r, cache = FALSE}
|
||||
x <- "This is a sentence."
|
||||
str_view_all(x, boundary("word"))
|
||||
str_extract_all(x, boundary("word"))
|
||||
|
|
45
tidy.Rmd
|
@ -1,9 +1,3 @@
|
|||
---
|
||||
layout: default
|
||||
title: Tidy Data
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
# Tidy data
|
||||
|
||||
> "Tidy datasets are all alike but every messy dataset is messy in its
|
||||
|
@ -68,7 +62,10 @@ R follows a set of conventions that makes one layout of tabular data much easier
|
|||
|
||||
Data that satisfies these rules is known as *tidy data*. Notice that `table1` is tidy data.
|
||||
|
||||
`r bookdown::embed_png("images/tidy-1.png", 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/tidy-1.png")
|
||||
```
|
||||
|
||||
*In `table1`, each variable is placed in its own column, each observation in its own row, and each value in its own cell.*
|
||||
|
||||
Tidy data builds on a premise of data science that data sets contain *both values and relationships*. Tidy data displays the relationships in a data set as consistently as it displays the values in a data set.
|
||||
|
@ -79,7 +76,10 @@ Tidy data works well with R because it takes advantage of R's traits as a vector
|
|||
|
||||
Tidy data arranges values so that the relationships between variables in a data set will parallel the relationship between vectors in R's storage objects. R stores tabular data as a data frame, a list of atomic vectors arranged to look like a table. Each column in the table is an atomic vector in the list. In tidy data, each variable in the data set is assigned to its own column, i.e., its own vector in the data frame.
|
||||
|
||||
`r bookdown::embed_png("images/tidy-2.png", 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/tidy-2.png")
|
||||
```
|
||||
|
||||
*A data frame is a list of vectors that R displays as a table. When your data is tidy, the values of each variable fall in their own column vector.*
|
||||
|
||||
As a result, you can extract all of the values of a variable in a tidy data set by extracting the column vector that contains the variable. You can do this easily with R's list syntax, i.e.
|
||||
|
@ -111,7 +111,9 @@ table1$population / table1$cases
|
|||
|
||||
To create the output, R applies the function in element-wise fashion: R first applies the function (or operation) to the first elements of each vector involved. Then R applies the function (or operation) to the second elements of each vector involved, and so on until R reaches the end of the vectors. If one vector is shorter than the others, R will recycle its values as needed (according to a set of recycling rules).
|
||||
|
||||
`r bookdown::embed_png("images/tidy-3.png", 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/tidy-3.png")
|
||||
```
|
||||
|
||||
If your data is tidy, element-wise execution will ensure that observations are preserved across functions and operations. Each value will only be paired with other values that appear in the same row of the data frame. In a tidy data frame, these values will be values of the same observation.
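
A minimal sketch of element-wise execution and recycling, using made-up vectors in place of the `table1` columns (the numbers are invented for illustration):

```r
# Hypothetical columns of a tidy data frame; the values are made up.
cases      <- c(10, 20, 30)
population <- c(1000, 4000, 3000)

# Element-wise division pairs values from the same position (row).
# The single value 10000 is recycled to match the longer vectors.
rate <- cases / population * 10000

rate
```

Each element of `rate` is computed only from the `cases` and `population` values at the same position, which in a tidy data frame means the same observation.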
|
||||
|
||||
|
@ -129,7 +131,9 @@ If you use basic R syntax, your calculations will look like the code below. If y
|
|||
|
||||
#### Data set one
|
||||
|
||||
`r bookdown::embed_png("images/tidy-4.png", 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/tidy-4.png")
|
||||
```
|
||||
|
||||
Since `table1` is organized in a tidy fashion, you can calculate the rate like this,
|
||||
|
||||
|
@ -140,7 +144,9 @@ table1$cases / table1$population * 10000
|
|||
|
||||
#### Data set two
|
||||
|
||||
`r bookdown::embed_png("images/tidy-5.png", 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/tidy-5.png")
|
||||
```
|
||||
|
||||
Data set two intermingles the values of *population* and *cases* in the same column, *value*. As a result, you will need to untangle the values whenever you want to work with each variable separately.
|
||||
|
||||
|
@ -155,7 +161,9 @@ table2$value[case_rows] / table2$value[pop_rows] * 10000
|
|||
|
||||
#### Data set three
|
||||
|
||||
`r bookdown::embed_png("images/tidy-6.png", 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/tidy-6.png")
|
||||
```
|
||||
|
||||
Data set three combines the values of cases and population into the same cells. It may seem that this would help you calculate the rate, but that is not so. You will need to separate the population values from the cases values if you wish to do math with them. This can be done, but not with "basic" R syntax.
|
||||
|
||||
|
@ -166,7 +174,9 @@ Data set three combines the values of cases and population into the same cells.
|
|||
|
||||
#### Data set four
|
||||
|
||||
`r bookdown::embed_png("images/tidy-7.png", 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/tidy-7.png")
|
||||
```
|
||||
|
||||
Data set four stores the values of each variable in a different format: as a column, a set of column names, or a field of cells. As a result, you will need to work with each variable differently. This makes code written for data set four hard to generalize. The code that extracts the values of *year*, `names(table4)[-1]`, cannot be generalized to extract the values of population, `c(table5$1999, table5$2000, table5$2001)`. Compare this to data set one. With `table1`, you can use the same code to extract the values of year, `table1$year`, that you use to extract the values of population. To do so, you only need to change the name of the variable that you will access: `table1$population`.
|
||||
|
||||
|
@ -248,7 +258,10 @@ spread(table2, key, value)
|
|||
|
||||
`spread()` returns a copy of your data set that has had the key and value columns removed. In their place, `spread()` adds a new column for each unique key in the key column. These unique keys will form the column names of the new columns. `spread()` distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.
|
||||
|
||||
`r bookdown::embed_png("images/tidy-8.png", 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/tidy-8.png")
|
||||
```
|
||||
|
||||
*`spread()` distributes a pair of key:value columns into a field of cells. The unique keys in the key column become the column names of the field of cells.*
|
||||
|
||||
You can see that `spread()` maintains each of the relationships expressed in the original data set. The output contains the four original variables, *country*, *year*, *population*, and *cases*, and the values of these variables are grouped according to the original observations. As a bonus, now the layout of these relationships is tidy.
|
||||
|
@ -279,7 +292,9 @@ gather(table4, "year", "cases", 2:3)
|
|||
|
||||
We've placed "key" in quotation marks because you will usually use `gather()` to create tidy data. In this case, the "key" column will contain values, not keys. The values will only be keys in the sense that they were formerly in the column names, a place where keys belong.
|
||||
|
||||
`r bookdown::embed_png("images/tidy-9.png", 220)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/tidy-9.png")
|
||||
```
|
||||
|
||||
Just like `spread()`, `gather()` maintains each of the relationships in the original data set. This time `table4` only contained three variables, *country*, *year* and *cases*. Each of these appears in the output of `gather()` in a tidy fashion.
|
||||
|
||||
|
|
|
@ -1,12 +1,6 @@
|
|||
---
|
||||
layout: default
|
||||
title: Data transformation
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
# Data transformation {#transform}
|
||||
|
||||
```{r setup, include=FALSE}
|
||||
```{r setup-transform, include=FALSE}
|
||||
library(dplyr)
|
||||
library(nycflights13)
|
||||
source("common.R")
|
||||
|
|
|
@ -1,12 +1,6 @@
|
|||
---
|
||||
layout: default
|
||||
title: Data Visualization
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
# Data visualisation
|
||||
|
||||
```{r setup, include = FALSE}
|
||||
```{r setup-visualise, include = FALSE}
|
||||
knitr::opts_chunk$set(
|
||||
cache = TRUE,
|
||||
fig.path = "figures/"
|
||||
|
@ -96,7 +90,9 @@ The graph shows a negative relationship between engine size (`displ`) and fuel e
|
|||
|
||||
One group of points seems to fall outside of the linear trend. These cars have a higher mileage than you might expect. Can you tell why? Before we examine these cars, let's review the code that made our graph.
|
||||
|
||||
`r bookdown::embed_png("images/visualization-1.png", dpi = 300)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/visualization-1.png")
|
||||
```
|
||||
|
||||
#### Template
|
||||
|
||||
|
@ -134,7 +130,9 @@ You can add a third value, like `class`, to a two dimensional scatterplot by map
|
|||
|
||||
An aesthetic is a visual property of the points in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word "value" to describe data, let's use the word "level" to describe aesthetic properties. Here we change the levels of a point's size, shape, and color to make the point small, triangular, or blue.
|
||||
|
||||
`r bookdown::embed_png("images/visualization-2.png", dpi = 300)`
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("images/visualization-2.png")
|
||||
```
|
||||
|
||||
You can convey information about your data by mapping the aesthetics in your plot to the variables in your data set. For example, we can map the colors of our points to the `class` variable. Then the color of each point will reveal its class affiliation.
|
||||
|
||||
|
@ -304,8 +302,6 @@ In practice, `ggplot2` will automatically detect when it needs to group the data
|
|||
|
||||
***
|
||||
|
||||
`r bookdown::embed_png("images/blank.png", dpi = 300)`
|
||||
|
||||
***
|
||||
|
||||
#### Layers
|
||||
|
@ -532,12 +528,8 @@ Some graphs, like scatterplots, plot the raw values of your data set. Other grap
|
|||
|
||||
`ggplot2` calls the algorithm that a graph uses to transform raw data a _stat_, which is short for statistical transformation. Each geom in `ggplot2` is associated with a default stat that it uses to plot your data. `geom_bar()` uses the "count" stat, which computes a data set of counts for each x value from your raw data. `geom_bar()` then uses this computed data to make the plot.
|
||||
|
||||
`r bookdown::embed_png("images/blank.png", dpi = 300)`
|
||||
|
||||
A few geoms, like `geom_point()`, plot your raw data as it is. To keep things simple, let's imagine that these geoms also transform the data. They just use a very lame transformation, the identity transformation, which returns the data in its original state. Now we can say that _every_ geom uses a stat.
|
||||
|
||||
`r bookdown::embed_png("images/blank.png", dpi = 300)`
|
||||
|
||||
You can learn which stat a geom uses, as well as what variables it computes by visiting the geom's help page. For example, the help page of `geom_bar()` shows that it uses the count stat and that the count stat computes two new variables, `count` and `prop`. If you have an R session open---and you should!---you can verify this by running `?geom_bar` at the command line.
|
||||
|
||||
Stats are the most subtle part of plotting because you do not see them in action. `ggplot2` applies the transformation and stores the results behind the scenes. You only see the finished plot. Moreover, `ggplot2` applies stats automatically, with a very intuitive set of defaults. So why bother thinking about them? Because you can use stats to do three very useful things.
|
||||
|
@ -589,7 +581,6 @@ Use consideration when you change a geom's stat. Many combinations of geoms and
|
|||
|
||||
***
|
||||
|
||||
`r bookdown::embed_png("images/blank.png", dpi = 300)`
|
||||
|
||||
***
|
||||
|
||||
|
@ -638,8 +629,6 @@ ggplot(data = diamonds) +
|
|||
|
||||
***
|
||||
|
||||
`r bookdown::embed_png("images/blank.png", dpi = 300)`
|
||||
|
||||
***
|
||||
|
||||
***
|
||||
|
@ -724,8 +713,6 @@ To see how this works, consider how you could build a basic plot from scratch: y
|
|||
|
||||
***
|
||||
|
||||
`r bookdown::embed_png("images/blank.png", dpi = 300)`
|
||||
|
||||
***
|
||||
|
||||
Although this method may seem complicated, you could use it to build _any_ plot that you imagine. In other words, you can use the code template that you've learned in this chapter to build hundreds of thousands of unique plots.
|
||||
|
|