Fix minor plaintext typos (#1085)
Co-authored-by: Betul Turkoglu <79486373+betulturkoglu@users.noreply.github.com>
parent 5c5774f86d
commit 7db1d7018a
2
EDA.qmd
@@ -415,7 +415,7 @@ nycflights13::flights |>
 cancelled = is.na(dep_time),
 sched_hour = sched_dep_time %/% 100,
 sched_min = sched_dep_time %% 100,
-sched_dep_time = sched_hour + sched_min / 60
+sched_dep_time = sched_hour + (sched_min / 60)
 ) |>
 ggplot(mapping = aes(sched_dep_time)) +
 geom_freqpoly(mapping = aes(color = cancelled), binwidth = 1/4)
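A note on the hunk above: the added parentheses look cosmetic, since `/` already binds tighter than `+` in R. The sketch below uses made-up departure times (the vector is illustrative and not taken from `flights`) to show the hour/minute split and to check that both spellings give the same result.

```r
# Hypothetical scheduled departure times in HHMM form (not real flights data)
sched_dep_time <- c(517, 1345, 2359)

# Integer division and remainder split HHMM into hours and minutes
sched_hour <- sched_dep_time %/% 100
sched_min <- sched_dep_time %% 100

# Fractional hours, written without and with the parentheses
without_parens <- sched_hour + sched_min / 60
with_parens <- sched_hour + (sched_min / 60)

# Both forms evaluate the division first, so the results match exactly
identical(without_parens, with_parens)
```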
@@ -476,7 +476,7 @@ ggplot(mpg, aes(displ, hwy)) +
 scale_colour_brewer(palette = "Set1")
 ```
 
-The ColorBrewer scales are documented online at <http://colorbrewer2.org/> and made available in R via the **RColorBrewer** package, by Erich Neuwirth.
+The ColorBrewer scales are documented online at <https://colorbrewer2.org/> and made available in R via the **RColorBrewer** package, by Erich Neuwirth.
 @fig-brewer shows the complete list of all palettes.
 The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a "middle".
 This often arises if you've used `cut()` to make a continuous variable into a categorical variable.
@@ -610,7 +610,7 @@ Many geoms, like `geom_smooth()`, use a single geometric object to display multi
 For these geoms, you can set the `group` aesthetic to a categorical variable to draw multiple objects.
 ggplot2 will draw a separate object for each unique value of the grouping variable.
 In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the `linetype` example).
-It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.
+It is convenient to rely on this feature because the `group` aesthetic by itself does not add a legend or distinguishing features to the geoms.
 
 ```{r}
 #| layout-ncol: 3
@@ -10,7 +10,7 @@ status("polishing")
 ## Introduction
 
 A huge amount of data lives in databases, so it's essential that you know how to access it.
-Sometimes you can ask someone to download a snapshot into a .csv for you, but this gets painful quickly: every time you need to make a change you'll have to communicate with another human.
+Sometimes you can ask someone to download a snapshot into a `.csv` for you, but this gets painful quickly: every time you need to make a change you'll have to communicate with another human.
 You want to be able to reach into the database directly to get the data you need, when you need it.
 
 In this chapter, you'll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL[^databases-1] query.
@@ -111,7 +111,7 @@ ymd("2017-01-31", tz = "UTC")
 ### From individual components
 
 Instead of a single string, sometimes you'll have the individual components of the date-time spread across multiple columns.
-This is what we have in the flights data:
+This is what we have in the `flights` data:
 
 ```{r}
 flights |>
@@ -630,7 +630,7 @@ Over the course of decades, countries change names (or break apart) fairly frequ
 Another problem is that the name needs to reflect not only the current behavior, but also the complete history.
 For example, there are time zones for both "America/New_York" and "America/Detroit".
 These cities both currently use Eastern Standard Time but in 1969-1972 Michigan (the state in which Detroit is located), did not follow DST, so it needs a different name.
-It's worth reading the raw time zone database (available at <http://www.iana.org/time-zones>) just to read some of these stories!
+It's worth reading the raw time zone database (available at <https://www.iana.org/time-zones>) just to read some of these stories!
 
 You can find out what R thinks your current time zone is with `Sys.timezone()`:
 
@@ -507,7 +507,7 @@ if (y < 20) {
 If the number is divisible by three, it returns "fizz".
 If it's divisible by five it returns "buzz".
 If it's divisible by three and five, it returns "fizzbuzz".
-Otherwise, it returns the number.
+Otherwise, it returns the number itself.
 Make sure you first write working code before you create the function.
 
 4. How could you use `cut()` to simplify this set of nested if-else statements?
@@ -568,7 +568,7 @@ Detail arguments should go on the end, and usually should have default values.
 You specify a default value in the same way you call a function with a named argument:
 
 ```{r}
-# Compute confidence interval around mean using normal approximation
+# Compute confidence interval around the mean using normal approximation
 mean_ci <- function(x, conf = 0.95) {
 se <- sd(x) / sqrt(length(x))
 alpha <- 1 - conf
@@ -717,7 +717,7 @@ wt_mean(1:6, 6:1, na.rm = "foo")
 
 Note that when using `stopifnot()` you assert what should be true rather than checking for what might be wrong.
 
-### Dot-dot-dot (...)
+### Dot-dot-dot (`...`)
 
 Many functions in R take an arbitrary number of inputs:
 
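The context line about `stopifnot()` in the hunk above can be illustrated with a short sketch. The `wt_mean()` below is a simplified, hypothetical version written for this note (it has no `na.rm` argument, unlike the call shown in the hunk header); it only shows the "assert what should be true" style.

```r
# Simplified, hypothetical weighted mean (no na.rm handling)
wt_mean <- function(x, w) {
  # State what must be true; stopifnot() errors on the first failed condition
  stopifnot(is.numeric(x), is.numeric(w), length(x) == length(w))
  sum(w * x) / sum(w)
}

# Valid input: weights and values have the same length
wt_mean(1:6, 6:1)

# A length mismatch raises an error instead of being recycled silently;
# tryCatch() captures the message so the script keeps running
result <- tryCatch(wt_mean(1:6, 1:3), error = function(e) conditionMessage(e))
result
```

The captured message names the condition that failed, which is usually enough to locate the bad argument.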
@@ -79,7 +79,7 @@ Every `for` loop has three components:
 This is very important for efficiency: if you grow the `for` loop at each iteration using `c()` (for example), your `for` loop will be very slow.
 
 A general way of creating an empty vector of given length is the `vector()` function.
-It has two arguments: the type of the vector ("logical", "integer", "double", "character", etc) and the length of the vector.
+It has two arguments: the type of the vector ("logical", "integer", "double", "character", etc.) and the length of the vector.
 
 2. The **sequence**: `i in seq_along(df)`.
 This determines what to loop over: each run of the `for` loop will assign `i` to a different value from `seq_along(df)`.
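For readers of this hunk, a minimal sketch of the preallocation pattern the context lines describe. The data frame here is a made-up stand-in for the chapter's `df`, and `median()` is just an example summary.

```r
# Hypothetical data frame standing in for the `df` the text loops over
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Preallocate the output: a double vector with one slot per column
output <- vector("double", ncol(df))

for (i in seq_along(df)) {
  # Fill slot i in place instead of growing the vector with c()
  output[[i]] <- median(df[[i]])
}

output
```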
@@ -843,7 +843,7 @@ As soon as your code gets complicated, we think a data frame is a good approach
 ## Walk {#sec-walk}
 
 Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value.
-You typically do this because you want to render output to the screen or save files to disk - the important thing is the action, not the return value.
+You typically do this because you want to render output to the screen or save files to disk --- the important thing is the action, not the return value.
 Here's a very simple example:
 
 ```{r}
@@ -95,7 +95,7 @@ You don't need to understand the whole thing; you just need to understand the ch
 #| echo: false
 #| out-width: ~
 #| fig-cap: >
-#| Connections between all five data frames in the nycflights package.
+#| Connections between all five data frames in the nycflights13 package.
 #| Variables making up a primary key are coloured grey, and are connected
 #| to their correpsonding foreign keys with arrows.
 #| fig-alt: >
@@ -875,7 +875,7 @@ Overlap joins provide three helpers that use inequality joins to make it easier
 
 Let's continue the birthday example to see how you might use them.
 There's one problem with the strategy we used above: there's no party preceding the birthdays Jan 1-9.
-So it might be better to to be explicit about the date ranges that each party spans, and make a special case for those early bithdays:
+So it might be better to to be explicit about the date ranges that each party spans, and make a special case for those early birthdays:
 
 ```{r}
 parties <- tibble(
@@ -432,7 +432,7 @@ There are two important tools for this: `if_else()` and `case_when()`.
 ### `if_else()`
 
 If you want to use one value when a condition is true and another value when it's `FALSE`, you can use `dplyr::if_else()`[^logicals-4].
-You'll always use the first three argument of `if_else(`).
+You'll always use the first three argument of `if_else()`.
 The first argument, `condition`, is a logical vector, the second, `true`, gives the output when the condition is true, and the third, `false`, gives the output if the condition is false.
 
 [^logicals-4]: dplyr's `if_else()` is very similar to base R's `ifelse()`.
@@ -468,7 +468,7 @@ if_else(is.na(x1), y1, x1)
 ```
 
 You might have noticed a small infelicity in our labeling: zero is neither positive nor negative.
-We could resolve this by adding an additional `if_else():`
+We could resolve this by adding an additional `if_else()`:
 
 ```{r}
 if_else(x == 0, "0", if_else(x < 0, "-ve", "+ve"), "???")
@@ -156,7 +156,7 @@ flights |>
 
 The code runs without error, but it doesn't return what you want.
 Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February.
-And unforuntately there's no warning because `nycflights` has an even number of rows.
+And unforuntately there's no warning because `flights` has an even number of rows.
 
 To protect you from this type of silent failure, most tidyverse functions use a stricter form of recycling that only recycles single values.
 Unfortunately that doesn't help here, or in many other cases, because the key computation is performed by the base R function `==`, not `filter()`.
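The silent recycling described in this hunk can be reproduced in base R without the flights data. The `month` vector below is invented for illustration; `%in%` is base R's membership operator and is shown here as one way to express the intended test, not as something this commit changes.

```r
# Hypothetical month column with an even number of rows
month <- c(1, 2, 2, 1, 1, 2)

# `==` recycles c(1, 2) along month: odd positions are compared with 1,
# even positions with 2, and no warning is raised because 6 is a multiple of 2
recycled <- month == c(1, 2)

# `%in%` asks the intended question: is each value either 1 or 2?
membership <- month %in% c(1, 2)

recycled
membership
```

Positions 3 and 4 come back `FALSE` from `==` even though both hold a January or February value, which is exactly the row-position dependence the text warns about.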
@@ -607,7 +607,7 @@ What if you're interested in locations other than the center?
 `min()` and `max()` will give you the largest and smallest values.
 Another powerful tool is `quantile()` which is a generalization of the median: `quantile(x, 0.25)` will find the value of `x` that is greater than 25% of the values, `quantile(x, 0.5)` is equivalent to the median, and `quantile(x, 0.95)` will find a value that's greater than 95% of the values.
 
-For the flights data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.
+For the `flights` data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.
 
 ```{r}
 flights |>
@@ -625,7 +625,7 @@ Sometimes you're not so interested in where the bulk of the data lies, but how i
 Two commonly used summaries are the standard deviation, `sd(x)`, and the inter-quartile range, `IQR()`.
 We won't explain `sd()` here since you're probably already familiar with it, but `IQR()` might be new --- it's `quantile(x, 0.75) - quantile(x, 0.25)` and gives you the range that contains the middle 50% of the data.
 
-We can use this to reveal a small oddity in the flights data.
+We can use this to reveal a small oddity in the `flights` data.
 You might expect that the spread of the distance between origin and destination to be zero, since airports are always in the same place.
 But the code below makes it looks like one airport, [EGE](https://en.wikipedia.org/wiki/Eagle_County_Regional_Airport), might have moved.
 
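The identities quoted in this hunk and the previous one (`IQR()` as a difference of quantiles, and the 0.5 quantile as the median) are easy to check on a small sample. The vector below is made up; `names = FALSE` is a standard `quantile()` argument used here only to drop the percentage labels.

```r
# Hypothetical sample; any numeric vector works
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

# names = FALSE drops the "25%"/"75%" labels so the result is a bare number
by_hand <- quantile(x, 0.75, names = FALSE) - quantile(x, 0.25, names = FALSE)

# IQR() should agree with the hand-computed difference
all.equal(IQR(x), by_hand)

# The 0.5 quantile should agree with the median
all.equal(quantile(x, 0.5, names = FALSE), median(x))
```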
@@ -675,7 +675,7 @@ flights |>
 geom_histogram(binwidth = 15)
 
 flights |>
-filter(dep_delay <= 120) |>
+filter(dep_delay < 120) |>
 ggplot(aes(dep_delay)) +
 geom_histogram(binwidth = 5)
 ```
@@ -769,7 +769,7 @@ str(parse_json('{"x": [1, 2, 3]}'))
 ```
 
 jsonlite has another important function called `fromJSON()`.
-We don't use it here because it performs automatic simplification (`simplifyVector = TRUE)`.
+We don't use it here because it performs automatic simplification (`simplifyVector = TRUE`).
 This often works well, particularly in simple cases, but we think you're better off doing the rectangling yourself so you know exactly what's happening and can more easily handle the most complicated nested structures.
 
 ### Starting the rectangling process
@@ -78,8 +78,8 @@ This will display the report in the viewer pane, and create a self-contained HTM
 knitr::include_graphics("rmarkdown/diamond-sizes-report.png")
 ```
 
-When you **knit** the document, R Markdown sends the .Rmd file to **knitr**, <http://yihui.name/knitr/>, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output.
-The markdown file generated by knitr is then processed by **pandoc**, <http://pandoc.org/>, which is responsible for creating the finished file.
+When you **knit** the document, R Markdown sends the .Rmd file to **knitr**, <https://yihui.name/knitr/>, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output.
+The markdown file generated by knitr is then processed by **pandoc**, <https://pandoc.org/>, which is responsible for creating the finished file.
 The advantage of this two step workflow is that you can create a very wide range of output formats, as you'll learn about in \[R Markdown formats\].
 
 ```{r}
@@ -202,7 +202,7 @@ When you're in a notebook mode, the chunk named setup will be run automatically
 Chunk output can be customised with **options**, arguments supplied to chunk header.
 Knitr provides almost 60 options that you can use to customize your code chunks.
 Here we'll cover the most important chunk options that you'll use frequently.
-You can see the full list at <http://yihui.name/knitr/options/>.
+You can see the full list at <https://yihui.name/knitr/options/>.
 
 The most important set of options controls if your code block is executed and what results are inserted in the finished report:
 
@@ -432,7 +432,7 @@ params:
 
 In RStudio, you can click the "Knit with Parameters" option in the Knit dropdown menu to set parameters, render, and preview the report in a single user friendly step.
 You can customise the dialog by setting other options in the header.
-See <http://rmarkdown.rstudio.com/developer_parameterized_reports.html#parameter_user_interfaces> for more details.
+See <https://rmarkdown.rstudio.com/developer_parameterized_reports.html#parameter_user_interfaces> for more details.
 
 Alternatively, if you need to produce many such parameterised reports, you can call `rmarkdown::render()` with a list of `params`:
 
@@ -184,7 +184,7 @@ However, there are some functions that are strict about their inputs, so it's us
 3. A logical vector can take 3 possible values.
 How many possible values can an integer vector take?
 How many possible values can a double take?
-Use google to do some research.
+Use Google to do some research.
 
 4. Brainstorm at least four functions that allow you to convert a double to an integer.
 How do they differ?
@@ -737,7 +737,7 @@ attributes(tb)
 The difference between a tibble and a list is that all the elements of a data frame must be vectors with the same length.
 All functions that work with tibbles enforce this constraint.
 
-Traditional data.frames have a very similar structure:
+Traditional `data.frame`s have a very similar structure:
 
 ```{r}
 df <- data.frame(x = 1:5, y = 5:1)