Fix typos and UK spellings (#1314)
parent e3cb556344
commit 89f587b75e
@@ -142,8 +142,8 @@ This means that:
 In contrast, parquet files store data in a way that records the type along with the data.
 
 - Parquet files are "column-oriented".
-This means that they're organised column-by-column, much like R's data frame.
-This typically leads to better performance for data analysis tasks compared to CSV files, which are organised row-by-row.
+This means that they're organized column-by-column, much like R's data frame.
+This typically leads to better performance for data analysis tasks compared to CSV files, which are organized row-by-row.
 
 - Parquet files are "chunked", which makes it possible to work on different parts of the file at the same time, and, if you're lucky, to skip some chunks all together.
 
@@ -123,7 +123,7 @@ table1 |>
 table1 |>
 count(year, wt = cases)
 
-# Visualise changes over time
+# Visualize changes over time
 ggplot(table1, aes(x = year, y = cases)) +
 geom_line(aes(group = country), color = "grey50") +
 geom_point(aes(color = country, shape = country)) +
@@ -132,7 +132,7 @@ ggplot(table1, aes(x = year, y = cases)) +
 
 ### Exercises
 
-1. Using words, describe how the variables and observations are organised in each of the sample tables.
+1. Using words, describe how the variables and observations are organized in each of the sample tables.
 
 2. Sketch out the process you'd use to calculate the `rate` for `table2` and `table3`.
 You will need to perform four operations:
@@ -150,7 +150,7 @@ The principles of tidy data might seem so obvious that you wonder if you'll ever
 Unfortunately, however, most real data is untidy.
 There are two main reasons:
 
-1. Data is often organised to facilitate some goal other than analysis.
+1. Data is often organized to facilitate some goal other than analysis.
 For example, it's common for data to be structured to make data entry, not analysis, easy.
 
 2. Most people aren't familiar with the principles of tidy data, and it's hard to derive them yourself unless you spend a lot of time working with data.
@@ -243,7 +243,7 @@ The code is shown below and the result is @fig-billboard-ranks.
 #| A line plot with week on the x-axis and rank on the y-axis, where
 #| each line represents a song. Most songs appear to start at a high rank,
 #| rapidly accelerate to a low rank, and then decay again. There are
-#| suprisingly few tracks in the region when week is >20 and rank is
+#| surprisingly few tracks in the region when week is >20 and rank is
 #| >50.
 
 billboard_tidy |>
@@ -448,7 +448,7 @@ We'll start by looking at `cms_patient_experience`, a dataset from the Centers o
 cms_patient_experience
 ```
 
-An observation is an organisation, but each organisation is spread across six rows, with one row for each variable, or measure.
+An observation is an organization, but each organization is spread across six rows, with one row for each variable, or measure.
 We can see the complete set of values for `measure_cd` and `measure_title` by using `distinct()`:
 
 ```{r}
@@ -470,7 +470,7 @@ cms_patient_experience |>
 ```
 
 The output doesn't look quite right; we still seem to have multiple rows for each organization.
-That's because, by default, `pivot_wider()` will attempt to preserve all the existing columns including `measure_title` which has six distinct observations for each organisations.
+That's because, by default, `pivot_wider()` will attempt to preserve all the existing columns including `measure_title` which has six distinct observations for each organizations.
 To fix this problem we need to tell `pivot_wider()` which columns identify each row; in this case those are the variables starting with `"org"`:
 
 ```{r}
@@ -9,7 +9,7 @@ status("complete")
 
 ## Introduction
 
-Visualisation is an important tool for generating insight, but it's rare that you get the data in exactly the right form you need for it.
+Visualization is an important tool for generating insight, but it's rare that you get the data in exactly the right form you need for it.
 Often you'll need to create some new variables or summaries to see the most important patterns, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with.
 You'll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the **dplyr** package and a new dataset on flights that departed New York City in 2013.
 
@@ -90,7 +90,7 @@ flights |>
 The code starts with the `flights` dataset, then filters it, then groups it, then summarizes it.
 We'll come back to the pipe and its alternatives in @sec-pipes.
 
-dplyr's verbs are organised into four groups based on what they operate on: **rows**, **columns**, **groups**, or **tables**.
+dplyr's verbs are organized into four groups based on what they operate on: **rows**, **columns**, **groups**, or **tables**.
 In the following sections you'll learn the most important verbs for rows, columns, and groups, then we'll come back to verbs that work on tables in @sec-joins.
 Let's dive in!
 
@@ -522,7 +522,7 @@ daily
 ```
 
 When you summarize a tibble grouped by more than one variable, each summary peels off the last group.
-In hindsight, this wasn't great way to make this function work, but it's difficult to change without breaking existing code.
+In hindsight, this wasn't a great way to make this function work, but it's difficult to change without breaking existing code.
 To make it obvious what's happening, dplyr displays a message that tells you how you can change this behavior:
 
 ```{r}
@@ -53,7 +53,7 @@ There are three high level differences between data frames and database tables:
 
 Databases are run by database management systems (**DBMS**'s for short), which come in three basic forms:
 
-- **Client-server** DBMS's run on a powerful central server, which you connect from your computer (the client). They are great for sharing data with multiple people in an organisation. Popular client-server DBMS's include PostgreSQL, MariaDB, SQL Server, and Oracle.
+- **Client-server** DBMS's run on a powerful central server, which you connect from your computer (the client). They are great for sharing data with multiple people in an organization. Popular client-server DBMS's include PostgreSQL, MariaDB, SQL Server, and Oracle.
 - **Cloud** DBMS's, like Snowflake, Amazon's RedShift, and Google's BigQuery, are similar to client server DBMS's, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed.
 - **In-process** DBMS's, like SQLite or duckdb, run entirely on your computer. They're great for working with large datasets where you're the primary user.
 
@@ -203,7 +203,7 @@ diamonds_db
 
 ::: callout-note
 There are two other common ways to interact with a database.
-First, many corporate databases are very large so you need some hierarchy to keep all the tables organised.
+First, many corporate databases are very large so you need some hierarchy to keep all the tables organized.
 In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you're interested in:
 
 ```{r}
@@ -680,7 +680,7 @@ We used that translation to teach you a little SQL; it's important to learn some
 If you've finished this chapter and would like to learn more about SQL.
 We have two recommendations:
 
-- [*SQL for Data Scientists*](https://sqlfordatascientists.com) by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you're likely to encounter in real organisations.
+- [*SQL for Data Scientists*](https://sqlfordatascientists.com) by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you're likely to encounter in real organizations.
 - [*Practical SQL*](https://www.practicalsql.com) by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.
 
 In the next chapter, we'll learn about another dplyr backend for working with large data: arrow.
@@ -92,7 +92,7 @@ csv <- "
 read_csv(csv)
 ```
 
-If you haven't heard of **ISO8601** before, it's an international standard[^datetimes-2] for writing dates where the components of a date are organised from biggest to smallest separated by `-`. For example, in ISO8601 March 5 2022 is `2022-05-03`. ISO8601 dates can also include times, where hour, minute, and second are separated by `:`, and the date and time components are separated by either a `T` or a space.
+If you haven't heard of **ISO8601** before, it's an international standard[^datetimes-2] for writing dates where the components of a date are organized from biggest to smallest separated by `-`. For example, in ISO8601 March 5 2022 is `2022-05-03`. ISO8601 dates can also include times, where hour, minute, and second are separated by `:`, and the date and time components are separated by either a `T` or a space.
 For example, you could write 4:26pm on March 5 2022 as either `2022-05-03 16:26` or `2022-05-03T16:26`.
 
 [^datetimes-2]: <https://xkcd.com/1179/>
@@ -264,8 +264,8 @@ This makes the plot easier to read because the colors of the line at the far rig
 #|
 #| Rearranging the legend makes the plot easier to read because the
 #| legend colors now match the order of the lines on the far right
-#| of the plot. You can see some unsuprising patterns: the proportion
-#| never marred decreases with age, married forms an upside down U
+#| of the plot. You can see some unsurprising patterns: the proportion
+#| never married decreases with age, married forms an upside down U
 #| shape, and widowed starts off low but increases steeply after age
 #| 60.
 by_age <- gss_cat |>
@@ -195,7 +195,7 @@ Surrogate keys can be particular useful when communicating to other humans: it's
 
 5. Draw a diagram illustrating the connections between the `Batting`, `People`, and `Salaries` data frames in the Lahman package.
 Draw another diagram that shows the relationship between `People`, `Managers`, `AwardsManagers`.
-How would you characterise the relationship between the `Batting`, `Pitching`, and `Fielding` data frames?
+How would you characterize the relationship between the `Batting`, `Pitching`, and `Fielding` data frames?
 
 ## Basic joins {#sec-mutating-joins}
 
@@ -160,7 +160,7 @@ You'll need to pick a value that makes sense for that aesthetic:
 #| fig.asp: 0.364
 #| fig-align: "center"
 #| fig-cap: >
-#| R has 25 built in shapes that are identified by numbers. There are some
+#| R has 25 built-in shapes that are identified by numbers. There are some
 #| seeming duplicates: for example, 0, 15, and 22 are all squares. The
 #| difference comes from the interaction of the `color` and `fill`
 #| aesthetics. The hollow shapes (0--14) have a border determined by `color`;
@@ -1037,9 +1037,9 @@ If you'd like to learn more about the theoretical underpinnings of ggplot2, you
 
 ## Summary
 
-In this chapter you learned about the layered grammar of graphics starting with aesthetics and geometries to build a simple plot, facets for splitting the plot into subsets, statistics for understanding how geoms are calculated, position adjustments for controlling the fine details of position when geoms might otherwise overlap, and coordinate systems allow you fundamentally change what `x` and `y` mean.
+In this chapter you learned about the layered grammar of graphics starting with aesthetics and geometries to build a simple plot, facets for splitting the plot into subsets, statistics for understanding how geoms are calculated, position adjustments for controlling the fine details of position when geoms might otherwise overlap, and coordinate systems which allow you to fundamentally change what `x` and `y` mean.
 One layer we have not yet touched on is theme, which we will introduce in @sec-themes.
 
-Two very useful resources for getting an overview of the complete ggplot2 functionality are the ggplot2 cheatsheet (which you can find at <https://posit.co/resources/cheatsheets> ) and the ggplot2 package website ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org/)).
+Two very useful resources for getting an overview of the complete ggplot2 functionality are the ggplot2 cheatsheet (which you can find at <https://posit.co/resources/cheatsheets>) and the ggplot2 package website ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org/)).
 
 An important lesson you should take from this chapter is that when you feel the need for a geom that is not provided by ggplot2, it's always a good idea to look into whether someone else has already solved your problem by creating a ggplot2 extension package that offers that geom.
@@ -584,7 +584,7 @@ The median delay is always smaller than the mean delay because flights sometimes
 ```{r}
 #| label: fig-mean-vs-median
 #| fig-cap: >
-#| A scatterplot showing the differences of summarising hourly depature
+#| A scatterplot showing the differences of summarizing hourly depature
 #| delay with median instead of mean.
 #| fig-alt: >
 #| All points fall below a 45° line, meaning that the median delay is
@@ -28,10 +28,10 @@ knitr::include_graphics("diagrams/data-science/whole-game.png", dpi = 270)
 
 Five chapters focus on the tools of data science:
 
-- Visualisation is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data.
+- Visualization is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data.
 In @sec-data-visualization you'll dive into visualization, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.
 
-- Visualisation alone is typically not enough, so in @sec-data-transform, you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
+- Visualization alone is typically not enough, so in @sec-data-transform, you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
 
 - In @sec-data-tidy, you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualization, and modelling easier.
 You'll learn the underlying principles, and how to get your data into a tidy form.
@@ -41,5 +41,5 @@ Five chapters focus on the tools of data science:
 
 Nestled among these chapters are five other chapters that focus on your R workflow.
 In @sec-workflow-basics, @sec-workflow-pipes, @sec-workflow-style, and @sec-workflow-scripts-projects you'll learn good workflow practices for writing and organizing your R code.
-These will set you up for success in the long run, as they'll give you the tools to stay organised when you tackle real projects.
+These will set you up for success in the long run, as they'll give you the tools to stay organized when you tackle real projects.
 Finally, @sec-workflow-getting-help will teach you how to get help to keep learning.
@@ -206,7 +206,7 @@ We collectively use this pattern hundreds of times a week.
 If you're using RStudio server, your R session is never restarted by default.
 When you close your RStudio server tab, it might feel like you're closing R, but the server actually keeps it running in the background.
 The next time you return, you'll be in exactly the same place you left.
-This makes it even more important to regularly restart R so that you're starting with a refresh slate.
+This makes it even more important to regularly restart R so that you're starting with a clean slate.
 :::
 
 ### Where does your analysis live?