Change `data set` to `dataset` (#1282)

- It changes `data set(s)` to `dataset(s)` for consistency throughout the book.
- It adds `# Left` and `# Right` comments for similar side-by-side plots.
Zeki Akyol 2023-02-14 16:32:09 +03:00 committed by GitHub
parent 61a4ce719d
commit 5cfe902d8c
7 changed files with 17 additions and 11 deletions

View File

@@ -18,7 +18,7 @@ They're a good interchange format because they're very simple and they can be re
But CSV files aren't very efficient: you have to do quite a lot of work to read the data into R.
In this chapter, you'll learn about a powerful alternative: the [parquet format](https://parquet.apache.org/), an open standards-based format widely used by big data systems.
-We'll pair parquet files with [Apache Arrow](https://arrow.apache.org), a multi-language toolbox designed for efficient analysis and transport of large data sets.
+We'll pair parquet files with [Apache Arrow](https://arrow.apache.org), a multi-language toolbox designed for efficient analysis and transport of large datasets.
We'll use Apache Arrow via the [arrow package](https://arrow.apache.org/docs/r/), which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax.
As an additional benefit, arrow is extremely fast: you'll see some examples later in the chapter.
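As a quick sketch of the overall pattern (the file path and column names below are made up for illustration), analyzing a larger-than-memory parquet dataset looks much like ordinary dplyr code:

```{r}
library(arrow)
library(dplyr)

# Open the files lazily; nothing is loaded into memory yet.
checkouts <- open_dataset("data/checkouts-parquet/", format = "parquet")

# dplyr verbs build up a query; collect() runs it and returns a small result.
checkouts |>
  filter(year >= 2020) |>
  group_by(year) |>
  summarise(total = sum(n)) |>
  collect()
```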
@@ -49,7 +49,7 @@ library(duckdb)
## Getting the data
-We begin by getting a dataset worthy of these tools: a data set of item checkouts from Seattle public libraries, available online at [data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6).
+We begin by getting a dataset worthy of these tools: a dataset of item checkouts from Seattle public libraries, available online at [data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6).
This dataset contains 41,389,465 rows that tell you how many times each book was checked out each month from April 2005 to October 2022.
The following code will get you a cached copy of the data.
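As a minimal sketch of what a cached download can look like (the URL below is a placeholder, not the dataset's real location):

```{r}
dir.create("data", showWarnings = FALSE)

# With resume = TRUE, an interrupted or repeated download picks up where it
# left off instead of starting from scratch.
curl::multi_download(
  "https://example.com/seattle-library-checkouts.csv",  # placeholder URL
  "data/seattle-library-checkouts.csv",
  resume = TRUE
)
```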
@@ -88,7 +88,7 @@ seattle_csv <- open_dataset(
(Here we've had to use some relatively advanced code to parse the ISBN variable correctly: this is because the first \~83,000 rows don't contain any data so arrow guesses the wrong types. The arrow team is aware of this problem and there will hopefully be a better approach by the time you read this chapter.)
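As a sketch of that call (the local path is hypothetical; `col_types` with an arrow `schema()` is one way to override a guessed column type):

```{r}
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",  # hypothetical local path
  format = "csv",
  col_types = schema(ISBN = string())  # read ISBN as a string instead of guessing
)
```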
What happens when this code is run?
-`open_dataset()` will scan a few thousand rows to figure out the structure of the data set.
+`open_dataset()` will scan a few thousand rows to figure out the structure of the dataset.
Then it records what it's found and stops; it will only read further rows as you specifically request them.
This metadata is what we see if we print `seattle_csv`:
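To actually pull rows into memory you have to ask for them explicitly; a minimal sketch (assuming the `seattle_csv` object created above) is:

```{r}
# Only now does arrow read from disk, and only as much as it needs
# to produce the requested rows.
seattle_csv |>
  head(10) |>
  collect()
```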
@@ -152,7 +152,7 @@ This means that:
As datasets get larger and larger, storing all the data in a single file gets increasingly painful and it's often useful to split large datasets across many files.
When this structuring is done intelligently, this strategy can lead to significant improvements in performance because many analyses will only require a subset of the files.
-There are no hard and fast rules about how to partition your data set: the results will depend on your data, access patterns, and the systems that read the data.
+There are no hard and fast rules about how to partition your dataset: the results will depend on your data, access patterns, and the systems that read the data.
You're likely to need to do some experimentation before you find the ideal partitioning for your situation.
As a rough guide, arrow suggests that you avoid files smaller than 20MB and larger than 2GB and avoid partitions that produce more than 10,000 files.
You should also try to partition by variables that you filter by; as you'll see shortly, that allows arrow to skip a lot of work by reading only the relevant files.
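As a minimal sketch of that advice (assuming the checkout data has a `CheckoutYear` column you routinely filter on), you can let arrow derive the partitions from a grouping variable:

```{r}
pq_path <- "data/seattle-library-checkouts"  # hypothetical output directory

# write_dataset() uses the group_by() variable as the partitioning scheme,
# writing one directory of parquet files per distinct CheckoutYear.
seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = pq_path, format = "parquet")
```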
@@ -247,7 +247,7 @@ seattle_csv |>
system.time()
```
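Written out, the query being timed against the raw CSV might look roughly like this sketch (the grouping and summary columns are assumptions):

```{r}
# Everything, including reading the CSV, is evaluated inside system.time().
seattle_csv |>
  group_by(CheckoutYear) |>
  summarise(TotalCheckouts = sum(Checkouts)) |>
  collect() |>
  system.time()
```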
-Now let's use our new version of the data set in which the Seattle library checkout data has been partitioned into 18 smaller parquet files:
+Now let's use our new version of the dataset in which the Seattle library checkout data has been partitioned into 18 smaller parquet files:
```{r dataset-performance-multiple-parquet}
#| cache: true

View File

@@ -774,11 +774,13 @@ Compare the following two plots:
#| fig-height: 3
#| message: false
+# Left
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
+# Right
mpg |>
filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) |>
ggplot(aes(x = displ, y = hwy)) +
@@ -799,9 +801,11 @@ For example, if we extract two classes of cars and plot them separately, it's di
suv <- mpg |> filter(class == "suv")
compact <- mpg |> filter(class == "compact")
+# Left
ggplot(suv, aes(x = displ, y = hwy, color = drv)) +
geom_point()
+# Right
ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
geom_point()
```
@@ -817,12 +821,14 @@ x_scale <- scale_x_continuous(limits = range(mpg$displ))
y_scale <- scale_y_continuous(limits = range(mpg$hwy))
col_scale <- scale_color_discrete(limits = unique(mpg$drv))
+# Left
ggplot(suv, aes(x = displ, y = hwy, color = drv)) +
geom_point() +
x_scale +
y_scale +
col_scale
+# Right
ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
geom_point() +
x_scale +

View File

@@ -663,7 +663,7 @@ diamonds |>
### More variables
It's straightforward to add more variables to the mix.
-For example, maybe you want an easy way to eyeball whether or not a data set is linear by overlaying a smooth line and a straight line:
+For example, maybe you want an easy way to eyeball whether or not a dataset is linear by overlaying a smooth line and a straight line:
```{r}
# https://twitter.com/tyler_js_smith/status/1574377116988104704
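# Sketch of one way to do this (the function name and colors are assumptions):
# overlay a flexible smooth and a straight lm fit so departures from
# linearity stand out.
linearity_check <- function(df, x, y) {
  df |>
    ggplot(aes(x = {{ x }}, y = {{ y }})) +
    geom_point() +
    geom_smooth(method = "loess", formula = y ~ x, color = "red", se = FALSE) +
    geom_smooth(method = "lm", formula = y ~ x, color = "blue", se = FALSE)
}

# Example: is highway mileage roughly linear in engine displacement?
mpg |> linearity_check(displ, hwy)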

View File

@@ -57,7 +57,7 @@ Visualizations can surprise you, and they don't scale particularly well because
**Models** are complementary tools to visualization.
Once you have made your questions sufficiently precise, you can use a model to answer them.
Models are a fundamentally mathematical or computational tool, so they generally scale well.
-Even when they don\'t, it\'s usually cheaper to buy more computers than it is to buy more brains!
+Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains!
But every model makes assumptions, and by its very nature a model cannot question its own assumptions.
That means a model cannot fundamentally surprise you.
@@ -105,7 +105,7 @@ This book doesn't teach data.table because it has a very concise interface that
However, the performance payoff is well worth the effort required to learn it if you're working with large data.
If your data is bigger than this, carefully consider whether your big data problem is actually a small data problem in disguise.
-While the complete data set might be big, often, the data needed to answer a specific question is small.
+While the complete dataset might be big, often, the data needed to answer a specific question is small.
You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in.
The challenge here is finding the right small data, which often requires a lot of iteration.
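As a small sketch of that idea (the object and column names here are hypothetical), a question like "how many checkouts per year?" needs only a tiny summary of a huge table:

```{r}
small_summary <- big_checkouts |>   # imagine tens of millions of rows
  filter(MaterialType == "BOOK") |>
  group_by(CheckoutYear) |>
  summarise(total = sum(Checkouts))

small_summary  # a handful of rows easily fits in memory
```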

View File

@@ -106,7 +106,7 @@ sqrt(-1)
So far we've talked about missing values that are **explicitly** missing, i.e. you can see an `NA` in your data.
But missing values can also be **implicitly** missing, if an entire row of data is simply absent from the data.
-Let's illustrate the difference with a simple data set that records the price of some stock each quarter:
+Let's illustrate the difference with a simple dataset that records the price of some stock each quarter:
```{r}
stocks <- tibble(
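  # Illustrative values (a sketch, not necessarily the book's numbers): the
  # 2020 Q4 price is explicitly missing (NA), while 2021 Q1 is implicitly
  # missing because its row simply doesn't appear.
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35, NA,    0.92, 0.17, 2.66)
)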

View File

@@ -224,7 +224,7 @@ In R, `%/%` does integer division and `%%` computes the remainder:
1:10 %% 3
```
-Modular arithmetic is handy for the flights dataset, because we can use it to unpack the `sched_dep_time` variable into `hour` and `minute`:
+Modular arithmetic is handy for the `flights` dataset, because we can use it to unpack the `sched_dep_time` variable into `hour` and `minute`:
```{r}
flights |>
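  # Sketch of the unpacking step: a sched_dep_time of 1430 becomes
  # hour = 1430 %/% 100 = 14 and minute = 1430 %% 100 = 30.
  mutate(
    hour = sched_dep_time %/% 100,
    minute = sched_dep_time %% 100
  )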

View File

@@ -7,7 +7,7 @@ source("_common.R")
```
Our goal in this part of the book is to give you a rapid overview of the main tools of data science: **importing**, **tidying**, **transforming**, and **visualizing data**, as shown in @fig-ds-whole-game.
-We want to show you the "whole game" of data science giving you just enough of all the major pieces so that you can tackle real, if simple, data sets.
+We want to show you the "whole game" of data science giving you just enough of all the major pieces so that you can tackle real, if simple, datasets.
The later parts of the book will hit each of these topics in more depth, increasing the range of data science challenges that you can tackle.
```{r}