Change `data set` to `dataset` (#1282)
- It changes `data set(s)` to `dataset(s)` throughout the book, for consistency.
- It adds `# Left` and `# Right` comments to similar side-by-side plots.
parent 61a4ce719d
commit 5cfe902d8c
arrow.qmd: 10 changed lines
@@ -18,7 +18,7 @@ They're a good interchange format because they're very simple and they can be re
 But CSV files aren't very efficient: you have to do quite a lot of work to read the data into R.
 In this chapter, you'll learn about a powerful alternative: the [parquet format](https://parquet.apache.org/), an open standards-based format widely used by big data systems.
 
-We'll pair parquet files with [Apache Arrow](https://arrow.apache.org), a multi-language toolbox designed for efficient analysis and transport of large data sets.
+We'll pair parquet files with [Apache Arrow](https://arrow.apache.org), a multi-language toolbox designed for efficient analysis and transport of large datasets.
 We'll use Apache Arrow via the the [arrow package](https://arrow.apache.org/docs/r/), which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax.
 As an additional benefit, arrow is extremely fast: you'll see some examples later in the chapter.
 
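For orientation, a minimal sketch of the workflow this paragraph describes: open a parquet file lazily with arrow, then query it with dplyr verbs. The file path and column names below are assumptions, not taken from the book.

```r
library(arrow)
library(dplyr)

# Open a parquet file lazily: arrow records the schema but reads no rows yet.
# "checkouts.parquet" and the columns used below are hypothetical.
checkouts <- open_dataset("checkouts.parquet", format = "parquet")

# Familiar dplyr verbs build a query; nothing runs until collect().
checkouts |>
  filter(CheckoutYear >= 2020) |>
  group_by(CheckoutYear) |>
  summarize(n = n()) |>
  collect()  # collect() brings the (small) result into R as a tibble
```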
@@ -49,7 +49,7 @@ library(duckdb)
 
 ## Getting the data
 
-We begin by getting a dataset worthy of these tools: a data set of item checkouts from Seattle public libraries, available online at [data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6).
+We begin by getting a dataset worthy of these tools: a dataset of item checkouts from Seattle public libraries, available online at [data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6).
 This dataset contains 41,389,465 rows that tell you how many times each book was checked out each month from April 2005 to October 2022.
 
 The following code will get you a cached copy of the data.
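The download code itself falls outside this hunk. As a rough sketch of a cached download (the URL below is a placeholder, not the book's; `curl::multi_download()` is one way to resume a large download):

```r
library(curl)

# Placeholder URL; substitute the real source given in the book.
url <- "https://example.com/seattle-library-checkouts.csv"
dest <- "data/seattle-library-checkouts.csv"

dir.create("data", showWarnings = FALSE)

# Only download if we don't already have a local copy; resume = TRUE lets an
# interrupted download of this very large file pick up where it left off.
if (!file.exists(dest)) {
  multi_download(url, dest, resume = TRUE)
}
```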
@@ -88,7 +88,7 @@ seattle_csv <- open_dataset(
 (Here we've had to use some relatively advanced code to parse the ISBN variable correctly: this is because the first \~83,000 rows don't contain any data so arrow guesses the wrong types. The arrow team is aware of this problem and there will hopefully be a better approach by the time you read this chapter.)
 
 What happens when this code is run?
-`open_dataset()` will scan a few thousand rows to figure out the structure of the data set.
+`open_dataset()` will scan a few thousand rows to figure out the structure of the dataset.
 Then it records what it's found and stops; it will only read further rows as you specifically request them.
 This metadata is what we see if we print `seattle_csv`:
 
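The `open_dataset()` call named in the hunk header sits just above this change. A sketch of the general pattern, assuming the CSV lives at the path below and that `col_types` is used to pin the ISBN column to a string (the book's exact arguments may differ):

```r
library(arrow)

# Lazily point arrow at the large CSV. Supplying an explicit type for ISBN
# stops arrow from guessing a wrong type from the empty leading rows.
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
  col_types = schema(ISBN = string()),
  format = "csv"
)

seattle_csv  # printing shows the schema metadata only, not the data
```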
@@ -152,7 +152,7 @@ This means that:
 As datasets get larger and larger, storing all the data in a single file gets increasingly painful and it's often useful to split large datasets across many files.
 When this structuring is done intelligently, this strategy can lead to significant improvements in performance because many analyses will only require a subset of the files.
 
-There are no hard and fast rules about how to partition your data set: the results will depend on your data, access patterns, and the systems that read the data.
+There are no hard and fast rules about how to partition your dataset: the results will depend on your data, access patterns, and the systems that read the data.
 You're likely to need to do some experimentation before you find the ideal partitioning for your situation.
 As a rough guide, arrow suggests that you avoid files smaller than 20MB and larger than 2GB and avoid partitions that produce more than 10,000 files.
 You should also try to partition by variables that you filter by; as you'll see shortly, that allows arrow to skip a lot of work by reading only the relevant files.
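A minimal sketch of partitioned writing with arrow, assuming a `CheckoutYear` column is the variable most queries filter on:

```r
library(arrow)
library(dplyr)

# Write one directory per CheckoutYear (Hive-style CheckoutYear=<year>/),
# so later queries that filter on CheckoutYear only read the matching files.
seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = "data/seattle-library-checkouts", format = "parquet")
```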
@@ -247,7 +247,7 @@ seattle_csv |>
   system.time()
 ```
 
-Now let's use our new version of the data set in which the Seattle library checkout data has been partitioned into 18 smaller parquet files:
+Now let's use our new version of the dataset in which the Seattle library checkout data has been partitioned into 18 smaller parquet files:
 
 ```{r dataset-performance-multiple-parquet}
 #| cache: true
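For context, the timing comparison in that chunk plausibly looks something like this; the dataset path and column names are assumptions:

```r
library(arrow)
library(dplyr)

# Open the partitioned parquet dataset (one file per CheckoutYear).
seattle_pq <- open_dataset("data/seattle-library-checkouts")

# Filtering on the partitioning variable lets arrow skip every file whose
# CheckoutYear doesn't match, so this is much faster than scanning the CSV.
seattle_pq |>
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  group_by(CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  collect() |>
  system.time()
```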
@@ -774,11 +774,13 @@ Compare the following two plots:
 #| fig-height: 3
 #| message: false
 
+# Left
 ggplot(mpg, aes(x = displ, y = hwy)) +
   geom_point(aes(color = class)) +
   geom_smooth() +
   coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
 
+# Right
 mpg |>
   filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) |>
   ggplot(aes(x = displ, y = hwy)) +
@@ -799,9 +801,11 @@ For example, if we extract two classes of cars and plot them separately, it's di
 suv <- mpg |> filter(class == "suv")
 compact <- mpg |> filter(class == "compact")
 
+# Left
 ggplot(suv, aes(x = displ, y = hwy, color = drv)) +
   geom_point()
 
+# Right
 ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
   geom_point()
 ```
@@ -817,12 +821,14 @@ x_scale <- scale_x_continuous(limits = range(mpg$displ))
 y_scale <- scale_y_continuous(limits = range(mpg$hwy))
 col_scale <- scale_color_discrete(limits = unique(mpg$drv))
 
+# Left
 ggplot(suv, aes(x = displ, y = hwy, color = drv)) +
   geom_point() +
   x_scale +
   y_scale +
   col_scale
 
+# Right
 ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
   geom_point() +
   x_scale +
@@ -663,7 +663,7 @@ diamonds |>
 ### More variables
 
 It's straightforward to add more variables to the mix.
-For example, maybe you want an easy way to eyeball whether or not a data set is linear by overlaying a smooth line and a straight line:
+For example, maybe you want an easy way to eyeball whether or not a dataset is linear by overlaying a smooth line and a straight line:
 
 ```{r}
 # https://twitter.com/tyler_js_smith/status/1574377116988104704
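The body of that chunk falls outside the hunk. One possible shape for such a helper, using `{{ }}` to forward user-supplied columns; the function name and colors are illustrative choices, not necessarily the book's:

```r
library(ggplot2)
library(dplyr)

# Overlay a flexible smooth (loess) and a straight line (lm); if the two
# agree, the relationship is roughly linear. {{ }} embraces the column names.
linearity_check <- function(df, x, y) {
  df |>
    ggplot(aes(x = {{ x }}, y = {{ y }})) +
    geom_point() +
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "red") +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "blue")
}

# Example: is highway mileage roughly linear in engine displacement?
linearity_check(mpg, displ, hwy)
```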
@@ -57,7 +57,7 @@ Visualizations can surprise you, and they don't scale particularly well because
 **Models** are complementary tools to visualization.
 Once you have made your questions sufficiently precise, you can use a model to answer them.
 Models are a fundamentally mathematical or computational tool, so they generally scale well.
-Even when they don\'t, it\'s usually cheaper to buy more computers than it is to buy more brains!
+Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains!
 But every model makes assumptions, and by its very nature a model cannot question its own assumptions.
 That means a model cannot fundamentally surprise you.
 
@@ -105,7 +105,7 @@ This book doesn't teach data.table because it has a very concise interface that
 However, the performance payoff is well worth the effort required to learn it if you're working with large data.
 
 If your data is bigger than this, carefully consider whether your big data problem is actually a small data problem in disguise.
-While the complete data set might be big, often, the data needed to answer a specific question is small.
+While the complete dataset might be big, often, the data needed to answer a specific question is small.
 You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in.
 The challenge here is finding the right small data, which often requires a lot of iteration.
 
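A sketch of what "subset, subsample, or summary" can look like in practice; the file path and column names are placeholders:

```r
library(arrow)
library(dplyr)

# Point at a large on-disk dataset without loading it (placeholder path).
big <- open_dataset("data/big-dataset.parquet")

# Subset: keep only the rows and columns the question needs, then collect.
subset_df <- big |>
  filter(region == "west", year == 2022) |>
  select(id, value) |>
  collect()

# Subsample: a small random fraction is often enough for exploration.
sample_df <- subset_df |> slice_sample(prop = 0.01)

# Summary: pre-aggregate on disk and pull only the aggregates into memory.
summary_df <- big |>
  group_by(region, year) |>
  summarize(mean_value = mean(value)) |>
  collect()
```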
@@ -106,7 +106,7 @@ sqrt(-1)
 
 So far we've talked about missing values that are **explicitly** missing, i.e. you can see an `NA` in your data.
 But missing values can also be **implicitly** missing, if an entire row of data is simply absent from the data.
-Let's illustrate the difference with a simple data set that records the price of some stock each quarter:
+Let's illustrate the difference with a simple dataset that records the price of some stock each quarter:
 
 ```{r}
 stocks <- tibble(
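The rest of that tibble falls outside the hunk. To make the explicit/implicit distinction concrete, an illustrative version (values chosen for illustration; the book's may differ) could be:

```r
library(tidyverse)

# One stock price per quarter. 2020 Q4 is explicitly missing (an NA);
# 2021 Q1 is implicitly missing (its row simply isn't present).
stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# complete() turns the implicit missing quarter into an explicit NA row.
stocks |> complete(year, qtr)
```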
@@ -224,7 +224,7 @@ In R, `%/%` does integer division and `%%` computes the remainder:
 1:10 %% 3
 ```
 
-Modular arithmetic is handy for the flights dataset, because we can use it to unpack the `sched_dep_time` variable into `hour` and `minute`:
+Modular arithmetic is handy for the `flights` dataset, because we can use it to unpack the `sched_dep_time` variable into `hour` and `minute`:
 
 ```{r}
 flights |>
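That chunk is truncated by the hunk; the unpacking it describes looks roughly like this, assuming `flights` from nycflights13 and dplyr are loaded:

```r
library(dplyr)
library(nycflights13)

# sched_dep_time stores times like 1430 (i.e. 2:30pm). Integer division by
# 100 gives the hour; the remainder gives the minute.
flights |>
  mutate(
    hour   = sched_dep_time %/% 100,
    minute = sched_dep_time %% 100,
    .keep  = "used"  # show only the columns involved in the computation
  )
```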
@@ -7,7 +7,7 @@ source("_common.R")
 ```
 
 Our goal in this part of the book is to give you a rapid overview of the main tools of data science: **importing**, **tidying**, **transforming**, and **visualizing data**, as shown in @fig-ds-whole-game.
-We want to show you the "whole game" of data science giving you just enough of all the major pieces so that you can tackle real, if simple, data sets.
+We want to show you the "whole game" of data science giving you just enough of all the major pieces so that you can tackle real, if simple, datasets.
 The later parts of the book, will hit each of these topics in more depth, increasing the range of data science challenges that you can tackle.
 
 ```{r}