Change `data set` to `dataset` (#1282)
- It changes `data set(s)` to `dataset(s)` for consistency throughout the book.
- It adds `# Left` and `# Right` comments for similar side-by-side plots.
This commit is contained in: parent 61a4ce719d, commit 5cfe902d8c
arrow.qmd | 10
@@ -18,7 +18,7 @@ They're a good interchange format because they're very simple and they can be re
 But CSV files aren't very efficient: you have to do quite a lot of work to read the data into R.
 In this chapter, you'll learn about a powerful alternative: the [parquet format](https://parquet.apache.org/), an open standards-based format widely used by big data systems.

-We'll pair parquet files with [Apache Arrow](https://arrow.apache.org), a multi-language toolbox designed for efficient analysis and transport of large data sets.
+We'll pair parquet files with [Apache Arrow](https://arrow.apache.org), a multi-language toolbox designed for efficient analysis and transport of large datasets.
 We'll use Apache Arrow via the [arrow package](https://arrow.apache.org/docs/r/), which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax.
 As an additional benefit, arrow is extremely fast: you'll see some examples later in the chapter.

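For context on what "a dplyr backend" means in practice, here is a minimal sketch (not from the chapter; the file path and column names are illustrative):

```r
library(arrow)
library(dplyr)

# Open a parquet file lazily; nothing is read into memory yet.
flights_ds <- open_dataset("data/flights.parquet")  # illustrative path

# Familiar dplyr verbs build a lazy query against the file on disk...
flights_ds |>
  filter(year == 2022) |>
  group_by(month) |>
  summarize(n = n()) |>
  collect()  # ...and only collect() pulls the (small) result into R
```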
@@ -49,7 +49,7 @@ library(duckdb)

 ## Getting the data

-We begin by getting a dataset worthy of these tools: a data set of item checkouts from Seattle public libraries, available online at [data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6).
+We begin by getting a dataset worthy of these tools: a dataset of item checkouts from Seattle public libraries, available online at [data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6).
 This dataset contains 41,389,465 rows that tell you how many times each book was checked out each month from April 2005 to October 2022.

 The following code will get you a cached copy of the data.
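The download chunk itself is not part of this hunk; a hedged sketch of one way to cache a copy locally follows (the URL is a placeholder, and `curl::multi_download()` is one reasonable choice because it can resume an interrupted download of a file this large):

```r
# Sketch only: the chapter's actual download code is not shown in this diff.
dir.create("data", showWarnings = FALSE)

curl::multi_download(
  "https://example.com/seattle-library-checkouts.csv",  # placeholder URL
  "data/seattle-library-checkouts.csv",
  resume = TRUE
)
```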
@@ -88,7 +88,7 @@ seattle_csv <- open_dataset(
 (Here we've had to use some relatively advanced code to parse the ISBN variable correctly: this is because the first \~83,000 rows don't contain any data so arrow guesses the wrong types. The arrow team is aware of this problem and there will hopefully be a better approach by the time you read this chapter.)

 What happens when this code is run?
-`open_dataset()` will scan a few thousand rows to figure out the structure of the data set.
+`open_dataset()` will scan a few thousand rows to figure out the structure of the dataset.
 Then it records what it's found and stops; it will only read further rows as you specifically request them.
 This metadata is what we see if we print `seattle_csv`:

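The hunk header shows only the opening line of the `open_dataset()` call; the "relatively advanced code" the prose mentions plausibly looks something like the sketch below, which forces `ISBN` to be read as a string instead of letting arrow guess from the empty leading rows (the path and argument values are assumptions, not taken from the chapter):

```r
library(arrow)

seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",  # illustrative path
  col_types = schema(ISBN = string()),             # override arrow's guessed type
  format = "csv"
)

# Printing shows only the inferred schema; no further rows are read yet.
seattle_csv
```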
@@ -152,7 +152,7 @@ This means that:
 As datasets get larger and larger, storing all the data in a single file gets increasingly painful and it's often useful to split large datasets across many files.
 When this structuring is done intelligently, this strategy can lead to significant improvements in performance because many analyses will only require a subset of the files.

-There are no hard and fast rules about how to partition your data set: the results will depend on your data, access patterns, and the systems that read the data.
+There are no hard and fast rules about how to partition your dataset: the results will depend on your data, access patterns, and the systems that read the data.
 You're likely to need to do some experimentation before you find the ideal partitioning for your situation.
 As a rough guide, arrow suggests that you avoid files smaller than 20MB and larger than 2GB and avoid partitions that produce more than 10,000 files.
 You should also try to partition by variables that you filter by; as you'll see shortly, that allows arrow to skip a lot of work by reading only the relevant files.
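As a concrete illustration of that advice, partitioning with arrow is a `group_by()` followed by `write_dataset()`; the partitioning variable used here (`CheckoutYear`) is an assumption based on the checkout data described earlier, not something shown in this hunk:

```r
library(arrow)
library(dplyr)

# Write one directory of parquet files per value of the partitioning variable.
seattle_csv |>
  group_by(CheckoutYear) |>                    # assumed column name
  write_dataset(
    path = "data/seattle-library-checkouts",   # illustrative output directory
    format = "parquet"
  )
```

A later `filter(CheckoutYear == ...)` then only has to touch the files for the requested years, which is why partitioning by variables you filter on pays off.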
@@ -247,7 +247,7 @@ seattle_csv |>
   system.time()
 ```

-Now let's use our new version of the data set in which the Seattle library checkout data has been partitioned into 18 smaller parquet files:
+Now let's use our new version of the dataset in which the Seattle library checkout data has been partitioned into 18 smaller parquet files:

 ```{r dataset-performance-multiple-parquet}
 #| cache: true
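For readers who haven't run the chapter, the query being timed looks roughly like the sketch below; the column names are assumptions based on the Seattle checkouts data, and `collect()` is what forces the work to happen inside `system.time()`:

```r
# The directory written above now holds the 18 smaller parquet files.
seattle_pq <- open_dataset("data/seattle-library-checkouts")

seattle_pq |>
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>  # assumed columns
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  collect() |>
  system.time()
```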
@@ -774,11 +774,13 @@ Compare the following two plots:
 #| fig-height: 3
 #| message: false

+# Left
 ggplot(mpg, aes(x = displ, y = hwy)) +
   geom_point(aes(color = class)) +
   geom_smooth() +
   coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))

+# Right
 mpg |>
   filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) |>
   ggplot(aes(x = displ, y = hwy)) +
@@ -799,9 +801,11 @@ For example, if we extract two classes of cars and plot them separately, it's di
 suv <- mpg |> filter(class == "suv")
 compact <- mpg |> filter(class == "compact")

+# Left
 ggplot(suv, aes(x = displ, y = hwy, color = drv)) +
   geom_point()

+# Right
 ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
   geom_point()
 ```
@@ -817,12 +821,14 @@ x_scale <- scale_x_continuous(limits = range(mpg$displ))
 y_scale <- scale_y_continuous(limits = range(mpg$hwy))
 col_scale <- scale_color_discrete(limits = unique(mpg$drv))

+# Left
 ggplot(suv, aes(x = displ, y = hwy, color = drv)) +
   geom_point() +
   x_scale +
   y_scale +
   col_scale

+# Right
 ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
   geom_point() +
   x_scale +
@@ -663,7 +663,7 @@ diamonds |>
 ### More variables

 It's straightforward to add more variables to the mix.
-For example, maybe you want an easy way to eyeball whether or not a data set is linear by overlaying a smooth line and a straight line:
+For example, maybe you want an easy way to eyeball whether or not a dataset is linear by overlaying a smooth line and a straight line:

 ```{r}
 # https://twitter.com/tyler_js_smith/status/1574377116988104704
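The body of that chunk is cut off in this hunk; a hedged sketch of the kind of helper the sentence describes follows (`check_linearity` and its arguments are hypothetical names, not the book's):

```r
library(ggplot2)

# Overlay a flexible smooth (loess) and a straight line (lm) so departures
# from linearity stand out; {{ }} lets callers pass bare column names.
check_linearity <- function(data, x, y) {
  ggplot(data, aes(x = {{ x }}, y = {{ y }})) +
    geom_point() +
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "steelblue") +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red")
}

# Example use: check_linearity(mpg, displ, hwy)
```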
@@ -57,7 +57,7 @@ Visualizations can surprise you, and they don't scale particularly well because
 **Models** are complementary tools to visualization.
 Once you have made your questions sufficiently precise, you can use a model to answer them.
 Models are a fundamentally mathematical or computational tool, so they generally scale well.
-Even when they don\'t, it\'s usually cheaper to buy more computers than it is to buy more brains!
+Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains!
 But every model makes assumptions, and by its very nature a model cannot question its own assumptions.
 That means a model cannot fundamentally surprise you.

@@ -105,7 +105,7 @@ This book doesn't teach data.table because it has a very concise interface that
 However, the performance payoff is well worth the effort required to learn it if you're working with large data.

 If your data is bigger than this, carefully consider whether your big data problem is actually a small data problem in disguise.
-While the complete data set might be big, often, the data needed to answer a specific question is small.
+While the complete dataset might be big, often, the data needed to answer a specific question is small.
 You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in.
 The challenge here is finding the right small data, which often requires a lot of iteration.

@@ -106,7 +106,7 @@ sqrt(-1)

 So far we've talked about missing values that are **explicitly** missing, i.e. you can see an `NA` in your data.
 But missing values can also be **implicitly** missing, if an entire row of data is simply absent from the data.
-Let's illustrate the difference with a simple data set that records the price of some stock each quarter:
+Let's illustrate the difference with a simple dataset that records the price of some stock each quarter:

 ```{r}
 stocks <- tibble(
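The tibble itself is truncated in this hunk; a toy version with made-up prices illustrates the distinction the prose is drawing (the chapter's actual values differ):

```r
library(tibble)

stocks <- tibble(
  year  = c(2024, 2024, 2024, 2024, 2025, 2025, 2025),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# The price for 2024 Q4 is explicitly missing: its cell contains NA.
# The price for 2025 Q1 is implicitly missing: that row simply isn't there.
```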
@@ -224,7 +224,7 @@ In R, `%/%` does integer division and `%%` computes the remainder:
 1:10 %% 3
 ```

-Modular arithmetic is handy for the flights dataset, because we can use it to unpack the `sched_dep_time` variable into `hour` and `minute`:
+Modular arithmetic is handy for the `flights` dataset, because we can use it to unpack the `sched_dep_time` variable into `hour` and `minute`:

 ```{r}
 flights |>
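The chunk body is truncated above; the operation the sentence describes is just integer division and remainder by 100, since `sched_dep_time` stores times like 5:17 as 517 (the sketch assumes `flights` comes from nycflights13, as the column names suggest):

```r
library(dplyr)
library(nycflights13)  # assumed source of `flights`

flights |>
  mutate(
    hour   = sched_dep_time %/% 100,  # 517 %/% 100 is 5
    minute = sched_dep_time %%  100,  # 517 %%  100 is 17
    .keep = "used"
  )
```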
@@ -7,7 +7,7 @@ source("_common.R")
 ```

 Our goal in this part of the book is to give you a rapid overview of the main tools of data science: **importing**, **tidying**, **transforming**, and **visualizing data**, as shown in @fig-ds-whole-game.
-We want to show you the "whole game" of data science giving you just enough of all the major pieces so that you can tackle real, if simple, data sets.
+We want to show you the "whole game" of data science giving you just enough of all the major pieces so that you can tackle real, if simple, datasets.
 The later parts of the book will hit each of these topics in more depth, increasing the range of data science challenges that you can tackle.

 ```{r}