CSV files are designed to be easily read by humans.
They're a good interchange format because they're very simple and they can be read by every tool under the sun.
But CSV files aren't very efficient: you have to do quite a lot of work to read the data into R.
In this chapter, you'll learn about a powerful alternative: the [parquet format](https://parquet.apache.org/), an open standards-based format widely used by big data systems.
We'll pair parquet files with [Apache Arrow](https://arrow.apache.org), a multi-language toolbox designed for efficient analysis and transport of large datasets.
We'll use Apache Arrow via the [arrow package](https://arrow.apache.org/docs/r/), which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax.
But if you're starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet.
In general, it's hard to know what will work best, so in the early stages of your analysis we'd encourage you to try both and pick the one that works the best for you.
In this chapter, we'll continue to use the tidyverse, particularly dplyr, but we'll pair it with the arrow package, which is designed specifically for working with large data.
```{r setup}
#| message: false
#| warning: false
library(tidyverse)
library(arrow)
```
Later in the chapter, we'll also see some connections between arrow and duckdb, so we'll also need dbplyr and duckdb.
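If you'd like to run that code as you go, you can load both packages now (a minimal setup sketch, assuming they're installed):

```{r}
#| message: false
library(dbplyr, warn.conflicts = FALSE)
library(duckdb)
```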
We begin by getting a dataset worthy of these tools: a dataset of item checkouts from Seattle public libraries, available online at [data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6).
We highly recommend using `curl::multi_download()` to get very large files as it's built for exactly this purpose: it gives you a progress bar and it can resume the download if it's interrupted.
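Here's a sketch of what that download could look like. The URL is a placeholder (substitute the CSV export link for the dataset above), and `data/seattle-library-checkouts.csv` is simply the local path we'll assume in the rest of the chapter:

```{r}
#| eval: false
# Create a local data/ directory if it doesn't exist yet
dir.create("data", showWarnings = FALSE)

curl::multi_download(
  "https://example.com/seattle-library-checkouts.csv", # placeholder URL
  "data/seattle-library-checkouts.csv",
  resume = TRUE
)
```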
The `ISBN` column contains blank values for the first 80,000 rows, so we have to specify the column type to help arrow work out the data structure.
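A sketch of the call, assuming the CSV lives at the path used above; `format = "csv"` tells arrow how to parse the file and the `schema()` entry pins `ISBN` to a string:

```{r}
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
  col_types = schema(ISBN = string()), # the first ~80,000 ISBN values are blank
  format = "csv"
)
```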
`open_dataset()` scans a few thousand rows to figure out the structure of the dataset, then records what it's found and stops; it will only read further rows as you specifically request them.
This metadata is what we see if we print `seattle_csv`:
```{r}
seattle_csv
```
The first line in the output tells you that `seattle_csv` is stored locally on-disk as a single CSV file; it will only be loaded into memory as needed.
The remainder of the output tells you the column type that arrow has imputed for each column.
We can see what's actually in it with `glimpse()`.
This reveals that there are \~41 million rows and 12 columns, and shows us a few values.
```{r glimpse-data}
#| cache: true
seattle_csv |> glimpse()
```
We can start to use this dataset with dplyr verbs, using `collect()` to force arrow to perform the computation and return some data.
For example, we could compute the total number of checkouts per year.
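A sketch of that query, using the `CheckoutYear` and `Checkouts` columns we saw in `glimpse()` above; `collect()` at the end is what triggers the actual computation:

```{r}
seattle_csv |>
  group_by(CheckoutYear) |>
  summarize(Checkouts = sum(Checkouts)) |>
  arrange(CheckoutYear) |>
  collect()
```

Because arrow processes the CSV in batches rather than loading it all into memory, this works even on a 9 GB file; it's just fairly slow, which is the problem the rest of the chapter addresses.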
The following sections will first introduce you to parquet and partitioning, and then apply what we learned to the Seattle library data.
### Advantages of parquet
Like CSV, parquet is used for rectangular data, but instead of being a text format that you can read with any file editor, it's a custom binary format designed specifically for the needs of big data.
Parquet files have a number of important advantages:

- Parquet files are usually smaller than the equivalent CSV file.
Parquet relies on [efficient encodings](https://parquet.apache.org/docs/file-format/data-pages/encodings/) to keep file size down, and supports file compression.
This helps make parquet files fast because there's less data to move from disk to memory.
- Parquet files have a rich type system.
As we talked about in @sec-col-types, a CSV file does not provide any information about column types.
For example, a CSV reader has to guess whether `"08-10-2022"` should be parsed as a string or a date.
In contrast, parquet files store data in a way that records the type along with the data.
- Parquet files are "chunked", which makes it possible to work on different parts of the file at the same time, and, if you're lucky, to skip some chunks altogether.
There's one primary disadvantage to parquet files: they are no longer "human readable", i.e. if you look at a parquet file using `readr::read_file()`, you'll just see a bunch of gibberish.
### Partitioning

As datasets get larger and larger, storing all the data in a single file gets increasingly painful, and it's often useful to split large datasets across many files.
When this partitioning is done intelligently, it can lead to significant performance improvements because many analyses will only require a subset of the files.
There are no hard and fast rules about how to partition your dataset: the results will depend on your data, access patterns, and the systems that read the data.
You're likely to need to do some experimentation before you find the ideal partitioning for your situation.
As a rough guide, arrow suggests that you avoid files smaller than 20 MB and larger than 2 GB and avoid partitions that produce more than 10,000 files.
You should also try to partition by variables that you filter by; as you'll see shortly, that allows arrow to skip a lot of work by reading only the relevant files.
### Rewriting the Seattle library data
Let's apply these ideas to the Seattle library data to see how they play out in practice.
We're going to partition by `CheckoutYear`, since it's likely some analyses will only want to look at recent data and partitioning by year yields 18 chunks of a reasonable size.
To rewrite the data we define the partition using `dplyr::group_by()` and then save the partitions to a directory with `arrow::write_dataset()`.
`write_dataset()` has two important arguments: a directory where we'll create the files and the format we'll use.
```{r}
pq_path <- "data/seattle-library-checkouts"
```
```{r write-dataset}
#| eval: !expr "!file.exists(pq_path)"
seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = pq_path, format = "parquet")
```
This takes about a minute to run; as we'll see shortly this is an initial investment that pays off by making future operations much much faster.
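To see what was produced, here's a quick sketch using base R's `list.files()` and `file.size()`:

```{r}
tibble(
  files = list.files(pq_path, recursive = TRUE),
  size_MB = file.size(file.path(pq_path, files)) / 1024^2
)
```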
Our single 9 GB CSV file has been rewritten into 18 parquet files.
The file names use a "self-describing" convention from the [Apache Hive](https://hive.apache.org) project.
Hive-style partitions name folders with a "key=value" convention, so as you might guess, the `CheckoutYear=2005` directory contains all the data where `CheckoutYear` is 2005.
Each file is between 100 and 300 MB and the total size is now around 4 GB, a bit less than half the size of the original CSV file.
This is as we expect since parquet is a much more efficient format.
## Using dplyr with arrow
Now that we've created these parquet files, we'll need to read them in again.
We use `open_dataset()` again, but this time we give it a directory:
```{r}
seattle_pq <- open_dataset(pq_path)
```
Now we can write our dplyr pipeline.
For example, we could count the total number of books checked out in each month for the last five years.
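A sketch of that query, assuming `MaterialType == "BOOK"` identifies book checkouts and that filtering to `CheckoutYear >= 2018` captures the last five years of the data; note that we deliberately stop short of `collect()`:

```{r}
query <- seattle_pq |>
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(CheckoutYear, CheckoutMonth)
```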
Writing dplyr code for arrow data is conceptually similar to dbplyr, @sec-import-databases: you write dplyr code, which is automatically transformed into a query that the Apache Arrow C++ library understands, which is then executed when you call `collect()`.
If we print out the `query` object we can see a little information about what we expect Arrow to return when the execution takes place:
```{r}
query
```
And we can get the results by calling `collect()`:
```{r books-by-year}
query |> collect()
```
Like dbplyr, arrow only understands some R expressions, so you may not be able to write exactly the same code you usually would.
However, the list of operations and functions supported is fairly extensive and continues to grow; find a complete list of currently supported functions in `?acero`.
### Performance {#sec-parquet-fast}
Let's take a quick look at the performance impact of switching from CSV to parquet.
First, let's time how long it takes to calculate the number of books checked out in each month of 2021, when the data is stored as a single large CSV file.
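Here's a sketch of that comparison; each pipeline is wrapped in `system.time()` so the elapsed time is printed, and the exact numbers will of course depend on your machine. First the CSV-backed dataset:

```{r}
#| cache: true
seattle_csv |>
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  group_by(CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutMonth)) |>
  collect() |>
  system.time()
```

And the same query against the partitioned parquet files:

```{r}
seattle_pq |>
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  group_by(CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutMonth)) |>
  collect() |>
  system.time()
```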
The \~100x speedup in performance is attributable to two factors: the multi-file partitioning, and the format of individual files:
- Partitioning improves performance because this query uses `CheckoutYear == 2021` to filter the data, and arrow is smart enough to recognize that it only needs to read 1 of the 18 parquet files.
- The parquet format improves performance by storing data in a binary format that can be read more directly into memory. The column-wise format and rich metadata means that arrow only needs to read the four columns actually used in the query (`CheckoutYear`, `MaterialType`, `CheckoutMonth`, and `Checkouts`).
This massive difference in performance is why it pays off to convert large CSVs to parquet!
There's one last advantage of parquet and arrow --- it's very easy to turn an arrow dataset into a DuckDB database (@sec-import-databases) by calling `arrow::to_duckdb()`.
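As a sketch, here's a variant of the earlier five-year query with a `to_duckdb()` step added, so that everything after that call is executed by duckdb rather than arrow (this assumes dbplyr and duckdb are loaded, as mentioned at the start of the chapter):

```{r}
seattle_pq |>
  to_duckdb() |>
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutYear)) |>
  collect()
```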
The neat thing about `to_duckdb()` is that the transfer doesn't involve any memory copying, and speaks to the goals of the arrow ecosystem: enabling seamless transitions from one computing environment to another.
Far fewer tools can work with parquet files compared to CSV, but their partitioned, compressed, and columnar structure makes them much more efficient to analyze.