More minor page count tweaks & fixes

And re-convert with latest htmlbook
This commit is contained in:
Hadley Wickham
2023-01-26 10:36:07 -06:00
parent d9afa135fc
commit aa9d72a7c6
38 changed files with 838 additions and 1093 deletions


@@ -1,13 +1,13 @@
<section data-type="chapter" id="chp-arrow">
<h1><span id="sec-arrow" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Arrow</span></span></h1>
-<section id="introduction" data-type="sect1">
+<section id="arrow-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>CSV files are designed to be easily read by humans. They're a good interchange format because they're very simple and they can be read by every tool under the sun. But CSV files aren't very efficient: you have to do quite a lot of work to read the data into R. In this chapter, you'll learn about a powerful alternative: the <a href="https://parquet.apache.org/">parquet format</a>, an open standards-based format widely used by big data systems.</p>
<p>We'll pair parquet files with <a href="https://arrow.apache.org">Apache Arrow</a>, a multi-language toolbox designed for efficient analysis and transport of large data sets. We'll use Apache Arrow via the <a href="https://arrow.apache.org/docs/r/">arrow package</a>, which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax. As an additional benefit, arrow is extremely fast: you'll see some examples later in the chapter.</p>
<p>Both arrow and dbplyr provide dplyr backends, so you might wonder when to use each. In many cases, the choice is made for you: the data is already in a database or in parquet files, and you'll want to work with it as is. But if you're starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet. In general, it's hard to know what will work best, so in the early stages of your analysis we'd encourage you to try both and pick the one that works best for you.</p>
-<section id="prerequisites" data-type="sect2">
+<section id="arrow-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we'll continue to use the tidyverse, particularly dplyr, but we'll pair it with the arrow package, which is designed specifically for working with large data.</p>
@@ -272,7 +272,7 @@ Using dbplyr with arrow</h2>
</section>
</section>
-<section id="summary" data-type="sect1">
+<section id="arrow-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you've been given a taste of the arrow package, which provides a dplyr backend for working with large on-disk datasets. While it can work with CSV files, it's much, much faster if you convert your data to parquet. Parquet is a binary data format that's designed specifically for data analysis on modern computers. Far fewer tools can work with parquet files compared to CSV, but its partitioned, compressed, and columnar structure makes it much more efficient to analyze.</p>