Re-render book for O'Reilly

Hadley Wickham
2023-01-12 17:22:57 -06:00
parent 28671ed8bd
commit 360d65ae47
113 changed files with 4957 additions and 2997 deletions


@@ -3,7 +3,8 @@
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. In this chapter, youll learn how to read plain-text rectangular files into R.</p>
<p>Working with data provided by R packages is a great way to learn data science tools, but at some point you’ll want to apply what you’ve learned to your own data. In this chapter, you’ll learn the basics of reading data files into R.</p>
<p>Specifically, this chapter will focus on reading plain-text rectangular files. We’ll start with practical advice for handling features like column names, types, and missing data. You will then learn about reading data from multiple files at once and writing data from R to a file. Finally, you’ll learn how to handcraft data frames in R.</p>
<section id="prerequisites" data-type="sect2">
<h2>
@@ -18,7 +19,7 @@ Prerequisites</h2>
<section id="reading-data-from-a-file" data-type="sect1">
<h1>
Reading data from a file</h1>
<p>To begin well focus on the most rectangular data file type: the CSV, short for comma separate values. Here is what a simple CSV file looks like. The first row, commonly called the header row, gives the column names, and the following six rows give the data.</p>
<p>To begin, we’ll focus on the most common rectangular data file type: CSV, which is short for comma-separated values. Here is what a simple CSV file looks like. The first row, commonly called the header row, gives the column names, and the following six rows provide the data.</p>
<div class="cell">
<pre><code>#&gt; Student ID,Full Name,favourite.food,mealPlan,AGE
#&gt; 1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
@@ -83,13 +84,13 @@ Reading data from a file</h1>
#&gt; Use `spec()` to retrieve the full column specification for this data.
#&gt; Specify the column types or set `show_col_types = FALSE` to quiet this message.</pre>
</div>
<p>When you run <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about how to retrieve the full column specification as well as how to quiet this message. This message is an important part of readr and well come back to in <a href="#sec-col-types" data-type="xref">#sec-col-types</a>.</p>
<p>When you run <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>, it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about retrieving the full column specification and how to quiet this message. This message is an integral part of readr, and we’ll return to it in <a href="#sec-col-types" data-type="xref">#sec-col-types</a>.</p>
<section id="practical-advice" data-type="sect2">
<h2>
Practical advice</h2>
<p>Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. Let’s take another look at the <code>students</code> data with that in mind.</p>
<p>In the <code>favourite.food</code> column, there are a bunch of food items and then the character string <code>N/A</code>, which should have been an real <code>NA</code> that R will recognize as “not available”. This is something we can address using the <code>na</code> argument.</p>
<p>In the <code>favourite.food</code> column, there are a bunch of food items, and then the character string <code>N/A</code>, which should have been a real <code>NA</code> that R will recognize as “not available”. This is something we can address using the <code>na</code> argument.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">students &lt;- read_csv("data/students.csv", na = c("N/A", ""))
@@ -104,7 +105,7 @@ students
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>You might also notice that the <code>Student ID</code> and <code>Full Name</code> columns are surrounded by back ticks. Thats because they contain spaces, breaking Rs usual rules for variable names. To refer to them, you need to use those back ticks:</p>
<p>You might also notice that the <code>Student ID</code> and <code>Full Name</code> columns are surrounded by backticks. That’s because they contain spaces, breaking R’s usual rules for variable names. To refer to them, you need to use those backticks:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">students |&gt;
rename(
@@ -134,7 +135,7 @@ students
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>Another common task after reading in data is to consider variable types. For example, <code>meal_type</code> is a categorical variable with a known set of possible values, which in R should be represent as factor:</p>
<p>Another common task after reading in data is to consider variable types. For example, <code>meal_type</code> is a categorical variable with a known set of possible values, which in R should be represented as a factor:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">students |&gt;
janitor::clean_names() |&gt;
@@ -151,8 +152,8 @@ students
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>Note that the values in the <code>meal_type</code> variable has stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (<code>&lt;chr&gt;</code>) to factor (<code>&lt;fct&gt;</code>). Youll learn more about factors in <a href="#chp-factors" data-type="xref">#chp-factors</a>.</p>
<p>Before you move on to analyzing these data, youll probably want to fix the <code>age</code> column as well: currently its a character variable because of the one observation that is typed out as <code>five</code> instead of a numeric <code>5</code>. We discuss the details of fixing this issue in <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a>.</p>
<p>Note that the values in the <code>meal_type</code> variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (<code>&lt;chr&gt;</code>) to factor (<code>&lt;fct&gt;</code>). You’ll learn more about factors in <a href="#chp-factors" data-type="xref">#chp-factors</a>.</p>
<p>Before you analyze these data, you’ll probably want to fix the <code>age</code> column. Currently, it’s a character variable because one of the observations is typed out as <code>five</code> instead of a numeric <code>5</code>. We discuss the details of fixing this issue in <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">students &lt;- students |&gt;
janitor::clean_names() |&gt;
@@ -177,7 +178,7 @@ students
<section id="other-arguments" data-type="sect2">
<h2>
Other arguments</h2>
<p>There are a couple of other important arguments that we need to mention, and theyll be easier to demonstrate if we first show you a handy trick: <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> can read csv files that youve created in a string:</p>
<p>There are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> can read CSV files that you’ve created in a string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_csv(
"a,b,c
@@ -190,7 +191,7 @@ Other arguments</h2>
#&gt; 1 1 2 3
#&gt; 2 4 5 6</pre>
</div>
<p>Usually <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> uses the first line of the data for the column names, which is a very common convention. But sometime there are a few lines of metadata at the top of the file. You can use <code>skip = n</code> to skip the first <code>n</code> lines or use <code>comment = "#"</code> to drop all lines that start with (e.g.) <code>#</code>:</p>
<p>Usually, <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> uses the first line of the data for the column names, which is a very common convention. But it’s not uncommon for a few lines of metadata to be included at the top of the file. You can use <code>skip = n</code> to skip the first <code>n</code> lines or use <code>comment = "#"</code> to drop all lines that start with (e.g.) <code>#</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_csv(
"The first line of metadata
@@ -215,7 +216,7 @@ read_csv(
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 2 3</pre>
</div>
<p>In other cases, the data might not have column names. You can use <code>col_names = FALSE</code> to tell <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> not to treat the first row as headings, and instead label them sequentially from <code>X1</code> to <code>Xn</code>:</p>
<p>In other cases, the data might not have column names. You can use <code>col_names = FALSE</code> to tell <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> not to treat the first row as headings and instead label them sequentially from <code>X1</code> to <code>Xn</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_csv(
"1,2,3
@@ -228,7 +229,7 @@ read_csv(
#&gt; 1 1 2 3
#&gt; 2 4 5 6</pre>
</div>
<p>Alternatively you can pass <code>col_names</code> a character vector which will be used as the column names:</p>
<p>Alternatively, you can pass <code>col_names</code> a character vector which will be used as the column names:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_csv(
"1,2,3
@@ -241,19 +242,19 @@ read_csv(
#&gt; 1 1 2 3
#&gt; 2 4 5 6</pre>
</div>
<p>These arguments are all you need to know to read the majority of CSV files that youll encounter in practice. (For the rest, youll need to carefully inspect your <code>.csv</code> file and carefully read the documentation for <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>s many other arguments.)</p>
<p>These arguments are all you need to know to read the majority of CSV files that you’ll encounter in practice. (For the rest, you’ll need to carefully inspect your <code>.csv</code> file and read the documentation for <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>’s many other arguments.)</p>
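<p>As a rough sketch of how these arguments combine, the following call reads a small made-up string (not one of the book’s data files) that has a comment line at the top and uses <code>.</code> for a missing value. Wrapping the string in <code>I()</code> marks it as literal data, which assumes readr 2.0 or later:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)

# Hypothetical inline data: one comment line, a header row, two data rows
csv &lt;- "# exported by hand\nid,score\n1,10\n2,."

scores &lt;- read_csv(
  I(csv),
  comment = "#",          # drop the line that starts with #
  na = ".",               # treat a lone . as missing
  show_col_types = FALSE  # quiet the column specification message
)

scores$score  # the second value is NA once . is read as missing</pre>
</div>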
</section>
<section id="other-file-types" data-type="sect2">
<h2>
Other file types</h2>
<p>Once you’ve mastered <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>, using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for:</p>
<ul><li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv2()</a></code> reads semicolon separated files. These use <code>;</code> instead of <code>,</code> to separate fields, and are common in countries that use <code>,</code> as the decimal marker.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_tsv()</a></code> reads tab delimited files.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_delim()</a></code> reads in files with any delimiter, attempting to automatically guess the delimited if you dont specify it.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_fwf.html">read_fwf()</a></code> reads fixed width files. You can specify fields either by their widths with <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">fwf_widths()</a></code> or their position with <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">fwf_positions()</a></code>.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_table.html">read_table()</a></code> reads a common variation of fixed width files where columns are separated by white space.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_log.html">read_log()</a></code> reads Apache style log files.</p></li>
<ul><li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv2()</a></code> reads semicolon-separated files. These use <code>;</code> instead of <code>,</code> to separate fields and are common in countries that use <code>,</code> as the decimal marker.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_tsv()</a></code> reads tab-delimited files.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_delim()</a></code> reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_fwf.html">read_fwf()</a></code> reads fixed-width files. You can specify fields by their widths with <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">fwf_widths()</a></code> or by their positions with <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">fwf_positions()</a></code>.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_table.html">read_table()</a></code> reads a common variation of fixed-width files where columns are separated by white space.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_log.html">read_log()</a></code> reads Apache-style log files.</p></li>
</ul></section>
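<p>For illustration, here is a minimal sketch of three of the functions listed above applied to tiny made-up strings. The values and the <code>|</code> delimiter are invented for the example, and wrapping each string in <code>I()</code> to mark it as literal data assumes readr 2.0 or later:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)

# Semicolon-separated fields, with , as the decimal marker
semi &lt;- read_csv2(I("a;b\n1,5;2"), show_col_types = FALSE)

# Tab-delimited fields
tabbed &lt;- read_tsv(I("a\tb\n1\t2"), show_col_types = FALSE)

# Any other delimiter, stated explicitly rather than guessed
piped &lt;- read_delim(I("a|b\n1|2"), delim = "|", show_col_types = FALSE)

semi$a  # 1.5, because read_csv2() reads the comma as a decimal marker</pre>
</div>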
<section id="exercises" data-type="sect2">
@@ -263,7 +264,7 @@ Exercises</h2>
<li><p>Apart from <code>file</code>, <code>skip</code>, and <code>comment</code>, what other arguments do <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_tsv()</a></code> have in common?</p></li>
<li><p>What are the most important arguments to <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">read_fwf()</a></code>?</p></li>
<li>
<p>Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like <code>"</code> or <code>'</code>. By default, <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> assumes that the quoting character will be <code>"</code>. What argument to <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> do you need to specify to read the following text into a data frame?</p>
<p>Sometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like <code>"</code> or <code>'</code>. By default, <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> assumes that the quoting character will be <code>"</code>. To read the following text into a data frame, what argument to <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> do you need to specify?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">"x,y\n1,'a,b'"</pre>
</div>
@@ -281,9 +282,9 @@ read_csv("a;b\n1;3")</pre>
<li>
<p>Practice referring to non-syntactic names in the following data frame by:</p>
<ol type="a"><li>Extracting the variable called <code>1</code>.</li>
<li>Plotting a scatterplot of <code>1</code> vs <code>2</code>.</li>
<li>Creating a new column called <code>3</code> which is <code>2</code> divided by <code>1</code>.</li>
<li>Renaming the columns to <code>one</code>, <code>two</code> and <code>three</code>.</li>
<li>Plotting a scatterplot of <code>1</code> vs. <code>2</code>.</li>
<li>Creating a new column called <code>3</code>, which is <code>2</code> divided by <code>1</code>.</li>
<li>Renaming the columns to <code>one</code>, <code>two</code>, and <code>three</code>.</li>
</ol><div class="cell">
<pre data-type="programlisting" data-code-language="r">annoying &lt;- tibble(
`1` = 1:10,
@@ -297,15 +298,15 @@ read_csv("a;b\n1;3")</pre>
<section id="sec-col-types" data-type="sect1">
<h1>
Controlling column types</h1>
<p>A CSV file doesnt contain any information about the type of each variable (i.e. whether its a logical, number, string, etc), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and if needed, how to supply the column types yourself. Finally, well mention a couple of general strategies that are a useful if readr is failing catastrophically and you need to get more insight in to the structure of your file.</p>
<p>A CSV file doesn’t contain any information about the type of each variable (i.e., whether it’s a logical, number, string, etc.), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself. Finally, we’ll mention a few general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.</p>
<section id="guessing-types" data-type="sect2">
<h2>
Guessing types</h2>
<p>readr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,000<span data-type="footnote">You can override the default of 1000 with the <code>guess_max</code> argument.</span> rows spaced evenly from the first row to the last, ignoring an missing values. It then works through the following questions:</p>
<p>readr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,000<span data-type="footnote">You can override the default of 1000 with the <code>guess_max</code> argument.</span> rows spaced evenly from the first row to the last, ignoring missing values. It then works through the following questions:</p>
<ul><li>Does it contain only <code>F</code>, <code>T</code>, <code>FALSE</code>, or <code>TRUE</code> (ignoring case)? If so, it’s a logical.</li>
<li>Does it contain only numbers (e.g. <code>1</code>, <code>-4.5</code>, <code>5e6</code>, <code>Inf</code>)? If so, its a number.</li>
<li>Does it match match the ISO8601 standard? If so, its a date or date-time. (Well come back to date/times in more detail in <a href="#sec-creating-datetimes" data-type="xref">#sec-creating-datetimes</a>).</li>
<li>Does it contain only numbers (e.g., <code>1</code>, <code>-4.5</code>, <code>5e6</code>, <code>Inf</code>)? If so, it’s a number.</li>
<li>Does it match the ISO8601 standard? If so, it’s a date or date-time. (We’ll return to date-times in more detail in <a href="#sec-creating-datetimes" data-type="xref">#sec-creating-datetimes</a>).</li>
<li>Otherwise, it must be a string.</li>
</ul><p>You can see that behavior in action in this simple example:</p>
<div class="cell">
@@ -332,13 +333,13 @@ Guessing types</h2>
#&gt; 2 FALSE 4.5 2021-02-15 def
#&gt; 3 TRUE Inf 2021-02-16 ghi</pre>
</div>
<p>This heuristic works well if you have a clean dataset, but in real life youll encounter a selection of weird and wonderful failures.</p>
<p>This heuristic works well if you have a clean dataset, but in real life, you’ll encounter a selection of weird and wonderful failures.</p>
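<p>If you want to check what readr would guess for a few values without reading a whole file, one option is <code>guess_parser()</code>, which takes a character vector and returns the name of the guessed type. The inputs below are made up for illustration, and the result may not match every detail of how <code>read_csv()</code> samples rows from a real file:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)

# Each call returns the guessed type as a string
logical_guess &lt;- guess_parser(c("TRUE", "FALSE"))  # "logical"
number_guess  &lt;- guess_parser(c("1", "-4.5"))      # "double"
date_guess    &lt;- guess_parser("2021-02-14")        # "date"

# One value that is not a number is enough to fall back to a string
mixed_guess   &lt;- guess_parser(c("abc", "4.5"))     # "character"</pre>
</div>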
</section>
<section id="missing-values-column-types-and-problems" data-type="sect2">
<h2>
Missing values, column types, and problems</h2>
<p>The most common way column detection fails is that a column contains unexpected values and you get a character column instead of a more specific type. One of the most common causes for this a missing value, recorded using something other than the <code>NA</code> that stringr expects.</p>
<p>The most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type. One of the most common causes for this is a missing value, recorded using something other than the <code>NA</code> that readr expects.</p>
<p>Take this simple one-column CSV file as an example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">csv &lt;- "
@@ -359,7 +360,7 @@ Missing values, column types, and problems</h2>
#&gt; Use `spec()` to retrieve the full column specification for this data.
#&gt; Specify the column types or set `show_col_types = FALSE` to quiet this message.</pre>
</div>
<p>In this very small case, you can easily see the missing value <code>.</code>. But what happens if you have thousands of rows with only a few missing values represented by <code>.</code>s speckled amongst them? One approach is to tell readr that <code>x</code> is a numeric column, and then see where it fails. You can do that with the <code>col_types</code> argument, which takes a named list:</p>
<p>In this very small case, you can easily see the missing value <code>.</code>. But what happens if you have thousands of rows with only a few missing values represented by <code>.</code>s speckled among them? One approach is to tell readr that <code>x</code> is a numeric column, and then see where it fails. You can do that with the <code>col_types</code> argument, which takes a named list:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- read_csv(csv, col_types = list(x = col_double()))
#&gt; Warning: One or more parsing issues, call `problems()` on your data frame for
@@ -371,9 +372,9 @@ Missing values, column types, and problems</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">problems(df)
#&gt; # A tibble: 1 × 5
#&gt; row col expected actual file
#&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 3 1 a double . /private/tmp/RtmpZYGhlj/file9e8176037b8c</pre>
#&gt; row col expected actual file
#&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 3 1 a double . /private/tmp/Rtmp1nE0XP/file11b88112257a4</pre>
</div>
<p>This tells us that there was a problem in row 3, col 1 where readr expected a double but got a <code>.</code>. That suggests this dataset uses <code>.</code> for missing values. So when we set <code>na = "."</code>, the automatic guessing succeeds, giving us the numeric column that we want:</p>
<div class="cell">
@@ -395,11 +396,11 @@ Column types</h2>
<ul><li>
<code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_logical()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_double()</a></code> read logicals and real numbers. They’re relatively rarely needed (except as above), since readr will usually guess them for you.</li>
<li>
<code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_integer()</a></code> reads integers. We distinguish because integers and doubles in this book because theyre functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.</li>
<code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_integer()</a></code> reads integers. We rarely distinguish integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.</li>
<li>
<code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_character()</a></code> reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e., a long series of digits that identifies some object but that it doesn’t make sense to (e.g.) divide in half.</li>
<li>
<code><a href="https://readr.tidyverse.org/reference/parse_factor.html">col_factor()</a></code>, <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_date()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_datetime()</a></code> create factors, dates and date-time respectively; youll learn more about those when we get to those data types in <a href="#chp-factors" data-type="xref">#chp-factors</a> and <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a>.</li>
<code><a href="https://readr.tidyverse.org/reference/parse_factor.html">col_factor()</a></code>, <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_date()</a></code>, and <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_datetime()</a></code> create factors, dates, and date-times respectively; you’ll learn more about those when we get to those data types in <a href="#chp-factors" data-type="xref">#chp-factors</a> and <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a>.</li>
<li>
<code><a href="https://readr.tidyverse.org/reference/parse_number.html">col_number()</a></code> is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You’ll learn more about it in <a href="#chp-numbers" data-type="xref">#chp-numbers</a>.</li>
<li>
@@ -498,7 +499,7 @@ read_csv("students-2.csv")
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>This makes CSVs a little unreliable for caching interim results—you need to recreate the column specification every time you load in. There are two main options:</p>
<p>This makes CSVs a little unreliable for caching interim results: you need to recreate the column specification every time you load them in. There are two main alternatives:</p>
<ol type="1"><li>
<p><code><a href="https://readr.tidyverse.org/reference/read_rds.html">write_rds()</a></code> and <code><a href="https://readr.tidyverse.org/reference/read_rds.html">read_rds()</a></code> are uniform wrappers around the base functions <code><a href="https://rdrr.io/r/base/readRDS.html">readRDS()</a></code> and <code><a href="https://rdrr.io/r/base/readRDS.html">saveRDS()</a></code>. These store data in R’s custom binary format called RDS:</p>
<div class="cell">
@@ -516,7 +517,7 @@ read_rds("students.rds")
</div>
</li>
<li>
<p>The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages:</p>
<p>The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages. We’ll return to arrow in more depth in <a href="#chp-arrow" data-type="xref">#chp-arrow</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(arrow)
write_parquet(students, "students.parquet")
@@ -532,7 +533,7 @@ read_parquet("students.parquet")
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
</li>
</ol><p>Parquet tends to be much faster than RDS and is usable outside of R, but does require you install the arrow package.</p>
</ol><p>Parquet tends to be much faster than RDS and is usable outside of R, but does require the arrow package.</p>
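<p>To make the difference concrete, here is a small sketch that writes the same made-up tibble to a CSV file and an RDS file in temporary locations and then compares what comes back. The data and file paths are invented for the example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)
library(tibble)

# Hypothetical data with an integer column and a factor column
meals &lt;- tibble(
  id = 1:3,
  plan = factor(c("Lunch only", "Breakfast and lunch", "Lunch only"))
)

csv_path &lt;- tempfile(fileext = ".csv")
rds_path &lt;- tempfile(fileext = ".rds")

write_csv(meals, csv_path)
write_rds(meals, rds_path)

from_csv &lt;- read_csv(csv_path, show_col_types = FALSE)
from_rds &lt;- read_rds(rds_path)

class(from_csv$plan)  # "character": the factor was lost in the plain-text file
class(from_rds$plan)  # "factor": RDS keeps the R type</pre>
</div>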
</section>
<section id="data-entry" data-type="sect1">
@@ -586,7 +587,7 @@ Data entry</h1>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, youve learned how to load CSV files with <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and to do your own data entry with <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>. Youve learned how csv files work, some of the problems you might encounter, and how to overcome them. Well come to data import a few times in this book: <a href="#chp-databases" data-type="xref">#chp-databases</a> will show you how to load data from databases, <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a> from Excel and googlesheets, <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a> from JSON, and <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a> from websites.</p>
<p>In this chapter, you’ve learned how to load CSV files with <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and to do your own data entry with <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>. You’ve learned how CSV files work, some of the problems you might encounter, and how to overcome them. We’ll come back to data import a few times in this book: <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a> will show you how to load data from Excel and Google Sheets, <a href="#chp-databases" data-type="xref">#chp-databases</a> from databases, <a href="#chp-arrow" data-type="xref">#chp-arrow</a> from parquet files, <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a> from JSON, and <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a> from websites.</p>
<p>Now that you’re writing a substantial amount of R code, it’s time to learn more about organizing your code into files and directories. In the next chapter, you’ll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.</p>