Fix code language

Hadley Wickham
2022-11-18 11:26:25 -06:00
parent 69b4597f3b
commit 868a35ca71
29 changed files with 912 additions and 907 deletions

@@ -16,7 +16,7 @@ Excel</h1>
Prerequisites</h2>
<p>In this chapter, you’ll learn how to load data from Excel spreadsheets in R with the <strong>readxl</strong> package. This package is non-core tidyverse, so you need to load it explicitly, but it is installed automatically when you install the tidyverse package.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readxl)
<pre data-type="programlisting" data-code-language="r">library(readxl)
library(tidyverse)</pre>
</div>
<p><strong>xlsx</strong> and <strong>XLConnect</strong> can be used for reading data from and writing data to Excel spreadsheets. However, these two packages require Java to be installed on your machine, along with the rJava package. Due to potential challenges with installation, we recommend using the alternative packages we’ve introduced in this chapter.</p>
@@ -49,11 +49,11 @@ Reading spreadsheets</h2>
</div>
<p>The first argument to <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> is the path to the file to read.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students &lt;- read_excel("data/students.xlsx")</pre>
<pre data-type="programlisting" data-code-language="r">students &lt;- read_excel("data/students.xlsx")</pre>
</div>
<p><code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> will read the file in as a tibble.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students
<pre data-type="programlisting" data-code-language="r">students
#&gt; # A tibble: 6 × 5
#&gt; `Student ID` `Full Name` favourite.food mealPlan AGE
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
@@ -68,7 +68,7 @@ Reading spreadsheets</h2>
<ol type="1"><li>
<p>The column names are all over the place. You can provide column names that follow a consistent format using the <code>col_names</code> argument; we recommend <code>snake_case</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(
<pre data-type="programlisting" data-code-language="r">read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age")
)
@@ -85,7 +85,7 @@ Reading spreadsheets</h2>
</div>
<p>Unfortunately, this didn’t quite do the trick. You now have the variable names we want, but what was previously the header row now shows up as the first observation in the data. You can explicitly skip that row using the <code>skip</code> argument.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(
<pre data-type="programlisting" data-code-language="r">read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1
@@ -104,7 +104,7 @@ Reading spreadsheets</h2>
<li>
<p>In the <code>favourite_food</code> column, one of the observations is <code>N/A</code>, which stands for “not available”, but it’s currently not recognized as an <code>NA</code> (note the contrast between this <code>N/A</code> and the age of the fourth student in the list). You can specify which character strings should be recognized as <code>NA</code>s with the <code>na</code> argument. By default, only <code>""</code> (empty string, or, in the case of reading from a spreadsheet, an empty cell) is recognized as an <code>NA</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(
<pre data-type="programlisting" data-code-language="r">read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1,
@@ -124,7 +124,7 @@ Reading spreadsheets</h2>
<li>
<p>One other remaining issue is that <code>age</code> is read in as a character variable, but it really should be numeric. Just like with <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and friends for reading data from flat files, you can supply a <code>col_types</code> argument to <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> and specify the column types for the variables you read in. The syntax is a bit different, though. Your options are <code>"skip"</code>, <code>"guess"</code>, <code>"logical"</code>, <code>"numeric"</code>, <code>"date"</code>, <code>"text"</code> or <code>"list"</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(
<pre data-type="programlisting" data-code-language="r">read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1,
@@ -144,7 +144,7 @@ Reading spreadsheets</h2>
</div>
<p>However, this didn’t quite produce the desired result either. By specifying that <code>age</code> should be numeric, we have turned the one cell with the non-numeric entry (which had the value <code>five</code>) into an <code>NA</code>. In this case, we should read age in as <code>"text"</code> and then make the change once the data is loaded in R.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students &lt;- read_excel(
<pre data-type="programlisting" data-code-language="r">students &lt;- read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1,
@@ -187,7 +187,7 @@ Reading individual sheets</h2>
</div>
<p>You can read a single sheet from a spreadsheet with the <code>sheet</code> argument in <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
<pre data-type="programlisting" data-code-language="r">read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
#&gt; # A tibble: 52 × 8
#&gt; species island bill_length_mm bill_dep…¹ flipp…² body_…³ sex year
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
@@ -202,7 +202,7 @@ Reading individual sheets</h2>
</div>
<p>Some variables that appear to contain numerical data are read in as characters due to the character string <code>"NA"</code> not being recognized as a true <code>NA</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">penguins_torgersen &lt;- read_excel("data/penguins.xlsx", sheet = "Torgersen Island", na = "NA")
<pre data-type="programlisting" data-code-language="r">penguins_torgersen &lt;- read_excel("data/penguins.xlsx", sheet = "Torgersen Island", na = "NA")
penguins_torgersen
#&gt; # A tibble: 52 × 8
@@ -219,17 +219,17 @@ penguins_torgersen
</div>
<p>However, we cheated here a bit. We looked inside the Excel spreadsheet, which is not a recommended workflow. Instead, you can use <code><a href="https://readxl.tidyverse.org/reference/excel_sheets.html">excel_sheets()</a></code> to get information on all sheets in an Excel spreadsheet, and then read the one(s) you’re interested in.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">excel_sheets("data/penguins.xlsx")
<pre data-type="programlisting" data-code-language="r">excel_sheets("data/penguins.xlsx")
#&gt; [1] "Torgersen Island" "Biscoe Island" "Dream Island"</pre>
</div>
<p>Once you know the names of the sheets, you can read them in individually with <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">penguins_biscoe &lt;- read_excel("data/penguins.xlsx", sheet = "Biscoe Island", na = "NA")
<pre data-type="programlisting" data-code-language="r">penguins_biscoe &lt;- read_excel("data/penguins.xlsx", sheet = "Biscoe Island", na = "NA")
penguins_dream &lt;- read_excel("data/penguins.xlsx", sheet = "Dream Island", na = "NA")</pre>
</div>
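<p>When a workbook has many sheets, reading each one by name gets tedious. The following is a minimal sketch, not part of the original workflow, that loops over the result of <code>excel_sheets()</code> and reads every sheet into a named list. So that it runs without <code>data/penguins.xlsx</code>, it first writes a small hypothetical two-sheet workbook to a temporary file with <code>write_xlsx()</code> from the writexl package; the sheet contents are made up for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readxl)
library(writexl)

# Hypothetical stand-in for data/penguins.xlsx: two sheets in a temp file
path &lt;- tempfile(fileext = ".xlsx")
write_xlsx(
  list(
    "Torgersen Island" = data.frame(species = "Adelie", year = c(2007, 2008)),
    "Dream Island" = data.frame(species = c("Adelie", "Chinstrap", "Chinstrap"), year = 2009)
  ),
  path
)

# Discover the sheet names instead of opening the file in Excel
sheet_names &lt;- excel_sheets(path)

# Read every sheet; the list names record which sheet each tibble came from
sheets &lt;- lapply(sheet_names, function(s) read_excel(path, sheet = s, na = "NA"))
names(sheets) &lt;- sheet_names

# Number of rows read from each sheet
sapply(sheets, nrow)</pre>
</div>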
<p>In this case the full penguins dataset is spread across three sheets in the spreadsheet. Each sheet has the same number of columns but different numbers of rows.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">dim(penguins_torgersen)
<pre data-type="programlisting" data-code-language="r">dim(penguins_torgersen)
#&gt; [1] 52 8
dim(penguins_biscoe)
#&gt; [1] 168 8
@@ -238,7 +238,7 @@ dim(penguins_dream)
</div>
<p>We can put them together with <code><a href="https://dplyr.tidyverse.org/reference/bind_rows.html">bind_rows()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">penguins &lt;- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
<pre data-type="programlisting" data-code-language="r">penguins &lt;- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
penguins
#&gt; # A tibble: 344 × 8
#&gt; species island bill_length_mm bill_depth_mm flippe…¹ body_…² sex year
@@ -269,7 +269,7 @@ Reading part of a sheet</h2>
</div>
<p>This spreadsheet is one of the example spreadsheets provided in the readxl package. You can use the <code><a href="https://readxl.tidyverse.org/reference/readxl_example.html">readxl_example()</a></code> function to locate the spreadsheet on your system in the directory where the package is installed. This function returns the path to the spreadsheet, which you can use in <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> as usual.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">deaths_path &lt;- readxl_example("deaths.xlsx")
<pre data-type="programlisting" data-code-language="r">deaths_path &lt;- readxl_example("deaths.xlsx")
deaths &lt;- read_excel(deaths_path)
#&gt; New names:
#&gt; • `` -&gt; `...2`
@@ -292,7 +292,7 @@ deaths
<p>The top three rows and the bottom four rows are not part of the data frame.</p>
<p>We could skip the extraneous rows at the top with <code>skip</code>. Note that we set <code>skip = 4</code>, not 3: the output above already used the sheet’s first row as column names, so four rows precede the row that holds the real column names.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, skip = 4)
<pre data-type="programlisting" data-code-language="r">read_excel(deaths_path, skip = 4)
#&gt; # A tibble: 14 × 6
#&gt; Name Profession Age `Has kids` `Date of birth` Date of dea…¹
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dttm&gt; &lt;chr&gt;
@@ -306,7 +306,7 @@ deaths
</div>
<p>We could also set <code>n_max</code> to omit the extraneous rows at the bottom.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, skip = 4, n_max = 10)
<pre data-type="programlisting" data-code-language="r">read_excel(deaths_path, skip = 4, n_max = 10)
#&gt; # A tibble: 10 × 6
#&gt; Name Profe…¹ Age Has k…² `Date of birth` `Date of death`
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;dttm&gt; &lt;dttm&gt;
@@ -324,19 +324,19 @@ deaths
<ul><li>
<p>Supply this information to the <code>range</code> argument:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, range = "A5:F15")</pre>
<pre data-type="programlisting" data-code-language="r">read_excel(deaths_path, range = "A5:F15")</pre>
</div>
</li>
<li>
<p>Specify rows:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, range = cell_rows(c(5, 15)))</pre>
<pre data-type="programlisting" data-code-language="r">read_excel(deaths_path, range = cell_rows(c(5, 15)))</pre>
</div>
</li>
<li>
<p>Specify cells that mark the top-left and bottom-right corners of the data. The top-left corner, <code>A5</code>, translates to <code>c(5, 1)</code> (5th row down, 1st column) and the bottom-right corner, <code>F15</code>, translates to <code>c(15, 6)</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, range = cell_limits(c(5, 1), c(15, 6)))</pre>
<pre data-type="programlisting" data-code-language="r">read_excel(deaths_path, range = cell_limits(c(5, 1), c(15, 6)))</pre>
</div>
</li>
</ul><p>If you have control over the sheet, an even better way is to create a “named range”. This is useful within Excel because named ranges make repeated formulas easier to create, and they have some useful properties for creating dynamic charts and graphs as well. Even if you’re not working in Excel, named ranges can be useful for identifying which cells to read into R. In the example above, the table we’re reading in is named <code>Table1</code>, so we can read it in with the following.</p>
@@ -369,7 +369,7 @@ Data not in cell values</h2>
Writing to Excel</h2>
<p>Let’s create a small data frame that we can then write out. Note that <code>item</code> is a factor and <code>quantity</code> is an integer.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">bake_sale &lt;- tibble(
<pre data-type="programlisting" data-code-language="r">bake_sale &lt;- tibble(
item = factor(c("brownie", "cupcake", "cookie")),
quantity = c(10L, 5L, 8L)
)
@@ -384,7 +384,7 @@ bake_sale
</div>
<p>You can write data back to disk as an Excel file using <code><a href="https://docs.ropensci.org/writexl/reference/write_xlsx.html">write_xlsx()</a></code> from the <strong>writexl</strong> package.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(writexl)
<pre data-type="programlisting" data-code-language="r">library(writexl)
write_xlsx(bake_sale, path = "data/bake-sale.xlsx")</pre>
</div>
<p><a href="#fig-bake-sale-excel" data-type="xref">#fig-bake-sale-excel</a> shows what the data looks like in Excel. Note that column names are included and bolded. These can be turned off by setting <code>col_names</code> and <code>format_headers</code> arguments to <code>FALSE</code>.</p>
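<p>As a minimal sketch of those two arguments, the block below writes the same kind of data without a header row and reads it back. The data frame is rebuilt here so the block runs on its own, and the temporary file path is an assumption made for illustration, so <code>data/bake-sale.xlsx</code> is left untouched.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readxl)
library(writexl)

# Rebuilt here so the example is self-contained
bake_sale &lt;- data.frame(
  item = factor(c("brownie", "cupcake", "cookie")),
  quantity = c(10L, 5L, 8L)
)

# Temporary file (an assumption for this sketch)
plain_path &lt;- tempfile(fileext = ".xlsx")

# Write the data with no column names and no bold header formatting
write_xlsx(bake_sale, path = plain_path, col_names = FALSE, format_headers = FALSE)

# No header row was written, so tell read_excel() not to look for one
plain &lt;- read_excel(plain_path, col_names = FALSE)
dim(plain)</pre>
</div>
<p>Because the file has no header row, <code>read_excel()</code> assigns placeholder column names, so this form suits data that will only be viewed in Excel rather than read back into R.</p>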
@@ -398,7 +398,7 @@ write_xlsx(bake_sale, path = "data/bake-sale.xlsx")</pre>
</div>
<p>Just like reading from a CSV, information on data type is lost when we read the data back in. This makes Excel files unreliable for caching interim results as well. For alternatives, see <a href="#sec-writing-to-a-file" data-type="xref">#sec-writing-to-a-file</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel("data/bake-sale.xlsx")
<pre data-type="programlisting" data-code-language="r">read_excel("data/bake-sale.xlsx")
#&gt; # A tibble: 3 × 2
#&gt; item quantity
#&gt; &lt;chr&gt; &lt;dbl&gt;
@@ -414,7 +414,7 @@ Formatted output</h2>
<p>The writexl package is a lightweight solution for writing a simple Excel spreadsheet, but if you’re interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the <strong>openxlsx</strong> package. Note that this package is not part of the tidyverse, so the functions and workflows may feel unfamiliar. For example, function names are camelCase, multiple functions can’t be composed in pipelines, and arguments are in a different order than they tend to be in the tidyverse. However, this is OK. As your R learning and usage expands outside of this book, you will encounter lots of different styles used in various R packages that you might need to use to accomplish specific goals in R. A good way of familiarizing yourself with the coding style used in a new package is to run the examples provided in function documentation to get a feel for the syntax and the output formats, as well as reading any vignettes that might come with the package.</p>
<p>Below we show how to write a spreadsheet with three sheets, one for each species of penguins in the <code>penguins</code> data frame.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(openxlsx)
<pre data-type="programlisting" data-code-language="r">library(openxlsx)
library(palmerpenguins)
# Create a workbook (spreadsheet)
@@ -444,7 +444,7 @@ writeDataTable(
</div>
<p>This creates a workbook object:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">penguins_species
<pre data-type="programlisting" data-code-language="r">penguins_species
#&gt; A Workbook object.
#&gt;
#&gt; Worksheets:
@@ -464,7 +464,7 @@ writeDataTable(
</div>
<p>And we can write this to disk with <code><a href="https://rdrr.io/pkg/openxlsx/man/saveWorkbook.html">saveWorkbook()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">saveWorkbook(penguins_species, "data/penguins-species.xlsx")</pre>
<pre data-type="programlisting" data-code-language="r">saveWorkbook(penguins_species, "data/penguins-species.xlsx")</pre>
</div>
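<p>One practical detail when rerunning this step: as far as we know, <code>saveWorkbook()</code> does not replace an existing file unless you pass <code>overwrite = TRUE</code>. The sketch below illustrates this with a tiny made-up workbook saved to a temporary path, both of which are assumptions for illustration rather than the <code>penguins_species</code> workbook built above.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(openxlsx)

# A tiny hypothetical workbook with a single sheet
wb &lt;- createWorkbook()
addWorksheet(wb, sheetName = "Adelie")
writeDataTable(wb, sheet = "Adelie", x = data.frame(species = "Adelie", year = 2007))

# Temporary output path (an assumption for this sketch)
out_path &lt;- tempfile(fileext = ".xlsx")

# First save: the file does not exist yet
saveWorkbook(wb, out_path)

# Later saves: explicitly allow replacing the existing file
saveWorkbook(wb, out_path, overwrite = TRUE)

# Confirm which sheets ended up in the saved file
getSheetNames(out_path)</pre>
</div>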
<p>The resulting spreadsheet is shown in <a href="#fig-penguins-species" data-type="xref">#fig-penguins-species</a>. By default, openxlsx formats the data as an Excel table.</p>
<div class="cell">