
# Data import {#sec-data-import}

```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```

## Introduction

Working with data provided by R packages is a great way to learn data science tools, but at some point you'll want to apply what you've learned to your own data.
In this chapter, you'll learn the basics of reading data files into R.

Specifically, this chapter will focus on reading plain-text rectangular files.
We'll start with practical advice for handling features like column names, types, and missing data.
You will then learn about reading data from multiple files at once and writing data from R to a file.
Finally, you'll learn how to handcraft data frames in R.

### Prerequisites

In this chapter, you'll learn how to load flat files in R with the **readr** package, which is part of the core tidyverse.

```{r}
#| label: setup
#| message: false

library(tidyverse)
```

## Reading data from a file

To begin, we'll focus on the most common rectangular data file type: CSV, which is short for comma-separated values.
Here is what a simple CSV file looks like.
The first row, commonly called the header row, gives the column names, and the following six rows provide the data.

```{r}
#| echo: false
#| message: false

read_lines("data/students.csv") |> cat(sep = "\n")
```

@tbl-students-table shows a representation of the same data as a table.

```{r}
#| label: tbl-students-table
#| echo: false
#| message: false
#| tbl-cap: Data from the students.csv file as a table.

read_csv("data/students.csv") |>
  knitr::kable()
```

We can read this file into R using `read_csv()`.
The first argument is the most important: it's the path to the file.

```{r}
#| message: true

students <- read_csv("data/students.csv")
```

When you run `read_csv()`, it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains).
It also prints out some information about retrieving the full column specification and how to quiet this message.
This message is an integral part of readr, and we'll return to it in @sec-col-types.

### Practical advice

Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis.
Let's take another look at the `students` data with that in mind.

In the `favourite.food` column, there are a bunch of food items, and then the character string `N/A`, which should have been a real `NA` that R will recognize as "not available".
This is something we can address using the `na` argument.

```{r}
#| message: false
students <- read_csv("data/students.csv", na = c("N/A", ""))

students
```

You might also notice that the `Student ID` and `Full Name` columns are surrounded by backticks.
That's because they contain spaces, breaking R's usual rules for variable names.
To refer to them, you need to use those backticks:

```{r}
students |> 
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`
  )
```

An alternative approach is to use `janitor::clean_names()`, which applies some heuristics to turn them all into snake case at once[^data-import-1].

[^data-import-1]: The [janitor](http://sfirke.github.io/janitor/) package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use `|>`.

```{r}
#| message: false

students |> janitor::clean_names()
```

Another common task after reading in data is to consider variable types.
For example, `meal_plan` is a categorical variable with a known set of possible values, which in R should be represented as a factor:

```{r}
students |>
  janitor::clean_names() |>
  mutate(
    meal_plan = factor(meal_plan)
  )
```

Note that the values in the `meal_plan` variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).
You'll learn more about factors in @sec-factors.

Before you analyze these data, you'll probably want to fix the `age` column.
Currently, it's a character variable because one of the observations is typed out as `five` instead of a numeric `5`.
We discuss the details of fixing this issue in @sec-import-spreadsheets.

```{r}
students <- students |>
  janitor::clean_names() |>
  mutate(
    meal_plan = factor(meal_plan),
    age = parse_number(if_else(age == "five", "5", age))
  )

students
```

### Other arguments

There are a couple of other important arguments that we need to mention, and they'll be easier to demonstrate if we first show you a handy trick: `read_csv()` can read text strings that you've created and formatted like a CSV file:

```{r}
#| message: false

read_csv(
  "a,b,c
  1,2,3
  4,5,6"
)
```
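Recent versions of readr (2.0 and later, as far as we're aware) also let you mark a string as literal data explicitly by wrapping it in `I()`, which removes any ambiguity between a file path and inline data.
A minimal sketch, assuming a recent readr; the name `inline` is just for illustration:

```{r}
#| message: false

library(tidyverse)

# I() marks the string as literal data rather than a path to a file
inline <- read_csv(I("a,b,c\n1,2,3\n4,5,6"))
inline
```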

Usually, `read_csv()` uses the first line of the data for the column names, which is a very common convention.
But it's not uncommon for a few lines of metadata to be included at the top of the file.
You can use `skip = n` to skip the first `n` lines or use `comment = "#"` to drop all lines that start with (e.g.) `#`:

```{r}
#| message: false

read_csv(
  "The first line of metadata
  The second line of metadata
  x,y,z
  1,2,3",
  skip = 2
)

read_csv(
  "# A comment I want to skip
  x,y,z
  1,2,3",
  comment = "#"
)
```

In other cases, the data might not have column names.
You can use `col_names = FALSE` to tell `read_csv()` not to treat the first row as headings and instead label them sequentially from `X1` to `Xn`:

```{r}
#| message: false

read_csv(
  "1,2,3
  4,5,6",
  col_names = FALSE
)
```

Alternatively, you can pass `col_names` a character vector which will be used as the column names:

```{r}
#| message: false

read_csv(
  "1,2,3
  4,5,6",
  col_names = c("x", "y", "z")
)
```

These arguments are all you need to know to read the majority of CSV files that you'll encounter in practice.
(For the rest, you'll need to carefully inspect your `.csv` file and read the documentation for `read_csv()`'s many other arguments.)

### Other file types

Once you've mastered `read_csv()`, using readr's other functions is straightforward; it's just a matter of knowing which function to reach for:

-   `read_csv2()` reads semicolon-separated files.
    These use `;` instead of `,` to separate fields and are common in countries that use `,` as the decimal marker.

-   `read_tsv()` reads tab-delimited files.

-   `read_delim()` reads in files with any delimiter, attempting to automatically guess the delimiter if you don't specify it.

-   `read_fwf()` reads fixed-width files.
    You can specify fields by their widths with `fwf_widths()` or by their positions with `fwf_positions()`.

-   `read_table()` reads a common variation of fixed-width files where columns are separated by white space.

-   `read_log()` reads Apache-style log files.
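As a rough illustration of how similar these functions are, the sketch below reads the same two rows from three differently delimited strings.
The inline data and the variable names are made up for this example, and `I()` is used to mark each string as literal data:

```{r}
#| message: false

library(tidyverse)

# Semicolon-separated fields, with "," as the decimal mark
semi <- read_csv2(I("x;y\n1,5;a\n2,5;b"))

# Tab-separated fields
tabs <- read_tsv(I("x\ty\n1.5\ta\n2.5\tb"))

# Any other delimiter, supplied explicitly through `delim`
colons <- read_delim(I("x:y\n1.5:a\n2.5:b"), delim = ":")

semi
```

All three calls should produce the same two-column tibble; only the function, and for `read_delim()` the `delim` argument, changes.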

### Exercises

1.  What function would you use to read a file where fields were separated with "\|"?

2.  Apart from `file`, `skip`, and `comment`, what other arguments do `read_csv()` and `read_tsv()` have in common?

3.  What are the most important arguments to `read_fwf()`?

4.  Sometimes strings in a CSV file contain commas.
    To prevent them from causing problems, they need to be surrounded by a quoting character, like `"` or `'`. By default, `read_csv()` assumes that the quoting character will be `"`.
    To read the following text into a data frame, what argument to `read_csv()` do you need to specify?

    ```{r}
    #| eval: false

    "x,y\n1,'a,b'"
    ```

5.  Identify what is wrong with each of the following inline CSV files.
    What happens when you run the code?

    ```{r}
    #| eval: false

    read_csv("a,b\n1,2,3\n4,5,6")
    read_csv("a,b,c\n1,2\n1,2,3,4")
    read_csv("a,b\n\"1")
    read_csv("a,b\n1,2\na,b")
    read_csv("a;b\n1;3")
    ```

6.  Practice referring to non-syntactic names in the following data frame by:

    a.  Extracting the variable called `1`.
    b.  Plotting a scatterplot of `1` vs. `2`.
    c.  Creating a new column called `3`, which is `2` divided by `1`.
    d.  Renaming the columns to `one`, `two`, and `three`.

    ```{r}
    annoying <- tibble(
      `1` = 1:10,
      `2` = `1` * 2 + rnorm(length(`1`))
    )
    ```

## Controlling column types {#sec-col-types}

A CSV file doesn't contain any information about the type of each variable (i.e. whether it's a logical, number, string, etc.), so readr will try to guess the type.
This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself.
Finally, we'll mention a few general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.

### Guessing types

readr uses a heuristic to figure out the column types.
For each column, it pulls the values of 1,000[^data-import-2] rows spaced evenly from the first row to the last, ignoring missing values.
It then works through the following questions:

[^data-import-2]: You can override the default of 1000 with the `guess_max` argument.

-   Does it contain only `F`, `T`, `FALSE`, or `TRUE` (ignoring case)? If so, it's a logical.
-   Does it contain only numbers (e.g. `1`, `-4.5`, `5e6`, `Inf`)? If so, it's a number.
-   Does it match the ISO8601 standard? If so, it's a date or date-time. (We'll return to date-times in more detail in @sec-creating-datetimes).
-   Otherwise, it must be a string.

You can see that behavior in action in this simple example:

```{r}
read_csv("
  logical,numeric,date,string
  TRUE,1,2021-01-15,abc
  false,4.5,2021-02-15,def
  T,Inf,2021-02-16,ghi"
)
```
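If you'd rather inspect the guesses directly than scan the printed tibble, `spec()` returns the column specification that readr settled on.
A small sketch on made-up inline data (the column names here are hypothetical):

```{r}
#| message: false

library(tidyverse)

guessed <- read_csv(
  I("flag,amount,day,label\nTRUE,1.5,2021-01-15,abc\nF,2,2021-02-15,def")
)

# Show the column specification that resulted from guessing
spec(guessed)
```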

This heuristic works well if you have a clean dataset, but in real life, you'll encounter a selection of weird and beautiful failures.

### Missing values, column types, and problems

The most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type.
One of the most common causes for this is a missing value, recorded using something other than the `NA` that readr expects.

Take this simple one-column CSV file as an example:

```{r}
csv <- "
  x
  10
  .
  20
  30"

```

If we read it without any additional arguments, `x` becomes a character column:

```{r}
df <- read_csv(csv)
```

In this very small case, you can easily see the missing value `.`.
But what happens if you have thousands of rows with only a few missing values represented by `.`s speckled among them?
One approach is to tell readr that `x` is a numeric column, and then see where it fails.
You can do that with the `col_types` argument, which takes a named list:

```{r}
df <- read_csv(csv, col_types = list(x = col_double()))
```

Now `read_csv()` reports that there was a problem, and tells us we can find out more with `problems()`:

```{r}
problems(df)
```

This tells us that there was a problem in row 3, col 1 where readr expected a double but got a `.`.
That suggests this dataset uses `.` for missing values.
So when we set `na = "."`, the automatic guessing succeeds, giving us the numeric column that we want:

```{r}
df <- read_csv(csv, na = ".")
```
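With a larger file, you may not want to read through every row of `problems()`.
Because it returns a tibble, one option is to tally the unexpected values to see which placeholders are in use.
The sketch below uses made-up data with two different placeholders, and assumes the `actual` column that current versions of readr include in the `problems()` output:

```{r}
#| message: false
#| warning: false

library(tidyverse)

messy <- read_csv(
  I("x\n10\n.\n20\n-\n.\n30"),
  col_types = cols(x = col_double())
)

# Count how often each unparseable value occurs
placeholder_counts <- problems(messy) |>
  count(actual, sort = TRUE)
placeholder_counts
```

Once you know the placeholders, you can pass them all at once, e.g. `na = c(".", "-")`.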

### Column types

readr provides a total of nine column types for you to use:

-   `col_logical()` and `col_double()` read logicals and real numbers. They're relatively rarely needed (except as above), since readr will usually guess them for you.
-   `col_integer()` reads integers. We seldom distinguish integers and doubles in this book because they're functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.
-   `col_character()` reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e. long series of digits that identifies some object, but it doesn't make sense to (e.g.) divide it in half.
-   `col_factor()`, `col_date()`, and `col_datetime()` create factors, dates, and date-times respectively; you'll learn more about those when we get to those data types in @sec-factors and @sec-dates-and-times.
-   `col_number()` is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You'll learn more about it in @sec-numbers.
-   `col_skip()` skips a column so it's not included in the result.
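To make a few of these concrete, here is a sketch that combines several column types in one call.
The data, the column names, and the `orders` object are invented for illustration:

```{r}
library(tidyverse)

orders <- read_csv(
  I('id,price,n,notes\n00123,"$1,200.50",3,ok\n00456,$15,7,late'),
  col_types = cols(
    id = col_character(), # keep the leading zeros of the identifier
    price = col_number(), # drop the currency symbol and grouping mark
    n = col_integer(),    # store as integer rather than double
    notes = col_skip()    # leave this column out of the result
  )
)

orders
```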

It's also possible to override the default column type by switching from `list()` to `cols()` and specifying `.default`:

```{r}
csv <- "
x,y,z
1,2,3"

read_csv(csv, col_types = cols(.default = col_character()))
```

Another useful helper is `cols_only()` which will read in only the columns you specify:

```{r}
read_csv(
  "x,y,z
  1,2,3",
  col_types = cols_only(x = col_character())
)
```

## Reading data from multiple files {#sec-readr-directory}

Sometimes your data is split across multiple files instead of being contained in a single file.
For example, you might have sales data for multiple months, with each month's data in a separate file: `01-sales.csv` for January, `02-sales.csv` for February, and `03-sales.csv` for March.
With `read_csv()` you can read these data in at once and stack them on top of each other in a single data frame.

```{r}
sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file")
```

With the additional `id` parameter we have added a new column called `file` to the resulting data frame that identifies the file the data come from.
This is especially helpful in circumstances where the files you're reading in do not have an identifying column that can help you trace the observations back to their original sources.

If you have many files you want to read in, it can get cumbersome to write out their names as a list.
Instead, you can use the base `list.files()` function to find the files for you by matching a pattern in the file names.
You'll learn more about these patterns in @sec-regular-expressions.

```{r}
sales_files <- list.files("data", pattern = "sales\\.csv$", full.names = TRUE)
sales_files
```
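The vector returned by `list.files()` can be passed straight to `read_csv()`.
The sketch below is self-contained: it first writes two small, made-up files to a temporary directory, so the directory name, file names, and contents are purely illustrative.

```{r}
library(tidyverse)

# Create a scratch directory with two tiny example files
demo_dir <- file.path(tempdir(), "sales-demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(tibble(month = "January", n = 1:2), file.path(demo_dir, "01-sales.csv"))
write_csv(tibble(month = "February", n = 3:4), file.path(demo_dir, "02-sales.csv"))

# Find the files by pattern, then read and stack them in one step
demo_files <- list.files(demo_dir, pattern = "sales\\.csv$", full.names = TRUE)
demo_sales <- read_csv(demo_files, id = "file", show_col_types = FALSE)
demo_sales
```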

## Writing to a file {#sec-writing-to-a-file}

readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`.
Both functions increase the chances of the output file being read back in correctly by using the standard UTF-8 encoding for strings and ISO8601 format for date-times.

The most important arguments are `x` (the data frame to save), and `file` (the location to save it).
You can also specify how missing values are written with `na`, and if you want to `append` to an existing file.

```{r}
#| eval: false

write_csv(students, "students.csv")
```

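Now let's read that CSV file back in.
Note that the variable type information that you just set up is lost when you save to CSV because you're starting over with reading from a plain-text file again: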

```{r}
#| warning: false
#| message: false

students
write_csv(students, "students-2.csv")
read_csv("students-2.csv")
```
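If you do stick with CSV, one way to get the types back is to supply `col_types` again when reading.
A minimal sketch, using a small made-up tibble and a temporary file rather than the `students` data:

```{r}
library(tidyverse)

meals <- tibble(
  id = 1:3,
  meal_plan = factor(c("Lunch only", "Breakfast and lunch", "Lunch only"))
)
meals_path <- tempfile(fileext = ".csv")
write_csv(meals, meals_path)

# Without a specification, meal_plan comes back as a character column
plain <- read_csv(meals_path, show_col_types = FALSE)

# Supplying the column types again restores the integer and the factor
restored <- read_csv(
  meals_path,
  col_types = cols(id = col_integer(), meal_plan = col_factor())
)
restored
```

Note that `col_factor()` without explicit `levels` takes the levels in order of appearance, so the level order may differ from the original factor.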

This makes CSVs a little unreliable for caching interim results---you need to recreate the column specification every time you load in.
There are two main alternatives:

1.  `write_rds()` and `read_rds()` are uniform wrappers around the base functions `readRDS()` and `saveRDS()`.
    These store data in R's custom binary format called RDS:

    ```{r}
    write_rds(students, "students.rds")
    read_rds("students.rds")
    ```

2.  The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages.
    We'll return to arrow in more depth in @sec-arrow.

    ```{r}
    #| eval: false

    library(arrow)
    write_parquet(students, "students.parquet")
    read_parquet("students.parquet")
    #> # A tibble: 6 × 5
    #>   student_id full_name        favourite_food     meal_plan             age
    #>        <dbl> <chr>            <chr>              <fct>               <dbl>
    #> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
    #> 2          2 Barclay Lynn     French fries       Lunch only              5
    #> 3          3 Jayendra Lyne    NA                 Breakfast and lunch     7
    #> 4          4 Leon Rossini     Anchovies          Lunch only             NA
    #> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
    #> 6          6 Güvenç Attila    Ice cream          Lunch only              6
    ```

Parquet tends to be much faster than RDS and is usable outside of R, but does require the arrow package.

```{r}
#| include: false
file.remove("students-2.csv")
file.remove("students.rds")
```

## Data entry

Sometimes you'll need to assemble a tibble "by hand", doing a little data entry in your R script.
There are two useful functions to help you do this, which differ in whether you lay out the tibble by columns or by rows.
`tibble()` works by column:

```{r}
tibble(
  x = c(1, 2, 5), 
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)
```

Note that every column in a tibble must be the same size, so you'll get an error if they're not:

```{r}
#| error: true

tibble(
  x = c(1, 2),
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.6)
)
```

Laying out the data by column can make it hard to see how the rows are related, so an alternative is `tribble()`, short for **tr**ansposed t**ibble**, which lets you lay out your data row by row.
`tribble()` is customized for data entry in code: column headings start with `~` and entries are separated by commas.
This makes it possible to lay out small amounts of data in an easy-to-read form:

```{r}
tribble(
  ~x, ~y, ~z,
  1, "h", 0.08,
  2, "m", 0.83,
  5, "g", 0.60
)
```

We'll use `tibble()` and `tribble()` later in the book to construct small examples to demonstrate how various functions work.

## Summary

In this chapter, you've learned how to load CSV files with `read_csv()` and to do your own data entry with `tibble()` and `tribble()`.
You've learned how CSV files work, some of the problems you might encounter, and how to overcome them.
We'll return to data import a few times in this book: @sec-import-spreadsheets covers Excel and Google Sheets, @sec-import-databases shows you how to load data from databases, @sec-arrow covers parquet files, @sec-rectangling covers JSON, and @sec-scraping covers websites.

We're just about at the end of this section of the book, but there's one important last topic to cover: how to get help.
So in the next chapter, you'll learn some good places to look for help, how to create a reprex to maximize your chances of getting good help, and some general advice on keeping up with the world of R.
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								# Data import {#sec-data-import}
-												Make sure first element is heading

											
										
										
											2015-12-12 02:34:20 +08:00
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								```{r}
 								#| results: "asis"
 								#| echo: false
 								source("_common.R")
-												Convert all statuses to complete

											
										
										
											2023-01-23 21:55:51 +08:00
+								status("complete")
-												Add chapter status

											
										
										
											2021-05-04 21:10:39 +08:00
+								```
-												Rough out import chapter

											
										
										
											2016-07-07 02:59:50 +08:00
+								## Introduction
-												Update import.Rmd

typos
											
										
										
											2016-01-27 21:33:35 +08:00
-												Typo fixes (#1204)

* Grammatical edits

* Grammatical edits

* Grammatical edits

* Remove ref since fn not mentioned earlier

* Typo fixes, closes #1191

* Add palmerpenguins, closes #1192

* Grammatical edits

* More grammatical edits

* Omit warning, closes #1193

* Fix link, closes #1197

* Grammatical edits

* Code style + clarify labs() args, closes #1199

* Fix year, closes #1200

* Use penguins instead, closes #1201
											
										
										
											2023-01-03 15:06:27 +08:00
+								Working with data provided by R packages is a great way to learn data science tools, but you want to apply what you've learned to your own data at some point.
-												Whole game edits (#1184)

* Reflect new part structure

* Mention all chapters

* Hide the ruler

* Crossref diagram

* Fix crossref

* Mention all import chapters

* Fix link to following chapter

* Fix title and summary

* Add intros

* Consistent chunk style?
											
										
										
											2022-12-16 14:41:10 +08:00
+								In this chapter, you'll learn the basics of reading data files into R.
 								Specifically, this chapter will focus on reading plain-text rectangular files.
-												Typo fixes (#1204)

* Grammatical edits

* Grammatical edits

* Grammatical edits

* Remove ref since fn not mentioned earlier

* Typo fixes, closes #1191

* Add palmerpenguins, closes #1192

* Grammatical edits

* More grammatical edits

* Omit warning, closes #1193

* Fix link, closes #1197

* Grammatical edits

* Code style + clarify labs() args, closes #1199

* Fix year, closes #1200

* Use penguins instead, closes #1201
											
										
										
											2023-01-03 15:06:27 +08:00
+								We'll start with practical advice for handling features like column names, types, and missing data.
-												Whole game edits (#1184)

* Reflect new part structure

* Mention all chapters

* Hide the ruler

* Crossref diagram

* Fix crossref

* Mention all import chapters

* Fix link to following chapter

* Fix title and summary

* Add intros

* Consistent chunk style?
											
										
										
											2022-12-16 14:41:10 +08:00
+								You will then learn about reading data from multiple files at once and writing data from R to a file.
-												Typo fixes (#1204)

* Grammatical edits

* Grammatical edits

* Grammatical edits

* Remove ref since fn not mentioned earlier

* Typo fixes, closes #1191

* Add palmerpenguins, closes #1192

* Grammatical edits

* More grammatical edits

* Omit warning, closes #1193

* Fix link, closes #1197

* Grammatical edits

* Code style + clarify labs() args, closes #1199

* Fix year, closes #1200

* Use penguins instead, closes #1201
											
										
										
											2023-01-03 15:06:27 +08:00
+								Finally, you'll learn how to handcraft data frames in R.
-												Update import.Rmd

typos
											
										
										
											2016-01-27 21:33:35 +08:00
-												Rough out import chapter

											
										
										
											2016-07-07 02:59:50 +08:00
+								### Prerequisites
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
-												Use the visual editor (#923)

* Sentence wrap + visual editor Rmd style updates

* Move files not called in bookdown.yml to extra

* UK spelling + canonical source + sentence wrap

* Rename .rmd -> .Rmd

* Sentence wrap + visual editor Rmd style

* Fix capitalization
											
										
										
											2021-02-21 23:40:40 +08:00
+								In this chapter, you'll learn how to load flat files in R with the **readr** package, which is part of the core tidyverse.
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
-												UK -> US spelling, multi-line alt text, YAML chunk opts

											
										
										
											2022-05-08 13:32:25 +08:00
+								```{r}
 								#| label: setup
 								#| message: false
-												Use tidyverse package

Fixes #451

											
										
										
											2016-10-04 01:30:24 +08:00
+								library(tidyverse)
-												Rough out import chapter

											
										
										
											2016-07-07 02:59:50 +08:00
+								```
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
-												Draft/outline of spreadsheets (#949)

* Add draft/outline of spreadsheets

* Finish reading from Excel

* Write to Excel + consistency edits

* Add pkgs for writing Excel files

* Release the 🐧s

* Reogranize to highlight tibble/data.frame diffs

* Write again

* Add spreadsheets reference + rename_with janitor

* Move janitor to TO DO in rectangular data

* Need to load tidyverse

* Show csv file

* Add TO DO note

* Use students instead of challenge file in this chp

* Quiet down some of the redundant messages

* Add bit on reading in multiple files

* Fix up students example

* Revert back to old dataset to match spreadsheets chapter

* Fix typos

* Use US spelling

* Comments from Hadley

* Incorporate Hadley's suggestions

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
											
										
										
											2021-10-27 03:00:33 +08:00
+								## Reading data from a file
-												Typo fixes (#1204)

* Grammatical edits

* Grammatical edits

* Grammatical edits

* Remove ref since fn not mentioned earlier

* Typo fixes, closes #1191

* Add palmerpenguins, closes #1192

* Grammatical edits

* More grammatical edits

* Omit warning, closes #1193

* Fix link, closes #1197

* Grammatical edits

* Code style + clarify labs() args, closes #1199

* Fix year, closes #1200

* Use penguins instead, closes #1201
											
										
										
											2023-01-03 15:06:27 +08:00
+								To begin, we'll focus on the most rectangular data file type: CSV, which is short for comma-separated values.
-												Eliminate parsing chapter (#1128)

Originally the plan was to have two chapters about reading text files, a brief introduction in the whole game, and then a more detailed exploration later in the book. This organisation didn't seem to work very well because the second chapter didn't have much content, so I've removed it, integrating its content elsewhere in the book:

* Column parsing types moved back into data-import
* Specifics of parsing various data types (e.g. `col_number()`, `col_date()`, and `col_factor()`) moved into the corresponding data type chapters.
* String encoding has moved to the strings chapter

While I was in here I also removed the unused `import-other.qmd`; we had planned to survey other options but I no longer think this is worth it.
											
										
										
											2022-11-17 23:56:08 +08:00
+								Here is what a simple CSV file looks like.
-												Typo fixes (#1204)

* Grammatical edits

* Grammatical edits

* Grammatical edits

* Remove ref since fn not mentioned earlier

* Typo fixes, closes #1191

* Add palmerpenguins, closes #1192

* Grammatical edits

* More grammatical edits

* Omit warning, closes #1193

* Fix link, closes #1197

* Grammatical edits

* Code style + clarify labs() args, closes #1199

* Fix year, closes #1200

* Use penguins instead, closes #1201
											
										
										
											2023-01-03 15:06:27 +08:00
+								The first row, commonly called the header row, gives the column names, and the following six rows provide the data.
-												Draft/outline of spreadsheets (#949)

* Add draft/outline of spreadsheets

* Finish reading from Excel

* Write to Excel + consistency edits

* Add pkgs for writing Excel files

* Release the 🐧s

* Reogranize to highlight tibble/data.frame diffs

* Write again

* Add spreadsheets reference + rename_with janitor

* Move janitor to TO DO in rectangular data

* Need to load tidyverse

* Show csv file

* Add TO DO note

* Use students instead of challenge file in this chp

* Quiet down some of the redundant messages

* Add bit on reading in multiple files

* Fix up students example

* Revert back to old dataset to match spreadsheets chapter

* Fix typos

* Use US spelling

* Comments from Hadley

* Incorporate Hadley's suggestions

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
											
										
										
											2021-10-27 03:00:33 +08:00
-												UK -> US spelling, multi-line alt text, YAML chunk opts

											
										
										
											2022-05-08 13:32:25 +08:00
+								```{r}
 								#| echo: false
 								#| message: false
-												Convert from %>% to |>

											
										
										
											2022-02-24 03:15:52 +08:00
+								read_lines("data/students.csv") |> cat(sep = "\n")
-												Draft/outline of spreadsheets (#949)

* Add draft/outline of spreadsheets

* Finish reading from Excel

* Write to Excel + consistency edits

* Add pkgs for writing Excel files

* Release the 🐧s

* Reogranize to highlight tibble/data.frame diffs

* Write again

* Add spreadsheets reference + rename_with janitor

* Move janitor to TO DO in rectangular data

* Need to load tidyverse

* Show csv file

* Add TO DO note

* Use students instead of challenge file in this chp

* Quiet down some of the redundant messages

* Add bit on reading in multiple files

* Fix up students example

* Revert back to old dataset to match spreadsheets chapter

* Fix typos

* Use US spelling

* Comments from Hadley

* Incorporate Hadley's suggestions

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
											
										
										
											2021-10-27 03:00:33 +08:00
+								```
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								@tbl-students-table shows a representation of the same data as a table.
-												Draft/outline of spreadsheets (#949)

* Add draft/outline of spreadsheets

* Finish reading from Excel

* Write to Excel + consistency edits

* Add pkgs for writing Excel files

* Release the 🐧s

* Reogranize to highlight tibble/data.frame diffs

* Write again

* Add spreadsheets reference + rename_with janitor

* Move janitor to TO DO in rectangular data

* Need to load tidyverse

* Show csv file

* Add TO DO note

* Use students instead of challenge file in this chp

* Quiet down some of the redundant messages

* Add bit on reading in multiple files

* Fix up students example

* Revert back to old dataset to match spreadsheets chapter

* Fix typos

* Use US spelling

* Comments from Hadley

* Incorporate Hadley's suggestions

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
											
										
										
											2021-10-27 03:00:33 +08:00
-												UK -> US spelling, multi-line alt text, YAML chunk opts

											
										
										
											2022-05-08 13:32:25 +08:00
```{r}
#| label: tbl-students-table
#| echo: false
#| message: false
#| tbl-cap: Data from the students.csv file as a table.

read_csv("data/students.csv") |>
  knitr::kable()
```

We can read this file into R using `read_csv()`.
The first argument is the most important: it's the path to the file.

```{r}
#| message: true

students <- read_csv("data/students.csv")
```

When you run `read_csv()`, it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains).
It also prints out some information about retrieving the full column specification and how to quiet this message.
This message is an integral part of readr, and we'll return to it in @sec-col-types.

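As a minimal sketch of the two options that message mentions, the example below reads a small inline CSV (hypothetical data, not the students file), passes `show_col_types = FALSE` to quiet the message, and then calls `spec()` to retrieve the full column specification.

```{r}
library(readr)

# Hypothetical inline data standing in for a real file;
# I() marks the string as literal data rather than a path.
csv_text <- I("id,name\n1,Ada\n2,Grace")

# show_col_types = FALSE suppresses the column specification message.
people <- read_csv(csv_text, show_col_types = FALSE)

# spec() returns the column specification that readr guessed.
spec(people)
```

Quieting the message is mostly useful in scripts and reports; while exploring a new file interactively, the message is a quick check that each column was read as the type you expected.
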
### Practical advice

Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis.
Let's take another look at the `students` data with that in mind.

In the `favourite.food` column, there are a bunch of food items, and then the character string `N/A`, which should have been a real `NA` that R will recognize as "not available".
This is something we can address using the `na` argument.

```{r}
#| message: false

students <- read_csv("data/students.csv", na = c("N/A", ""))

students
```

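To see the effect of `na` in isolation, here is a small sketch on made-up inline data (the names and foods are hypothetical): both the string `N/A` and an empty field are listed in `na`, so both should be read as missing.

```{r}
library(readr)

# Hypothetical data: "N/A" and an empty field both stand for missing values.
csv_text <- I("name,food\nAda,N/A\nGrace,\nLinus,Pizza")

foods <- read_csv(csv_text, na = c("N/A", ""), show_col_types = FALSE)

# Check which entries of the food column were read as NA.
is.na(foods$food)
```

Because `na` replaces the default set of missing-value strings rather than adding to it, the empty string is listed explicitly alongside `"N/A"`.
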
You might also notice that the `Student ID` and `Full Name` columns are surrounded by backticks.
That's because they contain spaces, breaking R's usual rules for variable names.
To refer to them, you need to use those backticks:

```{r}
students |>
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`
  )
```

An alternative approach is to use `janitor::clean_names()`, which applies some heuristics to turn them all into snake case at once[^data-import-1].

[^data-import-1]: The [janitor](http://sfirke.github.io/janitor/) package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use `|>`.

```{r}
#| message: false

students |> janitor::clean_names()
```

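If you'd rather not add a dependency, a rough approximation of this kind of cleaning can be sketched with `dplyr::rename_with()`.
The helper `to_snake()` below is hypothetical and far simpler than janitor's actual heuristics: it only lower-cases names and replaces runs of other characters with underscores.

```{r}
library(dplyr)

# Hypothetical tibble with non-syntactic column names.
df <- tibble(`Student ID` = 1:2, `Full Name` = c("Ada", "Grace"))

# Hypothetical helper: lower-case each name, then replace every run of
# characters other than a-z and 0-9 with a single underscore.
to_snake <- function(x) gsub("[^a-z0-9]+", "_", tolower(x))

# rename_with() applies the function to every column name.
cleaned <- df |> rename_with(to_snake)

names(cleaned)
```
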
Another common task after reading in data is to consider variable types.
For example, `meal_plan` is a categorical variable with a known set of possible values, which in R should be represented as a factor:

```{r}
students |>
  janitor::clean_names() |>
  mutate(
    meal_plan = factor(meal_plan)
  )
```

Note that the values in the `meal_plan` variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).
You'll learn more about factors in @sec-factors.

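When the set of possible values really is known in advance, you can state it explicitly with the `levels` argument of `factor()`.
The sketch below uses made-up meal plan values (the categories in the real file may differ) and shows that a value outside the declared levels becomes `NA`.

```{r}
# Hypothetical values; the real meal plan categories may differ.
meal_plan <- c("Lunch only", "Breakfast and lunch", "Lunch only")

# Supplying levels fixes the set of allowed values and their order.
meal_plan_fct <- factor(
  meal_plan,
  levels = c("Breakfast and lunch", "Lunch only")
)

levels(meal_plan_fct)

# A value that is not one of the declared levels is converted to NA.
unknown <- factor("Dinner only", levels = levels(meal_plan_fct))
is.na(unknown)
```
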
Before you analyze these data, you'll probably want to fix the `age` column.
Currently, it's a character variable because one of the observations is typed out as `five` instead of a numeric `5`.
We discuss the details of fixing this issue in @sec-import-spreadsheets.

```{r}
students <- students |>
  janitor::clean_names() |>
  mutate(
    meal_plan = factor(meal_plan),
    age = parse_number(if_else(age == "five", "5", age))
  )

students
```

### Other arguments

There are a couple of other important arguments that we need to mention, and they'll be easier to demonstrate if we first show you a handy trick: `read_csv()` can read CSV files that you've created in a string:

```{r}
#| message: false

read_csv(
  "a,b,c
  1,2,3
  4,5,6"
)
```

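readr treats a string as literal data when it contains a newline, as above, or when it is wrapped in `I()`.
The following sketch shows the `I()` form on the same toy data, which makes the intent explicit and also works for strings that might otherwise be mistaken for a file path.

```{r}
library(readr)

# I() marks the string as literal data rather than a file path.
abc <- read_csv(I("a,b,c\n1,2,3\n4,5,6"), show_col_types = FALSE)

abc
```
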
+								Usually, `read_csv()` uses the first line of the data for the column names, which is a very common convention.
 								But it's not uncommon for a few lines of metadata to be included at the top of the file.
-												Eliminate parsing chapter (#1128)

Originally the plan was to have two chapters about reading text files, a brief introduction in the whole game, and then a more detailed exploration later in the book. This organisation didn't seem to work very well because the second chapter didn't have much content, so I've removed it, integrating its content elsewhere in the book:

* Column parsing types moved back into data-import
* Specifics of parsing various data types (e.g. `col_number()`, `col_date()`, and `col_factor()`) moved into the corresponding data type chapters.
* String encoding has moved to the strings chapter

While I was in here I also removed the unused `import-other.qmd`; we had planned to survey other options but I no longer think this is worth it.
											
										
										
											2022-11-17 23:56:08 +08:00
+								You can use `skip = n` to skip the first `n` lines or use `comment = "#"` to drop all lines that start with (e.g.) `#`:
-												Draft/outline of spreadsheets (#949)

* Add draft/outline of spreadsheets

* Finish reading from Excel

* Write to Excel + consistency edits

* Add pkgs for writing Excel files

* Release the 🐧s

* Reogranize to highlight tibble/data.frame diffs

* Write again

* Add spreadsheets reference + rename_with janitor

* Move janitor to TO DO in rectangular data

* Need to load tidyverse

* Show csv file

* Add TO DO note

* Use students instead of challenge file in this chp

* Quiet down some of the redundant messages

* Add bit on reading in multiple files

* Fix up students example

* Revert back to old dataset to match spreadsheets chapter

* Fix typos

* Use US spelling

* Comments from Hadley

* Incorporate Hadley's suggestions

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
											
										
										
											2021-10-27 03:00:33 +08:00
-												UK -> US spelling, multi-line alt text, YAML chunk opts

											
										
										
```{r}
#| message: false

read_csv(
  "The first line of metadata
  The second line of metadata
  x,y,z
  1,2,3",
  skip = 2
)

read_csv(
  "# A comment I want to skip
  x,y,z
  1,2,3",
  comment = "#"
)
```

In other cases, the data might not have column names.
You can use `col_names = FALSE` to tell `read_csv()` not to treat the first row as headings and instead label the columns sequentially from `X1` to `Xn`:

```{r}
#| message: false

read_csv(
  "1,2,3
  4,5,6",
  col_names = FALSE
)
```

Alternatively, you can pass `col_names` a character vector which will be used as the column names:

```{r}
#| message: false

read_csv(
  "1,2,3
  4,5,6",
  col_names = c("x", "y", "z")
)
```

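The `skip` and `col_names` arguments can also be combined.
As a minimal sketch, assuming a file that already has a header row whose names you would rather replace, you can skip that row and supply your own names.
The inline data and the `renamed` object are hypothetical, and `I()` (available in readr 2.0 and later) marks the string as literal data rather than a file path:

```{r}
#| message: false

library(readr)

# Hypothetical inline data: the header row uses names we want to replace.
renamed <- read_csv(
  I("First Column,Second Column\n1,2\n3,4"),
  skip = 1,                  # drop the original header row
  col_names = c("x", "y")    # supply replacement names
)

renamed
```

Without `skip = 1`, the original header would be read as the first row of data, which would turn both columns into character vectors.
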
These arguments are all you need to know to read the majority of CSV files that you'll encounter in practice.
(For the rest, you'll need to carefully inspect your `.csv` file and read the documentation for `read_csv()`'s many other arguments.)

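One way to inspect a file, sketched here under the assumption that you simply want to see its raw text, is to print the first few lines with `read_lines()` before choosing any arguments.
The file contents and the `path` and `first_lines` names below are hypothetical and exist only for illustration:

```{r}
library(readr)

# Hypothetical file, written to a temporary location purely for illustration.
path <- tempfile(fileext = ".csv")
write_lines(
  c("Exported by a hypothetical tool", "x;y", "1;2"),
  path
)

# Look at the first few raw lines before deciding how to read the file.
first_lines <- read_lines(path, n_max = 3)
first_lines
```

Here the raw lines suggest skipping one line of metadata and treating `;`, not `,`, as the field separator.
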
### Other file types

Once you've mastered `read_csv()`, using readr's other functions is straightforward; it's just a matter of knowing which function to reach for:

-   `read_csv2()` reads semicolon-separated files.
    These use `;` instead of `,` to separate fields and are common in countries that use `,` as the decimal marker.

-   `read_tsv()` reads tab-delimited files.

-   `read_delim()` reads in files with any delimiter, attempting to automatically guess the delimiter if you don't specify it.

-   `read_fwf()` reads fixed-width files.
    You can specify fields by their widths with `fwf_widths()` or by their positions with `fwf_positions()`.

-   `read_table()` reads a common variation of fixed-width files where columns are separated by white space.

-   `read_log()` reads Apache-style log files.

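As a rough illustration of how similar these functions are to `read_csv()`, the sketch below reads the same kind of small table from semicolon-separated and tab-separated text.
The inline data and the object names `semicolons` and `tabs` are hypothetical, and `I()` assumes readr 2.0 or later:

```{r}
#| message: false

library(readr)

# Hypothetical inline data; I() marks each string as literal data, not a path.
# read_csv2() treats ";" as the field separator and "," as the decimal mark.
semicolons <- read_csv2(I("x;y\n1,5;2,5"))

# read_tsv() treats the tab character as the field separator.
tabs <- read_tsv(I("x\ty\n1\t2"))

semicolons
tabs
```

In both cases the remaining arguments, such as `skip`, `comment`, and `col_names`, behave as they do for `read_csv()`.
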
### Exercises

1.  What function would you use to read a file where fields were separated with "\|"?

2.  Apart from `file`, `skip`, and `comment`, what other arguments do `read_csv()` and `read_tsv()` have in common?

3.  What are the most important arguments to `read_fwf()`?

4.  Sometimes strings in a CSV file contain commas.
    To prevent them from causing problems, they need to be surrounded by a quoting character, like `"` or `'`. By default, `read_csv()` assumes that the quoting character will be `"`.
    To read the following text into a data frame, what argument to `read_csv()` do you need to specify?

    ```{r}
    #| eval: false

    "x,y\n1,'a,b'"
    ```

5.  Identify what is wrong with each of the following inline CSV files.
    What happens when you run the code?

    ```{r}
    #| eval: false

    read_csv("a,b\n1,2,3\n4,5,6")
    read_csv("a,b,c\n1,2\n1,2,3,4")
    read_csv("a,b\n\"1")
    read_csv("a,b\n1,2\na,b")
    read_csv("a;b\n1;3")
    ```

6.  Practice referring to non-syntactic names in the following data frame by:

    a.  Extracting the variable called `1`.
    b.  Plotting a scatterplot of `1` vs. `2`.
    c.  Creating a new column called `3`, which is `2` divided by `1`.
    d.  Renaming the columns to `one`, `two`, and `three`.

    ```{r}
    annoying <- tibble(
      `1` = 1:10,
      `2` = `1` * 2 + rnorm(length(`1`))
    )
    ```

## Controlling column types {#sec-col-types}

A CSV file doesn't contain any information about the type of each variable (i.e. whether it's a logical, number, string, etc.), so readr will try to guess the type.
This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself.
Finally, we'll mention a few general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.

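To make the idea of guessing concrete, here is a minimal sketch that reads a small table and then asks readr which column types it settled on.
The inline data and the `guessed` object are hypothetical, and `I()` assumes readr 2.0 or later:

```{r}
#| message: false

library(readr)

# Hypothetical inline data with one column of each commonly guessed type.
guessed <- read_csv(
  I("flag,amount,day,label\nTRUE,1.5,2023-01-03,a\nFALSE,2,2023-01-04,b")
)

# spec() reports the column specification that readr arrived at.
spec(guessed)
```

No types were supplied in the call, so every type in the specification is the result of readr's guess.
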
											2022-11-17 23:56:08 +08:00

### Guessing types

readr uses a heuristic to figure out the column types.
For each column, it pulls the values of 1,000[^data-import-2] rows spaced evenly from the first row to the last, ignoring missing values.
It then works through the following questions:

[^data-import-2]: You can override the default of 1000 with the `guess_max` argument.

-   Does it contain only `F`, `T`, `FALSE`, or `TRUE` (ignoring case)? If so, it's a logical.
-   Does it contain only numbers (e.g. `1`, `-4.5`, `5e6`, `Inf`)? If so, it's a number.
-   Does it match the ISO8601 standard? If so, it's a date or date-time. (We'll return to date-times in more detail in @sec-creating-datetimes).
-   Otherwise, it must be a string.

You can see that behavior in action in this simple example:

```{r}
read_csv("
  logical,numeric,date,string
  TRUE,1,2021-01-15,abc
  false,4.5,2021-02-15,def
  T,Inf,2021-02-16,ghi"
)
```

This heuristic works well if you have a clean dataset, but in real life, you'll encounter a selection of weird and beautiful failures.
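
One way to check what the heuristic decided is to inspect the column specification attached to the result.
The following is a minimal sketch that reuses the inline data from the example above; `guessed` is just an illustrative name, and `I()` marks the string as literal data rather than a file path.

```{r}
library(tidyverse) # already loaded at the start of the chapter

# Same inline data as above; I() marks the string as literal data, not a path.
guessed <- read_csv(
  I("logical,numeric,date,string
TRUE,1,2021-01-15,abc
false,4.5,2021-02-15,def
T,Inf,2021-02-16,ghi"),
  show_col_types = FALSE
)

# The full column specification that readr settled on.
spec(guessed)

# The first class of each resulting column.
map_chr(guessed, \(col) class(col)[1])
```

If a guess is not what you want, the following sections show how to override it.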

### Missing values, column types, and problems

The most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type.
One of the most common causes for this is a missing value, recorded using something other than the `NA` that readr expects.
Take this simple one-column CSV file as an example:

```{r}
csv <- "
  x
  10
  .
  20
  30"
```

If we read it without any additional arguments, `x` becomes a character column:

```{r}
df <- read_csv(csv)
```

In this very small case, you can easily see the missing value `.`.
But what happens if you have thousands of rows with only a few missing values represented by `.`s speckled among them?
One approach is to tell readr that `x` is a numeric column, and then see where it fails.
You can do that with the `col_types` argument, which takes a named list:

```{r}
df <- read_csv(csv, col_types = list(x = col_double()))
```

Now `read_csv()` reports that there was a problem, and tells us we can find out more with `problems()`:

```{r}
problems(df)
```

This tells us that there was a problem in row 3, col 1 where readr expected a double but got a `.`.
That suggests this dataset uses `.` for missing values.
If we then set `na = "."`, the automatic guessing succeeds, giving us the numeric column that we want:

```{r}
df <- read_csv(csv, na = ".")
```
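
Real files sometimes mix several missing-value markers.
As a rough sketch, `na` also accepts a character vector, so each marker can be listed; the inline data and the `N/A` marker here are made up for illustration.

```{r}
library(tidyverse) # already loaded at the start of the chapter

# Hypothetical data that uses both "." and "N/A" for missing values.
mixed <- read_csv(
  I("x\n10\n.\nN/A\n30"),
  na = c(".", "N/A"),
  show_col_types = FALSE
)
mixed
```

Note that supplying `na` replaces the default of `c("", "NA")`, so include those strings too if your file also uses them.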

### Column types

readr provides a total of nine column types for you to use:

-   `col_logical()` and `col_double()` read logicals and real numbers. They're relatively rarely needed (except as above), since readr will usually guess them for you.
-   `col_integer()` reads integers. We seldom distinguish integers and doubles in this book because they're functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.
-   `col_character()` reads strings. This can be useful to specify explicitly when you have a column that is a numeric identifier, i.e., a long series of digits that identifies an object but doesn't make sense to apply mathematical operations to.
-   `col_factor()`, `col_date()`, and `col_datetime()` create factors, dates, and date-times respectively; you'll learn more about those when we get to those data types in @sec-factors and @sec-dates-and-times.
-   `col_number()` is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You'll learn more about it in @sec-numbers.
-   `col_skip()` skips a column so it's not included in the result.

It's also possible to override the default column type by switching from `list()` to `cols()` and setting `.default`:

```{r}
csv <- "
x,y,z
1,2,3"
read_csv(csv, col_types = cols(.default = col_character()))
```

Another useful helper is `cols_only()`, which will read in only the columns you specify:

```{r}
read_csv(
  "x,y,z
1,2,3",
  col_types = cols_only(x = col_character())
)
```
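
To see several of these column types working together, here is a small sketch with made-up data: an identifier with leading zeros, a price with a currency symbol, and a column we don't need.
The column names and values are purely illustrative.

```{r}
library(tidyverse) # already loaded at the start of the chapter

# Hypothetical inline data; the quoted price contains a grouping comma.
orders <- read_csv(
  I('id,price,notes
00123,"$1,200.50",first order
00456,$35,second order'),
  col_types = cols(
    id = col_character(), # keep the leading zeros
    price = col_number(), # ignore "$" and ","
    notes = col_skip()    # leave this column out of the result
  )
)
orders
```

Had `id` been left to the guessing heuristic, it would likely have been read as a number and lost its leading zeros.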

## Reading data from multiple files {#sec-readr-directory}

Sometimes your data is split across multiple files instead of being contained in a single file.
For example, you might have sales data for multiple months, with each month's data in a separate file: `01-sales.csv` for January, `02-sales.csv` for February, and `03-sales.csv` for March.
With `read_csv()` you can read these data in at once and stack them on top of each other in a single data frame.

```{r}
sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file")
```

The `id` argument adds a new column called `file` to the resulting data frame that identifies the file the data come from.
This is especially helpful in circumstances where the files you're reading in do not have an identifying column that can help you trace the observations back to their original sources.
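
Because the sales files above may not exist on your machine, the sketch below writes two small, made-up files to a temporary directory and reads them back in one call.
The file contents and object names are assumptions for illustration only.

```{r}
library(tidyverse) # already loaded at the start of the chapter

# Write two tiny, hypothetical monthly files to a temporary directory.
sales_dir <- file.path(tempdir(), "sales-id-demo")
dir.create(sales_dir, showWarnings = FALSE)
jan_path <- file.path(sales_dir, "01-sales.csv")
feb_path <- file.path(sales_dir, "02-sales.csv")
write_csv(tibble(item = c("a", "b"), n = c(3, 5)), jan_path)
write_csv(tibble(item = "a", n = 7), feb_path)

# Read both files at once; `file` records where each row came from.
sales <- read_csv(c(jan_path, feb_path), id = "file", show_col_types = FALSE)

# Count the rows contributed by each file.
sales |> count(file = basename(file))
```

The `file` column stores the path exactly as it was passed in, so `basename()` is used here only to shorten the display.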

If you have many files you want to read in, it can get cumbersome to write out their names by hand.
Instead, you can use the base `list.files()` function to find the files for you by matching a pattern in the file names.
You'll learn more about these patterns in @sec-regular-expressions.

```{r}
sales_files <- list.files("data", pattern = "sales\\.csv$", full.names = TRUE)
sales_files
```
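
To see how the pattern narrows the match, here is a minimal sketch using a temporary directory with made-up file names.
Only names ending in `sales.csv` should be returned; the other files are hypothetical decoys.

```{r}
# A temporary directory with two sales files and two unrelated files.
demo_dir <- file.path(tempdir(), "sales-pattern-demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(
  demo_dir,
  c("01-sales.csv", "02-sales.csv", "sales.csv.bak", "notes.txt")
))

# `\\.` matches a literal dot and `$` anchors the match to the end of the name.
matched <- list.files(demo_dir, pattern = "sales\\.csv$", full.names = TRUE)
basename(matched)
```

Without the `$` anchor, the backup file `sales.csv.bak` would also match, which is why the pattern in the text ends with it.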

## Writing to a file {#sec-writing-to-a-file}

readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`.
Both functions increase the chances of the output file being read back in correctly by using the standard UTF-8 encoding for strings and ISO8601 format for date-times.
The most important arguments are `x` (the data frame to save) and `file` (the location to save it).
You can also specify how missing values are written with `na`, and whether to `append` to an existing file.

```{r}
#| eval: false
write_csv(students, "students.csv")
```
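
The `na` and `append` arguments are easiest to understand by looking at the raw text they produce.
This sketch writes to a temporary file so nothing in your project is touched; the data are made up.

```{r}
library(tidyverse) # already loaded at the start of the chapter

path <- tempfile(fileext = ".csv")

# Write missing values as "." instead of the default "NA".
write_csv(tibble(x = c(1, NA), y = c("a", "b")), path, na = ".")

# Add one more row to the same file; no second header row is written.
write_csv(tibble(x = 3, y = "c"), path, append = TRUE)

# Inspect the raw lines of the file.
read_lines(path)
```

When you append, make sure the new rows have the same columns in the same order, since the header is not checked or rewritten.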

Now let's read that CSV file back in.
Note that the type information is lost when you save to CSV:

```{r}
#| warning: false
#| message: false
students
write_csv(students, "students-2.csv")
read_csv("students-2.csv")
```

This makes CSVs a little unreliable for caching interim results: you need to recreate the column specification every time you load the file.
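
As a sketch of what recreating the column specification can look like, the example below round-trips a small, made-up data frame through a temporary CSV file.
The object and column names are illustrative.

```{r}
library(tidyverse) # already loaded at the start of the chapter

# A hypothetical data frame with a factor and an integer column.
original <- tibble(grp = factor(c("a", "b", "a")), n = 1:3)
path <- tempfile(fileext = ".csv")
write_csv(original, path)

# Read back with guessing: the factor returns as character, the integer as double.
reread <- read_csv(path, show_col_types = FALSE)

# Read back with an explicit specification to restore the original types.
restored <- read_csv(
  path,
  col_types = cols(grp = col_factor(), n = col_integer())
)
restored
```

This works, but the specification has to travel with every script that reads the file.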

There are two main alternatives:

1.  `write_rds()` and `read_rds()` are uniform wrappers around the base functions `saveRDS()` and `readRDS()`.
    These store data in R's custom binary format called RDS:

    ```{r}
    write_rds(students, "students.rds")
    read_rds("students.rds")
    ```
+.  The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages.
    We'll return to arrow in more depth in @sec-arrow.

    ```{r}
    #| eval: false
    library(arrow)
    write_parquet(students, "students.parquet")
    read_parquet("students.parquet")
    #> # A tibble: 6 × 5
    #>   student_id full_name        favourite_food     meal_plan             age
    #>        <dbl> <chr>            <chr>              <fct>               <dbl>
    #> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
    #> 2          2 Barclay Lynn     French fries       Lunch only              5
    #> 3          3 Jayendra Lyne    NA                 Breakfast and lunch     7
    #> 4          4 Leon Rossini     Anchovies          Lunch only             NA
    #> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
    #> 6          6 Güvenç Attila    Ice cream          Lunch only              6
    ```

Parquet tends to be much faster than RDS and is usable outside of R, but does require the arrow package.
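How large the difference is depends on the data, the compression settings, and the machine, so it can be worth measuring on your own files.
The sketch below is one rough way to do that, not a benchmark: it assumes the arrow package is installed, and the data frame `df` and the temporary file paths are made up for illustration rather than taken from this chapter.

```{r}
#| eval: false

library(readr)

# Hypothetical example data, larger than `students` so timings are measurable.
n <- 1e5
df <- data.frame(
  id = seq_len(n),
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  value = rnorm(n)
)

rds_path <- tempfile(fileext = ".rds")
parquet_path <- tempfile(fileext = ".parquet")

# Time a write followed by a read for each format.
system.time({
  write_rds(df, rds_path)
  read_rds(rds_path)
})

system.time({
  arrow::write_parquet(df, parquet_path)
  arrow::read_parquet(parquet_path)
})

# Compare the size on disk, in bytes (RDS first, then parquet).
file.size(rds_path, parquet_path)

# Clean up the temporary files.
unlink(c(rds_path, parquet_path))
```

Note that `write_rds()` does not compress by default, so file sizes and timings will shift if you change its `compress` argument.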
										
```{r}
#| include: false
file.remove("students-2.csv")
file.remove("students.rds")
```

## Data entry

Sometimes you'll need to assemble a tibble "by hand", doing a little data entry in your R script.
There are two useful functions to help you do this, which differ in whether you lay out the tibble by columns or by rows.
`tibble()` works by column:

```{r}
tibble(
  x = c(1, 2, 5),
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)
```

Note that every column in a tibble must be the same size, so you'll get an error if they're not:

```{r}
#| error: true

tibble(
  x = c(1, 2),
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.6)
)
```

Laying out the data by column can make it hard to see how the rows are related, so an alternative is `tribble()`, short for **tr**ansposed t**ibble**, which lets you lay out your data row by row.
`tribble()` is customized for data entry in code: column headings start with `~` and entries are separated by commas.
This makes it possible to lay out small amounts of data in an easy-to-read form:

```{r}
tribble(
  ~x, ~y, ~z,
  "h", 1, 0.08,
  "m", 2, 0.83,
  "g", 5, 0.60,
)
```

We'll use `tibble()` and `tribble()` later in the book to construct small examples to demonstrate how various functions work.
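As a quick check on how the two functions relate, here is a minimal sketch that enters the same small table both ways and compares the results; the names `by_col` and `by_row` are illustrative and not used elsewhere in the book.

```{r}
library(tibble)

# Column-wise entry: one vector per column.
by_col <- tibble(
  x = c(1, 2, 5),
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)

# Row-wise entry: one line per row, holding the same values.
by_row <- tribble(
  ~x, ~y,  ~z,
  1,  "h", 0.08,
  2,  "m", 0.83,
  5,  "g", 0.60
)

# Compare the two tibbles column by column.
all.equal(by_col, by_row)
```

If the two layouts describe the same data, `all.equal()` should report no differences, so the choice between them comes down to which is easier to read and type.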
										
## Summary

In this chapter, you've learned how to load CSV files with `read_csv()` and to do your own data entry with `tibble()` and `tribble()`.
You've learned how CSV files work, some of the problems you might encounter, and how to overcome them.
We'll come back to data import a few times in this book: @sec-import-spreadsheets covers Excel and Google Sheets, @sec-import-databases shows you how to load data from databases, @sec-arrow covers parquet files, @sec-rectangling covers JSON, and @sec-scraping covers websites.

We're just about at the end of this section of the book, but there's one important last topic to cover: how to get help.
So in the next chapter, you'll learn some good places to look for help, how to create a reprex to maximize your chances of getting good help, and some general advice on keeping up with the world of R.