Mild import/wrangling reorg
parent 23bfba6809
commit 8f7748dcb1
@@ -14,5 +14,5 @@ libs
_main.*
tmp-pdfcrop-*
figures

/.quarto/
site_libs

2
EDA.qmd
@@ -81,7 +81,7 @@ To make the discussion easier, let's define some terms:
Tabular data is *tidy* if each value is placed in its own "cell", each variable in its own column, and each observation in its own row.

So far, all of the data that you've seen has been tidy.
In real-life, most data isn't tidy, so we'll come back to these ideas again in [Chapter -@sec-list-columns] and [Chapter -@sec-rectangle-data].
In real-life, most data isn't tidy, so we'll come back to these ideas again in @sec-rectangling.

## Variation

13
_quarto.yml
@@ -65,14 +65,13 @@ book:
- missing-values.qmd
- column-wise.qmd

- part: import.qmd
- part: wrangle.qmd
chapters:
- import-rectangular.qmd
- import-spreadsheets.qmd
- import-databases.qmd
- rectangle.qmd
- import-webscrape.qmd
- import-other.qmd
- parsing.qmd
- spreadsheets.qmd
- databases.qmd
- rectangling.qmd
- webscraping.qmd

- part: program.qmd
chapters:

@@ -11,8 +11,7 @@ status("polishing")

Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data.
In this chapter, you'll learn how to read plain-text rectangular files into R.
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data.
We'll finish with a few pointers to packages that are useful for other types of data.
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data, which we'll come back to in @sec-wrangle.

### Prerequisites

@@ -320,33 +319,10 @@ There are two alternatives:
```

Feather tends to be faster than RDS and is usable outside of R.
RDS supports list-columns (which you'll learn about in [Chapter -@sec-list-columns]; feather currently does not.
RDS supports list-columns (which you'll learn about in @sec-rectangling; feather currently does not.

```{r}
#| include: false

file.remove("students-2.csv")
file.remove("students.rds")
```

## Other types of data

To get other types of data into R, we recommend starting with the tidyverse packages listed below.
They're certainly not perfect, but they are a good place to start.
For rectangular data:

- **readxl** reads Excel files (both `.xls` and `.xlsx`).
See [Chapter -@sec-import-spreadsheets] for more on working with data stored in Excel spreadsheets.

- **googlesheets4** reads Google Sheets.
Also see [Chapter -@sec-import-spreadsheets] for more on working with data stored in Google Sheets.

- **DBI**, along with a database specific backend (e.g. **RMySQL**, **RSQLite**, **RPostgreSQL** etc) allows you to run SQL queries against a database and return a data frame.
See [Chapter -@sec-import-databases] for more on working with databases .

- **haven** reads SPSS, Stata, and SAS files.

For hierarchical data: use **jsonlite** (by Jeroen Ooms) for json, and **xml2** for XML.
Jenny Bryan has some excellent worked examples at <https://jennybc.github.io/purrr-tutorial/>.

For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package.

@@ -557,7 +557,7 @@ df <- tribble(
)
```

If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in [Chapter -@sec-list-columns]:
If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in @sec-rectangling:

```{r}
df |> pivot_wider(

@@ -1,4 +1,4 @@
# Rectangular data {#sec-import-rectangular}
# Parsing {#sec-import-rectangular}

```{r}
#| results: "asis"

@@ -1,4 +1,4 @@
# Data rectangling {#sec-rectangle-data}
# Data rectangling {#sec-rectangling}

```{r}
#| results: "asis"
@@ -86,10 +86,10 @@ x5 <- list(1, list(2, list(3, list(4, list(5)))))
str(x5)
```

As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangle-1].
As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangling-1].
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.

[^rectangle-1]: This is an RStudio feature.
[^rectangling-1]: This is an RStudio feature.

```{r}
#| label: fig-view-collapsed

28
tidy.qmd
@@ -1,28 +0,0 @@
# Tidy {#sec-tidy-intro .unnumbered}

```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```

In this part of the book, you'll learn about data tidying, the art of getting your data into R in a useful form for visualization and modelling.
Data wrangling is very important: without it you can't work with your own data!
There are three main parts to data wrangling:

```{r}
#| echo: false
#| out-width: "75%"

knitr::include_graphics("diagrams/data-science-wrangle.png")
```

<!--# TO DO: Redo the diagram without highlighting import. -->

This part of the book proceeds as follows:

- [Chapter -@sec-list-columns] will give you tools for working with list columns --- data stored in columns of a tibble as lists.

- In [Chapter -@sec-rectangle-data], you'll learn about hierarchical data formats and how to turn them into rectangular data via unnesting.

<!--# TO DO: Revisit bullet points about new chapters. -->

@@ -1,4 +1,4 @@
# Wrangle {#sec-import-intro .unnumbered}
# Wrangle {#sec-wrangle .unnumbered}

```{r}
#| results: "asis"
@@ -14,14 +14,20 @@ But in more complex cases it encompasses both tidying and transformation as the

This part of the book proceeds as follows:

- In @sec-import-rectangular, you'll learn how to get plain-text data in rectangular formats from disk and into R.
- In @sec-rectangling, you'll learn how to get plain-text data in rectangular formats from disk and into R.

- In @sec-import-spreadsheets, you'll learn how to get data from Excel spreadsheets and Google Sheets into R.

- In @sec-import-databases, you'll learn about getting data into R from databases.

- In @sec-rectangle-data, you'll learn how to work with hierarchical data that includes deeply nested lists, as is often created we your raw data is in JSON.
- In @sec-rectangling, you'll learn how to work with hierarchical data that includes deeply nested lists, as is often created we your raw data is in JSON.

- In @sec-import-webscrape, you'll learn about harvesting data off the web and getting it into R.

- We'll close up the part with a brief discussion on other types of data and pointers for how to get them into R in @sec-import-other.
Some other types of data are not covered in this book:

- **haven** reads SPSS, Stata, and SAS files.

- xml2 for **xml2** for XML

For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package.