Mild import/wrangling reorg
.gitignore
@@ -14,5 +14,5 @@ libs
_main.*
tmp-pdfcrop-*
figures

/.quarto/
site_libs

EDA.qmd
@@ -81,7 +81,7 @@ To make the discussion easier, let's define some terms:
    Tabular data is *tidy* if each value is placed in its own "cell", each variable in its own column, and each observation in its own row.

So far, all of the data that you've seen has been tidy.
In real-life, most data isn't tidy, so we'll come back to these ideas again in [Chapter -@sec-list-columns] and [Chapter -@sec-rectangle-data].
In real-life, most data isn't tidy, so we'll come back to these ideas again in @sec-rectangling.

## Variation

_quarto.yml
@@ -65,14 +65,13 @@ book:
        - missing-values.qmd
        - column-wise.qmd

    - part: import.qmd
    - part: wrangle.qmd
      chapters:
        - import-rectangular.qmd
        - import-spreadsheets.qmd
        - import-databases.qmd
        - rectangle.qmd
        - import-webscrape.qmd
        - import-other.qmd
        - parsing.qmd
        - spreadsheets.qmd
        - databases.qmd
        - rectangling.qmd
        - webscraping.qmd

    - part: program.qmd
      chapters:

@@ -11,8 +11,7 @@ status("polishing")

Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data.
In this chapter, you'll learn how to read plain-text rectangular files into R.
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data.
We'll finish with a few pointers to packages that are useful for other types of data.
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data, which we'll come back to in @sec-wrangle.

### Prerequisites

@@ -320,33 +319,10 @@ There are two alternatives:
    ```

Feather tends to be faster than RDS and is usable outside of R.
RDS supports list-columns (which you'll learn about in [Chapter -@sec-list-columns]; feather currently does not.
RDS supports list-columns (which you'll learn about in @sec-rectangling; feather currently does not.

```{r}
#| include: false

file.remove("students-2.csv")
file.remove("students.rds")
```

## Other types of data

To get other types of data into R, we recommend starting with the tidyverse packages listed below.
They're certainly not perfect, but they are a good place to start.
For rectangular data:

-   **readxl** reads Excel files (both `.xls` and `.xlsx`).
    See [Chapter -@sec-import-spreadsheets] for more on working with data stored in Excel spreadsheets.

-   **googlesheets4** reads Google Sheets.
    Also see [Chapter -@sec-import-spreadsheets] for more on working with data stored in Google Sheets.

-   **DBI**, along with a database specific backend (e.g. **RMySQL**, **RSQLite**, **RPostgreSQL** etc) allows you to run SQL queries against a database and return a data frame.
    See [Chapter -@sec-import-databases] for more on working with databases .

-   **haven** reads SPSS, Stata, and SAS files.

For hierarchical data: use **jsonlite** (by Jeroen Ooms) for json, and **xml2** for XML.
Jenny Bryan has some excellent worked examples at <https://jennybc.github.io/purrr-tutorial/>.

For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package.

@@ -557,7 +557,7 @@ df <- tribble(
)
```

If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in [Chapter -@sec-list-columns]:
If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in @sec-rectangling:

```{r}
df |> pivot_wider(

@@ -1,4 +1,4 @@
# Rectangular data {#sec-import-rectangular}
# Parsing {#sec-import-rectangular}

```{r}
#| results: "asis"

@@ -1,4 +1,4 @@
# Data rectangling {#sec-rectangle-data}
# Data rectangling {#sec-rectangling}

```{r}
#| results: "asis"

@@ -86,10 +86,10 @@ x5 <- list(1, list(2, list(3, list(4, list(5)))))
str(x5)
```

As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangle-1].
As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangling-1].
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.

[^rectangle-1]: This is an RStudio feature.
[^rectangling-1]: This is an RStudio feature.

```{r}
#| label: fig-view-collapsed

tidy.qmd
@@ -1,28 +0,0 @@
# Tidy {#sec-tidy-intro .unnumbered}

```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```

In this part of the book, you'll learn about data tidying, the art of getting your data into R in a useful form for visualization and modelling.
Data wrangling is very important: without it you can't work with your own data!
There are three main parts to data wrangling:

```{r}
#| echo: false
#| out-width: "75%"

knitr::include_graphics("diagrams/data-science-wrangle.png")
```

<!--# TO DO: Redo the diagram without highlighting import. -->

This part of the book proceeds as follows:

-   [Chapter -@sec-list-columns] will give you tools for working with list columns --- data stored in columns of a tibble as lists.

-   In [Chapter -@sec-rectangle-data], you'll learn about hierarchical data formats and how to turn them into rectangular data via unnesting.

<!--# TO DO: Revisit bullet points about new chapters. -->

@@ -1,4 +1,4 @@
# Wrangle {#sec-import-intro .unnumbered}
# Wrangle {#sec-wrangle .unnumbered}

```{r}
#| results: "asis"

@@ -14,14 +14,20 @@ But in more complex cases it encompasses both tidying and transformation as the

This part of the book proceeds as follows:

-   In @sec-import-rectangular, you'll learn how to get plain-text data in rectangular formats from disk and into R.
-   In @sec-rectangling, you'll learn how to get plain-text data in rectangular formats from disk and into R.

-   In @sec-import-spreadsheets, you'll learn how to get data from Excel spreadsheets and Google Sheets into R.

-   In @sec-import-databases, you'll learn about getting data into R from databases.

-   In @sec-rectangle-data, you'll learn how to work with hierarchical data that includes deeply nested lists, as is often created we your raw data is in JSON.
-   In @sec-rectangling, you'll learn how to work with hierarchical data that includes deeply nested lists, as is often created we your raw data is in JSON.

-   In @sec-import-webscrape, you'll learn about harvesting data off the web and getting it into R.

-   We'll close up the part with a brief discussion on other types of data and pointers for how to get them into R in @sec-import-other.
Some other types of data are not covered in this book:

-   **haven** reads SPSS, Stata, and SAS files.

-   xml2 for **xml2** for XML

For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package.