Edited intro to streamline prereqs and define variable, observation, and value.
parent
d5d52f05c6
commit
80ab0f1752
70
intro.Rmd
|
@ -12,7 +12,7 @@ The goal of "R for Data Science" is to give you a solid foundation into using R
|
|||
|
||||
* Getting your data into R so you can work with it.
|
||||
|
||||
* Wrangling your data into a tidy form, so it's easier to work with and you
|
||||
* Wrangling your data into a tidy form, so it's easier to work with. This lets you
|
||||
spend your time struggling with your questions, not fighting to get data
|
||||
into the right form for different functions.
|
||||
|
||||
|
@ -37,7 +37,7 @@ The goal of "R for Data Science" is to give you a solid foundation into using R
|
|||
|
||||
## Learning data science
|
||||
|
||||
Above, I've listed the components of the data science process in roughly the order you'll encounter them in an analysis (although of course you'll iterate multiple times). This, howver, is not the order you'll encounter them in this book. This is because:
|
||||
Above, I've listed the components of the data science process in roughly the order you'll encounter them in an analysis (although of course you'll iterate multiple times). This, however, is not the order you'll encounter them in this book. This is because:
|
||||
|
||||
* Starting with data ingest is boring. It's much more interesting to learn
|
||||
some new visualisation and manipulation tools on data that's already been
|
||||
|
@ -45,15 +45,30 @@ Above, I've listed the components of the data science process in roughly the ord
|
|||
to your own data.
|
||||
|
||||
* Some topics, like modelling, are best explained with other tools, like
|
||||
visualisation and manipulation. These need to come later in the book.
|
||||
visualisation and manipulation. These topics need to come later in the book.
|
||||
|
||||
We've designed this order based on our experience teaching live classes, and it's been carefully designed to keep you motivated. We try and stick to a similar pattern within each chapter: give some bigger motivating examples so you can see the bigger picture, and then dive into the details.
|
||||
We've honed this order based on our experience teaching live classes, and it's been carefully designed to keep you motivated. We try and stick to a similar pattern within each chapter: give some bigger motivating examples so you can see the bigger picture, and then dive into the details.
|
||||
|
||||
Each section of the book also comes with exercises to help you practice what you've learned. It's tempting to skip these, but there's no better way to learn than practicing. If you were taking a class with either of us, we'd force you to do them by making them homework. (Sometimes I feel like teaching is the art of tricking people to do what's in their own best interests.)
|
||||
|
||||
## Talking about data science
|
||||
|
||||
Throughout the book, we will discuss the principles of data that will help you become a better scientist. That begins here. We will refer to the terms below throughout the book because they are so useful.
|
||||
|
||||
* A _variable_ is a quantity, quality, or property that you can measure.
|
||||
|
||||
* A _value_ is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
|
||||
|
||||
* An _observation_ is a set of measurements you make under similar conditions (usually all at the same time or on the same object). Observations contain values that you measure on different variables.
|
||||
|
||||
These terms will help us speak precisely about the different parts of a data set. They will also provide a system for turning data into insights.
|
||||
|
||||
This book focuses exclusively on structured data sets: collections of values that are each associated with a variable and an observation.
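As a rough illustration of these three terms, the sketch below builds a tiny data frame in base R. The data are made up for this example (the `plants` object and its column names are hypothetical and do not come from the book): each column is a variable, each row is an observation, and each cell holds a value.

```{r}
# A small, made-up data set: one row per plant measured.
plants <- data.frame(
  plant_id  = c("a", "b", "c"),     # variable: which plant was measured
  height_cm = c(12.5, 9.8, 15.1),   # variable: height at measurement time
  flowering = c(TRUE, FALSE, TRUE)  # variable: whether the plant was in flower
)

names(plants)         # the variables
nrow(plants)          # the number of observations
plants[2, ]           # one observation: every value measured on plant "b"
plants$height_cm[2]   # one value: the height of plant "b"
```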
|
||||
|
||||
Each section of the book also comes with exercises to help you practice what you've learned. It's tempting to skip these, but there's no better way to learn than practicing. If you were taking a class with either of us, we'd force you to do them by making them homework. (Sometimes I feel like the art of teaching is tricking people to do what's in their own best interests.)
|
||||
|
||||
## R and big data
|
||||
|
||||
This book focuses almost exclusively on in-memory datasets.
|
||||
This book also focuses almost exclusively on in-memory datasets.
|
||||
|
||||
* Small data: data that fits in memory on a laptop, ~10 GB. Note that small
|
||||
data is still big! R is great with small data.
|
||||
|
@ -87,62 +102,41 @@ The other thing to bear in mind, is that while all your data might be big, typic
|
|||
|
||||
## Prerequisites
|
||||
|
||||
To run the code in this book, you will need to have R installed on your computer, as well as the RStudio IDE, an application that makes it easier to use R. Both R and the RStudio IDE are free and easy to install.
|
||||
To run the code in this book, you will need to install both R and the RStudio IDE, an application that makes R easier to use. Both are free and easy to install.
|
||||
|
||||
### R
|
||||
|
||||
To install R, visit [cran.r-project.org](http://cran.r-project.org). Then click the link that matches your operating system. What you do next will depend on your operating system.
|
||||
To install R, visit [cran.r-project.org](http://cran.r-project.org) and click the link that matches your operating system. What you do next will depend on your operating system.
|
||||
|
||||
* Mac users should click the most current release. This will be the `.pkg` file at the top of the page. Once the file is downloaded, double click it to open an R installer. Follow the directions in the installer to install R.
|
||||
* Mac users should click the `.pkg` file at the top of the page. This file contains the most current release of R. Once the file is downloaded, double-click it to open an R installer. Follow the directions in the installer to install R.
|
||||
|
||||
* Windows users should click "base" and then download the most current version of R, which will be linked at the top of the page.
|
||||
|
||||
* Linux users will need to select their distribution and then follow the distribution specific instructions to install R. [cran.r-project.org](https://cran.r-project.org/bin/linux/) includes these instructions along side of the files to download.
|
||||
|
||||
* Linux users should select their distribution and then follow the distribution specific instructions to install R. [cran.r-project.org](https://cran.r-project.org/bin/linux/) includes these instructions alongside the files to download.
|
||||
|
||||
### RStudio
|
||||
|
||||
Once you have R installed, it is time to download RStudio. To download RStudio, visit [www.rstudio.com/download](http://www.rstudio.com/download).
|
||||
|
||||
Choose the installer for your system. Then click the link to download the application. Once you have the application, installation is easy. Once RStudio is installed, open it as you would open any other application.
|
||||
After you install R, visit [www.rstudio.com/download](http://www.rstudio.com/download) to download the RStudio IDE. Choose the installer for your system. Then click the link to download the application. Once you have the application, installation is easy. Once the RStudio IDE is installed, open it as you would open any other application.
|
||||
|
||||
### R Packages
|
||||
|
||||
Some of the most useful parts of R come in _packages_, collections of functions and code that you can download in addition to base R. We will use several packages in this book. These include the `DBI`, `devtools`, `dplyr`, `DSR`, `ggplot2`, `haven`, `knitr`, `lubridate`, `packrat`, `readr`, `rmarkdown`, `rsqlite`, `rvest`, `scales `, `shiny`, `stringr`, and `tidyr` packages.
|
||||
An R _package_ is a collection of functions, data sets, and help files that extends the R language. We will use several packages in this book: `DBI`, `devtools`, `dplyr`, `ggplot2`, `haven`, `knitr`, `lubridate`, `packrat`, `readr`, `rmarkdown`, `RSQLite`, `rvest`, `scales`, `shiny`, `stringr`, and `tidyr`.
|
||||
|
||||
There are two general ways to install packages for R. Both require you to have an internet connection, to start an R session (by opening the RStudio IDE), and to run a command at the command line.
|
||||
|
||||
The most common way to install R packages is to download them from the package repository at [cran.r-project.org](http://cran.r-project.org). To do this, run the command `install.packages()`. Give `install.packages()` the name or names of the packages you wish to install as a character vector. R will download the packages from [cran.r-project.org](http://cran.r-project.org) and install them in your system library.
|
||||
|
||||
You can use this method to download all but one of the packages listed above. To do so, open R and run the command
|
||||
To install these packages, open the RStudio IDE and run the command
|
||||
|
||||
```{r eval = FALSE}
|
||||
# Package names are case-sensitive: the CRAN package is "RSQLite", not "rsqlite"
install.packages(c("DBI", "devtools", "dplyr", "ggplot2", "haven", "knitr", "lubridate", "packrat", "readr", "rmarkdown", "RSQLite", "rvest", "scales", "shiny", "stringr", "tidyr"))
|
||||
```
|
||||
|
||||
Some R packages are not stored on [cran.r-project.org](http://cran.r-project.org), but are hosted in online repositories maintained by the package's developer. The most common place to host these packages is [www.github.com](http://www.github.com).
|
||||
R will download the packages from [cran.r-project.org](http://cran.r-project.org) and install them in your system library. So be sure that you are connected to the internet, and that you have not blocked [cran.r-project.org](http://cran.r-project.org) in your firewall or proxy settings.
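If you are not sure which packages are already in your library, one possible approach is to compare the names you want against what `installed.packages()` reports and install only the difference. This is a minimal sketch and not part of the original text: `missing_packages()` is a hypothetical helper written for this example, and the short package list is only illustrative.

```{r eval = FALSE}
# Hypothetical helper: return the packages in `pkgs` that are not yet
# present in any library R can see.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

wanted <- c("dplyr", "ggplot2", "tidyr")
to_install <- missing_packages(wanted)

# Only download when something is missing and a person is at the console.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

Checking first avoids re-downloading packages you already have, which helps on a slow or restricted connection.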
|
||||
|
||||
For example, `DSR` is a collection of data sets that we have assembled for this book and saved online as a github repository ([github.com/garrettgman/DSR](http://github.com/garrettgman/DSR)).
|
||||
|
||||
You can install packages stored on github with the `install_github()` function in the `devtools` package. (You can install the `devtools` package itself from [cran.r-project.org](http://cran.r-project.org) with `install.packages()`.) To use the function, pass it a character string with the form `"<github username>/<github repository name>"`.
|
||||
|
||||
To install `DSR`, run the command
|
||||
After you have downloaded the packages, you can load any of the packages into your current R session with the `library()` command, e.g.
|
||||
|
||||
```{r eval = FALSE}
|
||||
devtools::install_github("garrettgman/DSR")
|
||||
library(tidyr)
|
||||
```
|
||||
|
||||
#### `library()`
|
||||
|
||||
When R installs a package, it downloads the package to your system library. This does not automatically load the contents of the package into your current or future R sessions. To use the functions and data sets that come in an R package saved in your system library, you must load the package into your current R session with `library()`.
|
||||
|
||||
For example, to use the functions in the `tidyr` package, you would need to first run
|
||||
|
||||
```{r eval = FALSE}
|
||||
library("tidyr")
|
||||
```
|
||||
|
||||
You will need to rerun this command each time you open a new R session in which you wish to use the `tidyr` package.
|
||||
You will not be able to use the functions, objects, and help files in a package until you load it with `library()`. You will need to reload the package if you start a new R session.
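The difference between an installed package and a loaded one can be seen with a package that ships with R. The sketch below uses `tools` as a stand-in for the packages listed above, on the assumption that it is not attached by default in your session (it is not in a standard R setup). `search()` lists the packages attached to the current session.

```{r eval = FALSE}
# `tools` is installed with R, but a fresh session does not attach it.
"package:tools" %in% search()

# After library(), its functions can be called by name.
library(tools)
"package:tools" %in% search()
file_ext("intro.Rmd")

# You can also call a single function from an installed package without
# attaching the package, by using the `::` operator.
tools::file_ext("intro.Rmd")
```

The `::` form is handy for one-off calls, while `library()` is more convenient when you use many functions from the same package.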
|
||||
|
||||
### Getting help
|
||||
|
||||
|
|