---
layout: default
title: Welcome
output: bookdown::html_chapter
---

```{r setup, include = FALSE}
source("common.R")
install.packages <- function(...) invisible()
```

# Welcome

Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of "R for Data Science" is to introduce you to the most important tools that you need to do data science in R. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.

## What you will learn

Data science is a huge field, and there's no way you can master it by reading a single book. The goal of this book is to give you a solid foundation in the most important tools. These are the tools that, in our experience, people use every day. There's definitely an 80-20 rule at play: you'll do 80% of every project using this handful of tools, but the remaining 20% is much more variable. Our goal is to teach you that 80% and to point you to where you can learn more.

We think about data science as using six main tools:

`r bookdown::embed_png("diagrams/data-science.png")`

First you must __import__ your data into R. This typically means that you take data stored in a file, in a database, or in a web API, and load it into a data frame in R. If you can't get your data into R, you can't do data science on it!
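
As a minimal sketch of importing, the chunk below writes a tiny CSV file to a temporary location and reads it back into a data frame. The file contents and the names `csv_path` and `scores` are made up for illustration, and base R's `read.csv()` is used here only because it needs no extra packages.

```{r}
# Create a small CSV file in a temporary location (hypothetical example data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Ana,90", "Ben,85"), csv_path)

# Import the file into a data frame.
scores <- read.csv(csv_path)

# Inspect the structure of what was imported.
str(scores)
```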

Once you've imported your data, it's a good idea to __tidy__ it. Tidying your data means storing it in a standard form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Working with tidy data is important because the consistency lets you spend your time struggling with your questions, not fighting to get data into the right form for different functions.
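
To make "each column is a variable, each row is an observation" concrete, here is a small sketch using only base R. The data and the names `wide` and `tidy` are invented for illustration, and this is just one way to reshape a table by hand, not necessarily the approach taught later in the book.

```{r}
# An untidy layout: the variable "year" is hidden in the column names.
wide <- data.frame(
  country    = c("A", "B"),
  cases_1999 = c(10, 20),
  cases_2000 = c(15, 25)
)

# A tidy layout: one column per variable (country, year, cases),
# one row per observation (a country in a given year).
tidy <- data.frame(
  country = rep(wide$country, times = 2),
  year    = rep(c(1999, 2000), each = nrow(wide)),
  cases   = c(wide$cases_1999, wide$cases_2000)
)

tidy
```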

Once you have tidy data, a common first step is to __transform__ it: to add new variables that are functions of existing variables (like computing speed from distance and time), to rename the variables to be easier to understand, to sort your data, or to summarise it.
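
The four kinds of transformation mentioned above can be sketched in a few lines of base R. The `trips` data frame and its column names are hypothetical, and base R is used only to keep the example self-contained.

```{r}
# Hypothetical data: distance travelled and time taken for three trips.
trips <- data.frame(distance_km = c(100, 250, 80), time_h = c(2, 5, 1))

# Add a new variable that is a function of existing variables.
trips$speed_kmh <- trips$distance_km / trips$time_h

# Rename a variable to something easier to understand.
names(trips)[names(trips) == "time_h"] <- "hours"

# Sort the rows, fastest trip first.
trips <- trips[order(trips$speed_kmh, decreasing = TRUE), ]

# Summarise: collapse many values into a single number.
mean_speed <- mean(trips$speed_kmh)
mean_speed
```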

There are two main engines of knowledge generation: visualisation and modelling. These have complementary strengths and weaknesses so any real analysis will iterate between them many times. For example, you might see a scatterplot that inspires you to fit a linear model, then you transform the data to add a column of residuals from the model, and look at another scatterplot.
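
A rough sketch of one such iteration, using the built-in `mtcars` dataset and base R plotting. The choice of variables (`mpg` against `wt`) and the names `cars_df` and `fit` are ours, for illustration only.

```{r}
# Work on a copy so the built-in dataset is left untouched.
cars_df <- mtcars

# 1. Visualise: a scatterplot suggests a roughly linear relationship.
plot(cars_df$wt, cars_df$mpg)

# 2. Model: fit a linear model inspired by the plot.
fit <- lm(mpg ~ wt, data = cars_df)

# 3. Transform: add the model's residuals as a new column.
cars_df$resid <- resid(fit)

# 4. Visualise again: look for structure the model did not capture.
plot(cars_df$wt, cars_df$resid)
```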

__Visualisation__ is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions of the data. A good visualisation might also hint that you're asking the wrong question and you need to refine your thinking. In short, visualisations can surprise you, but don't scale particularly well.

__Models__ are the complementary tools to visualisation. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains. But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.

It doesn't matter how well models and visualisation have led you to understand the data, unless you can __communicate__ your results to other people. Communication is an absolutely critical part of any data analysis project.

## How you will learn

Above, I've listed the components of the data science process in roughly the order you'll encounter them in an analysis (although of course you'll iterate multiple times). In our experience, however, this is not the best way to learn them: 

* Starting with data ingest is boring. It's much more interesting to learn
  some new visualisation and manipulation tools on data that's already been
  imported and cleaned. You'll later learn the skills to apply these new ideas
  to your own data.
  
* You need to learn some cross-cutting tools that help throughout an
  analysis: programming and the RStudio IDE.
  
* Some topics, like modelling, are best explained with other tools, like
  visualisation and manipulation. These topics need to come later in the book.

We've honed this order based on our experience teaching live classes, and it's been carefully designed to keep you motivated. We try to stick to a similar pattern within each chapter: give some bigger motivating examples so you can see the bigger picture, and then dive into the details.

Each section of the book also comes with exercises to help you practice what you've learned. It's tempting to skip these, but there's no better way to learn than practicing. If you were taking a class with either of us, we'd force you to do them by making them homework. (Sometimes I feel like teaching is the art of tricking people into doing what's in their own best interests.)

## Talking about data science

Throughout the book, we will discuss the principles of data that will help you become a better scientist. That begins here. We will refer to the terms below throughout the book because they are so useful. 

* A _variable_ is a quantity, quality, or property that you can measure. 

* A _value_ is the state of a variable when you measure it. The value of a 
  variable may change from measurement to measurement.

* An _observation_ is a set of measurements you make under similar conditions 
  (usually all at the same time or on the same object). Observations contain 
  values that you measure on different variables. 

These terms will help us speak precisely about the different parts of a data set. They will also provide a system for turning data into insights.
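
The three terms map directly onto the parts of a data frame. The sketch below uses a made-up `people` data frame to show where each one lives.

```{r}
# Hypothetical measurements on three people.
people <- data.frame(
  name      = c("Ana", "Ben", "Cy"),
  height_cm = c(165, 180, 172),
  age       = c(31, 45, 28)
)

# Variables: the columns of the data frame.
names(people)

# An observation: one row, holding the values measured on one person.
people[2, ]

# A value: the state of one variable for one observation.
people$height_cm[2]
```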

This book focuses exclusively on structured data sets: collections of values that are each associated with a variable and an observation. There is lots of data that doesn't naturally fit this paradigm: images, sounds, trees, text. But data frames are extremely common in science and in industry and we believe that they're a great place to start your data analysis journey.

## What you won't learn

There are some important topics that this book doesn't cover. Here I want to talk about them briefly and tell you why.

### Big data

This book proudly focuses on in-memory, or small, datasets.

* Small data: data that fits in memory on a laptop, ~10 GB. Note that small
  data is still big! R is great with small data. Pointer to data.table.
  
* Medium data: data that fits in memory on a powerful server, ~5 TB. It's
  possible to use R with this much data, but it's challenging. Dealing
  effectively with medium data requires effective use of all cores on a
  computer. It's not that hard to do that from R, but it requires some thought,
  and many packages do not take advantage of R's tools.
  
* Big data: data that must be stored on disk or spread across the memory of
  multiple machines. Writing code that works efficiently with this sort of data
  is very challenging. Tools for this sort of data will never be written in
  R: they'll be written in a language specially designed for high performance
  computing like C/C++, Fortran or Scala. But R can still talk to these systems.
  
The other thing to bear in mind is that while all your data might be big, typically you don't need all of it to answer a specific question:

* Many questions can be answered with the right small dataset. It's often
  possible to find a subset, subsample, or summary that fits in memory and
  still allows you to answer the question you're interested in. The challenge
  here is finding the right small data, which often requires a lot of iteration.
  
* Other challenges arise because an individual problem might fit in memory,
  but you have hundreds of thousands or millions of them. For example, you 
  might want to fit a model to each person in your dataset. That would be
  trivial if you had just 10 or 100 people, but instead you have a million.
  Fortunately each problem is independent (sometimes called embarrassingly
  parallel), so you just need a system (like Hadoop) that allows you to
  send different datasets to different computers for processing.
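
To illustrate the "many independent small problems" pattern at toy scale, the sketch below splits the built-in `mtcars` dataset into groups and fits one model per group. The grouping variable and model formula are arbitrary choices of ours. Because each fit is independent of the others, the `lapply()` call is the step a parallel or distributed system would spread across cores or machines.

```{r}
# Split the data into independent pieces, one per group.
groups <- split(mtcars, mtcars$cyl)

# Fit the same model to each piece. Each fit depends only on its own
# piece, so this step is embarrassingly parallel.
models <- lapply(groups, function(df) lm(mpg ~ wt, data = df))

# Collect one summary number (the slope for wt) from each model.
slopes <- sapply(models, function(m) coef(m)[["wt"]])
slopes
```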

### Python

In this book, you won't learn anything about Python, or any other programming language. This isn't because we think Python is bad! It's a great tool, and most data science teams use a mix of (at least!) R and Python.

However, we strongly believe that it's best to master one tool at a time. You will get better faster if you dive deep, rather than spreading yourself thinly over many topics. This doesn't mean you should only know one thing, just that you'll generally learn faster if you stick to one thing at a time.

### Non-data-frame data

No trees or graphs, images or sounds. 

## Prerequisites

We've made few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and it's helpful if you have some programming experience already. If you've never programmed before, you might find [Hands on Programming with R](http://amzn.com/1449359019) by Garrett to be a useful adjunct to this book.

To run the code in this book, you will need to install both R and the RStudio IDE, an application that makes R easier to use. Both are open source, free and easy to install:

1. Download and install R from <https://www.r-project.org/alt-home/>.
1. Download and install RStudio from <http://www.rstudio.com/download>.
1. Open RStudio like you would any other application.

You'll also need to install some R packages. An R _package_ is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. To install all the packages used in this book, open RStudio and run:

```{r}
pkgs <- c(
  "bookdown", "broom", "dplyr", "ggplot2", "jpeg", "jsonlite", 
  "knitr", "microbenchmark", "png", "pryr", "purrr", "readr", "stringr", 
  "tidyr"
)
install.packages(pkgs)
```

R will download the packages from CRAN and install them in your system library. If you have problems installing, make sure that you are connected to the internet, and that you haven't blocked <http://cran.r-project.org> in your firewall or proxy.

After you have downloaded the packages, you can load any of the packages into your current R session with the `library()` command, e.g.

```{r, eval = FALSE}
library(tidyr)
```

You will not be able to use the functions, objects, and help files in a package until you load it with `library()`. You will need to reload the package if you start a new R session.
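
If `library()` complains that a package does not exist, it can help to check which packages are actually installed. The sketch below is one way to do that with base R. The vector `wanted` is a hypothetical short list, not the full set of packages used in this book.

```{r}
# A hypothetical list of packages we would like to use.
wanted <- c("dplyr", "ggplot2", "tidyr")

# Compare against the packages installed in your library.
not_installed <- setdiff(wanted, rownames(installed.packages()))

# An empty result means everything in `wanted` is ready to load.
not_installed
```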

## RStudio

Brief RStudio orientation (code, console, and output). Pointers to where to learn more.

Important keyboard shortcuts:

* Cmd + Enter: sends current line from editor to console.
* Tab: suggests possible completions for the text you've typed.
* Cmd + ↑: in the console, searches all commands you've typed that start with 
  those characters.
* Cmd + Shift + F10: restarts R.
* Alt + Shift + K: the keyboard shortcut that shows all the keyboard shortcuts.

Note about turning on save/load session off.

## Getting help

*   Google. Always a great place to start! Adding "R" to a query is usually
    enough to filter it down. If you ever hit an error message that you 
    don't know how to handle, it's a great idea to google it. 
    
    If your operating system defaults to another language, you can use 
    `Sys.setenv(LANGUAGE = "en")` to tell R to use English. That's likely to
    get you to common solutions more quickly.
  
*   Stack Overflow. How to make a reproducible example. 
    ([reprex](https://github.com/jennybc/reprex))
    
    Unfortunately the R Stack Overflow community is not always the friendliest.
  
*   Twitter. #rstats hashtag is very welcoming. Great way to keep up with 
    what's happening in the community.
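
As a rough illustration of what a reproducible example can look like, the sketch below builds a tiny data frame, prints it in a form others can paste into their own session with `dput()`, and then shows the surprising behaviour in one line. The data and the question are invented, and this uses only base R rather than the reprex package linked above.

```{r}
# 1. A small, self-contained dataset that triggers the problem.
df <- data.frame(x = c(1, 2, NA), y = c("a", "b", "c"))

# 2. dput() prints R code that recreates the object exactly,
#    so readers don't need access to your files.
dput(df)

# 3. The minimal code that shows the behaviour you are asking about.
mean(df$x)                # NA, because of the missing value
mean(df$x, na.rm = TRUE)  # what you expected instead
```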

## Acknowledgements

* Jenny Bryan and Lionel Henry for many helpful discussions around working
  with lists and list-columns.

## Colophon

This book was built with:

```{r}
devtools::session_info(pkgs)
```