diff --git a/EDA.Rmd b/EDA.Rmd
index 262634e..3dd884d 100644
--- a/EDA.Rmd
+++ b/EDA.Rmd
@@ -1,9 +1,9 @@
+# Exploratory Data Analysis
+
 ```{r include=FALSE}
 knitr::opts_chunk$set(fig.height = 2)
 ```
 
-# Exploratory Data Analysis (EDA)
-
 ## Introduction
 
 This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call Exploratory Data Analysis, or EDA for short. EDA is an interative cycle that involves:
@@ -552,3 +552,5 @@ ggplot(data = diamonds2, mapping = aes(x = carat, y = resid)) +
 ggplot(data = diamonds2, mapping = aes(x = cut, y = resid)) +
   geom_boxplot()
 ```
+
+Modelling is important because once you have recognised a pattern, a model allows you to make that pattern quantitative and precise, and partition it out from what remains. That supports a powerful iterative approach where you identify a pattern with visualisation, then subtract it with a model, allowing you to see the subtler trends that remain. I deliberately chose not to teach modelling yet, because understanding what models are and how they work is easiest once you have some other tools in hand: data wrangling and programming.
diff --git a/communicate.Rmd b/communicate.Rmd
index 8127dbe..27483e8 100644
--- a/communicate.Rmd
+++ b/communicate.Rmd
@@ -2,6 +2,12 @@
 
 # Introduction
 
+By the successful completion of a data science project you will have built up a good understanding of what is going on with the data. It doesn't matter how brilliant your understanding is unless you can communicate it to others. You will need to share your work in a way that your audience can understand. Your audience might be fellow scientists who will want to reproduce the work, non-scientists who will want to understand your findings in plain terms, or yourself (in the future) who will be thankful if you make your work easy to re-learn and recreate. __Part 5__ discusses communication, and how you can use RMarkdown to generate reproducible artefacts that combine prose and code.
+
+```{r echo = FALSE, out.width = "75%"}
+knitr::include_graphics("diagrams/data-science-communicate.png")
+```
+
 Reproducible, literate code is the data science equivalent of the Scientific Report (i.e, Intro, Methods and materials, Results, Discussion).
 
 Recommendations for learning more about communication:
diff --git a/diagrams/data-science-communicate.png b/diagrams/data-science-communicate.png
new file mode 100644
index 0000000..d3b8d00
Binary files /dev/null and b/diagrams/data-science-communicate.png differ
diff --git a/diagrams/data-science-explore.png b/diagrams/data-science-explore.png
new file mode 100644
index 0000000..bd84df3
Binary files /dev/null and b/diagrams/data-science-explore.png differ
diff --git a/diagrams/data-science-model.png b/diagrams/data-science-model.png
new file mode 100644
index 0000000..c6eabd2
Binary files /dev/null and b/diagrams/data-science-model.png differ
diff --git a/diagrams/data-science-program.png b/diagrams/data-science-program.png
new file mode 100644
index 0000000..fb49141
Binary files /dev/null and b/diagrams/data-science-program.png differ
diff --git a/diagrams/data-science-wrangle.png b/diagrams/data-science-wrangle.png
new file mode 100644
index 0000000..710076a
Binary files /dev/null and b/diagrams/data-science-wrangle.png differ
diff --git a/diagrams/data-science.graffle b/diagrams/data-science.graffle
index 86bdd14..f0ea601 100644
Binary files a/diagrams/data-science.graffle and b/diagrams/data-science.graffle differ
diff --git a/explore.Rmd b/explore.Rmd
index e3eb1ab..dc5f653 100644
--- a/explore.Rmd
+++ b/explore.Rmd
@@ -2,6 +2,12 @@
 
 # Introduction
 
+The goal of the first part of this book is to get you up to speed with the basic tools of data exploration as quickly as possible:
+
+```{r echo = FALSE, out.width = "75%"}
+knitr::include_graphics("diagrams/data-science-explore.png")
+```
+
 ```{r setup, include = FALSE}
 library(ggplot2)
 library(dplyr)
@@ -21,7 +27,7 @@ circle %>%
   knitr::kable(digits = 2)
 ```
 
-While we may stumble over raw data, we can easily process visual information. Within your mind is a powerful visual processing system fine-tuned by millions of years of evolution. As a result, often the quickest way to understand your data is to visualize it. Once you plot your data, you can instantly see the relationships between values. Here, we see that the values fall on a circle.
+While we may stumble over raw data, we can easily process visual information. Visualization works because your brain processes visual information in a different (and much wider) channel than it processes symbolic information, like words and numbers. Within your brain is a powerful visual processing system fine-tuned by millions of years of evolution. As a result, often the quickest way to understand your data is to visualize it. Once you plot your data, you can instantly see the relationships between values. Here, we see that the values fall on a circle.
 
 ```{r echo=FALSE, dependson = data, fig.asp = 1, out.width = "30%", fig.width = 3}
 ggplot(circle, aes(x, y)) +
@@ -29,8 +35,17 @@ ggplot(circle, aes(x, y)) +
   coord_fixed()
 ```
 
-Visualization works because your brain processes visual information in a different (and much wider) channel than it processes symbolic information, like words and numbers. However, visualization is not the only way to comprehend data.
+In the following chapters you will:
 
-You can also comprehend data by transforming it. You can easily attend to a small set of summary values, which lets you absorb important information about the data. This is why it feels natural to work with things like averages, maximums, minimums, medians, and so on.
+* Dive into ggplot2 in [data visualisation], learning powerful
+  and general techniques for turning raw data into visual insights.
 
-Together, visualisation and transformation form a powerful set of tools known as exploratory data analysis, or EDA for short. In this part of the book, you'll learn R through EDA, mastering the minimal set of skills to start gaining insight from your data.
+* Visualisation alone is typically not enough, so in [data transformation]
+  you'll learn the key verbs that allow you to select important variables,
+  filter out key observations, and create new variables and summaries.
+
+* In [exploratory data analysis], you'll combine visualisation and
+  transformation with your curiosity and scepticism to ask and answer
+  interesting questions about data.
+
+Modelling is an important part of the exploratory process, but you don't have the skills to effectively learn or apply it yet. We'll come back to modelling in [model], once you're better equipped with more data wrangling and programming tools.
diff --git a/model.Rmd b/model.Rmd
index 2d9423a..5f170bb 100644
--- a/model.Rmd
+++ b/model.Rmd
@@ -2,6 +2,12 @@
 
 # Introduction
 
+Now that you are equipped with powerful programming tools we can finally return to modelling. You'll use your new tools of data wrangling and programming to fit many models and understand how they work. The focus of this book is on exploration, not confirmation or formal inference. But you'll learn a few basic tools that help you understand the variation within your models.
+
+```{r echo = FALSE, out.width = "75%"}
+knitr::include_graphics("diagrams/data-science-model.png")
+```
+
 The goal of a model is to provide a simple low-dimensional summary of a dataset. Ideally, the model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in). Here we only cover "predictive" models, which, as the name suggests, generate predictions. There is another type of model that we're not going to discuss: "data discovery" models. These models don't make predictions, but instead help you discover interesting relationships within your data.
 
 This book is not going to give you a deep understanding of the mathematical theory that underlies models. It will, however, build your intution about how statisitcal models work, and give you a family of useful tools that allow you to use models to better understand your data:
diff --git a/program.Rmd b/program.Rmd
index 3055f58..a309e00 100644
--- a/program.Rmd
+++ b/program.Rmd
@@ -2,9 +2,15 @@
 
 # Introduction
 
+In this part of the book, you'll enrich your programming skills. Programming is a cross-cutting skill needed for all data science work. You must use a computer; you cannot do it in your head, nor with paper and pencil. And to work efficiently, you will need to know how to program in a computer language, such as R.
+
+```{r echo = FALSE, out.width = "75%"}
+knitr::include_graphics("diagrams/data-science-program.png")
+```
+
 Code is a tool of communication, not just to the computer, but to other people. This is important because every project you undertake is fundamentally collaborative. Even if you're not working with other people, you'll definitely be working with future-you. You want to write clear code so that future-you doesn't curse present-you when you look at a project again after several months have passed.
 
-To me, improving your communication skills is a key part of mastering R as a programming language. Over time, you want your code to become more and more clear, and easier to write. Removing duplication is an important part of expressing yourself clearly because it lets the reader (i.e. future-you!) focus on what's different between operations rather than what's the same. The goal is not just to write better functions or to do things that you couldn't do before, but to code with more "ease". As you internalise the ideas in this chapter, you should find it easier to re-tackle problems that you've struggled to solve in the past.
+Improving your communication skills is a key part of mastering R as a programming language. Over time, you want your code to become more and more clear, and easier to write. Removing duplication is an important part of expressing yourself clearly because it lets the reader (i.e. future-you!) focus on what's different between operations rather than what's the same. The goal is not just to write better functions or to do things that you couldn't do before, but to code with more "ease". As you internalise the ideas in this chapter, you should find it easier to re-tackle problems that you've struggled to solve in the past.
 
 In the following chapters, you'll learn important programming skills:
 
diff --git a/wrangle.Rmd b/wrangle.Rmd
index e0c2917..bf25ef0 100644
--- a/wrangle.Rmd
+++ b/wrangle.Rmd
@@ -2,103 +2,29 @@
 
 # Introduction
 
-Throughout this book we work with "tibbles" instead of the traditional data frame. Tibbles _are_ data frames, but tweak some older behaviours to make life a littler easier. R is an old language, and some things that were true 10 or 20 years ago no longer apply. It's difficult to change base R without breaking existing code, so most innovation occurs in packages. Here we will describe the tibble package, which provides opinionated data frames that make working in the tidyverse a little easier. You can learn more about tibbles in the accompanying vignette: `vignette("tibble")`.
+In this part of the book, you'll learn about data wrangling, the art of getting your data into R in a useful form. Data wrangling encompasses three main pieces:
 
-```{r setup}
-library(tibble)
+```{r echo = FALSE, out.width = "75%"}
+knitr::include_graphics("diagrams/data-science-wrangle.png")
 ```
 
-## Creating tibbles {#tibbles}
+* In [data import], you'll learn the art of data import: how to get your data
+  off of disk and into R.
 
-The majority of the functions that you'll use in this book already produce tibbles. If you're working with functions from other packages, you might need to coerce a regular data frame a tibble. You can do that with `as_tibble()`:
+* In [tidy data], you'll learn about tidy data, a consistent way of storing your
+  data that makes transformation, visualisation, and modelling easier.
 
-```{r}
-as_tibble(iris)
-```
+* You've already learned the basics of data transformation. In this part of the
+  book we'll dive deeper into tools useful for specific types of data:
 
-`as_tibble()` knows how to convert data frames, lists (provided the elements are equal length vectors), matrices, and tables.
+  * [Dates and times] will give you the key tools for working with
+    dates and date-times.
+
+  * [Strings] will introduce regular expressions, a powerful tool for
+    manipulating strings.
+
+  * [Relational data] will give you tools for working with multiple
+    interrelated datasets.
 
-You can create a new tibble from individual vectors with `tibble()`:
+Before we get to those chapters we'll take a brief detour to discuss the "tibble" in more detail, in [tibbles].
 
-```{r}
-tibble(x = 1:5, y = 1, z = x ^ 2 + y)
-```
-
-`tibble()` automatically recycles inputs of length 1, and you can refer to variables that you just created. Compared to `data.frame()`, `tibble()` does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates `row.names()`.
-
-Another way to create a tibble is with `frame_data()`, which is customised for data entry in R code. Column headings are defined by formulas (`~`), and entries are separated by commas:
-
-```{r}
-frame_data(
-  ~x, ~y, ~z,
-  "a", 2, 3.6,
-  "b", 1, 8.5
-)
-```
-
-## Tibbles vs. data frames
-
-There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.
-
-### Printing
-
-Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from `str()`:
-
-```{r}
-tibble(
-  a = lubridate::now() + runif(1e3) * 60,
-  b = lubridate::today() + runif(1e3),
-  c = 1:1e3,
-  d = runif(1e3),
-  e = sample(letters, 1e3, replace = TRUE)
-)
-```
-
-You can control the default appearance with options:
-
-* `options(tibble.print_max = n, tibble.print_min = m)`: if more than `m`
-  rows, print `n` rows. Use `options(dplyr.print_max = Inf)` to always
-  show all rows.
-
-* `options(tibble.width = Inf)` will always print all columns, regardless
-  of the width of the screen.
-
-You can see a complete list of options by looking at the package help: `package?tibble`.
-
-### Subsetting
-
-Tibbles are stricter about subsetting. If you try to access a variable that does not exist, you'll get a warning. Unlike data frames, tibbles do not use partial matching on column names:
-
-```{r}
-df <- data.frame(
-  abc = 1:10,
-  def = runif(10),
-  xyz = sample(letters, 10)
-)
-tb <- as_tibble(df)
-
-df$a
-tb$a
-```
-
-Tibbles clearly delineate `[` and `[[`: `[` always returns another tibble, `[[` always returns a vector.
-
-```{r}
-# With data frames, [ sometimes returns a data frame, and sometimes returns
-# a vector
-df[, 1]
-
-# With tibbles, [ always returns another tibble
-tb[, 1]
-
-# To extract a single element, you should always use [[
-tb[[1]]
-```
-
-## Interacting with legacy code
-
-Some older functions don't work with tibbles because they expect `df[, 1]` to return a vector, not a data frame. If you encounter one of these functions, use `as.data.frame()` to turn a tibble back to a data frame:
-
-```{r}
-class(as.data.frame(tb))
-```