Light editing of intro

Hadley Wickham 2022-02-16 10:20:39 -06:00
parent 2535eeda0e
commit 6376a68ebf
1 changed files with 14 additions and 56 deletions

@@ -35,13 +35,6 @@ A good visualisation will show you things that you did not expect, or raise new
A good visualisation might also hint that you're asking the wrong question, or you need to collect different data.
Visualisations can surprise you and don't scale particularly well because they require a human to interpret them.
**Models** are complementary tools to visualisation.
Once you have made your questions sufficiently precise, you can use a model to answer them.
Models are a fundamentally mathematical or computational tool, so they generally scale well.
Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains!
But every model makes assumptions, and by its very nature a model cannot question its own assumptions.
That means a model cannot fundamentally surprise you.
The last step of data science is **communication**, an absolutely critical part of any data analysis project.
It doesn't matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others.
@@ -56,18 +49,10 @@ Throughout this book we'll point you to resources where you can learn more.
## How this book is organised
The previous description of the tools of data science is organised roughly according to the order in which you use them in an analysis (although of course you'll iterate through them multiple times).
In our experience, however, this is not the best way to learn them:
- Starting with data ingest and tidying is sub-optimal because 80% of the time it's routine and boring, and the other 20% of the time it's weird and frustrating.
That's a bad place to start learning a new subject!
Instead, we'll start with visualisation and transformation of data that's already been imported and tidied.
That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth it.
- Some topics are best explained with other tools.
For example, we believe that it's easier to understand how models work if you already know about visualisation, tidy data, and programming.
- Programming tools are not necessarily interesting in their own right, but do allow you to tackle considerably more challenging problems.
We'll give you a selection of programming tools in the middle of the book, and then you'll see how they can combine with the data science tools to tackle interesting modelling problems.
In our experience, however, this is not the best way to learn them because starting with data ingest and tidying is sub-optimal because 80% of the time it's routine and boring, and the other 20% of the time it's weird and frustrating.
That's a bad place to start learning a new subject!
Instead, we'll start with visualisation and transformation of data that's already been imported and tidied.
That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth it.
Within each chapter, we try and stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details.
Each section of the book is paired with exercises to help you practice what you've learned.
@@ -79,6 +64,8 @@ There are some important topics that this book doesn't cover.
We believe it's important to stay ruthlessly focused on the essentials so you can get up and running as quickly as possible.
That means this book can't cover every important topic.
### Modelling
### Big data
This book proudly focuses on small, in-memory datasets.
@@ -118,33 +105,6 @@ To support interaction, R is a much more flexible language than many of its peer
This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process.
These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer.
### Non-rectangular data
This book focuses exclusively on rectangular data: collections of values that are each associated with a variable and an observation.
There are lots of datasets that do not naturally fit in this paradigm, including images, sounds, trees, and text.
But rectangular data frames are extremely common in science and industry, and we believe that they are a great place to start your data science journey.
### Hypothesis confirmation
It's possible to divide data analysis into two camps: hypothesis generation and hypothesis confirmation (sometimes called confirmatory analysis).
The focus of this book is unabashedly on hypothesis generation, or data exploration.
Here you'll look deeply at the data and, in combination with your subject knowledge, generate many interesting hypotheses to help explain why the data behaves the way it does.
You evaluate the hypotheses informally, using your scepticism to challenge the data in multiple ways.
The complement of hypothesis generation is hypothesis confirmation.
Hypothesis confirmation is hard for two reasons:
1. You need a precise mathematical model in order to generate falsifiable predictions.
This often requires considerable statistical sophistication.
2. You can only use an observation once to confirm a hypothesis.
As soon as you use it more than once you're back to doing exploratory analysis.
This means to do hypothesis confirmation you need to "preregister" (write out in advance) your analysis plan, and not deviate from it even when you have seen the data.
It's common to think about modelling as a tool for hypothesis confirmation, and visualisation as a tool for hypothesis generation.
But that's a false dichotomy: models are often used for exploration, and with a little care you can use visualisation for confirmation.
The key difference is how often do you look at each observation: if you look only once, it's confirmation; if you look more than once, it's exploration.
## Prerequisites
We've made a few assumptions about what you already know in order to get the most out of this book.
@@ -163,7 +123,8 @@ Don't try and pick a mirror that's close to you: instead use the cloud mirror, <
A new major version of R comes out once a year, and there are 2-3 minor releases each year.
It's a good idea to update regularly.
Upgrading can be a bit of a hassle, especially for major versions, which require you to reinstall all your packages, but putting it off only makes it worse.
Upgrading can be a bit of a hassle, especially for major versions, which require you to re-install all your packages, but putting it off only makes it worse.
You'll need at least R 4.1.0 for this book.
### RStudio
@@ -172,7 +133,7 @@ Download and install it from <http://www.rstudio.com/download>.
RStudio is updated a couple of times a year.
When a new version is available, RStudio will let you know.
It's a good idea to upgrade regularly so you can take advantage of the latest and greatest features.
For this book, make sure you have at least RStudio 1.0.0.
For this book, make sure you have at least RStudio 1.6.0.
When you start RStudio, you'll see two key regions in the interface:
@@ -255,7 +216,8 @@ Throughout the book we use a consistent set of conventions to refer to code:
- Other R objects (like data or function arguments) are in a code font, without parentheses, like `flights` or `x`.
- If we want to make it clear what package an object comes from, we'll use the package name followed by two colons, like `dplyr::mutate()`, or\
`nycflights13::flights`. This is also valid R code.
`nycflights13::flights`.
This is also valid R code.
## Getting help and learning more
@@ -313,22 +275,18 @@ Twitter is one of the key tools that Hadley uses to keep up with new development
## Acknowledgements
This book isn't just the product of Hadley and Garrett, but is the result of many conversations (in person and online) that we've had with the many people in the R community.
This book isn't just the product of Hadley, Mine, and Garrett, but is the result of many conversations (in person and online) that we've had with the many people in the R community.
There are a few people we'd like to thank in particular, because they have spent many hours answering our questions and helping us to better think about data science:
- Jenny Bryan and Lionel Henry for many helpful discussions around working with lists and list-columns.
- The three chapters on workflow were adapted (with permission), from <http://stat545.com/block002_hello-r-workspace-wd-project.html> by Jenny Bryan.
- Genevera Allen for discussions about models, modelling, the statistical learning perspective, and the difference between hypothesis generation and hypothesis confirmation.
- Yihui Xie for his work on the [bookdown](https://github.com/rstudio/bookdown) package, and for tirelessly responding to my feature requests.
- Bill Behrman for his thoughtful reading of the entire book, and for trying it out with his data science class at Stanford.
- The \#rstats Twitter community who reviewed all of the draft chapters and provided tons of useful feedback.
- Tal Galili for augmenting his dendextend package to support a section on clustering that did not make it into the final draft.
- The #rstats Twitter community who reviewed all of the draft chapters and provided tons of useful feedback.
This book was written in the open, and many people contributed pull requests to fix minor problems.
Special thanks go to everyone who contributed via GitHub:
@@ -351,7 +309,7 @@ cat(".\n")
## Colophon
An online version of this book is available at <http://r4ds.had.co.nz>.
An online version of this book is available at [http://r4ds.had.co.nz](http://r4ds.hadley.nz){.uri}.
It will continue to evolve in between reprints of the physical book.
The source of the book is available at <https://github.com/hadley/r4ds>.
The book is powered by <https://bookdown.org> which makes it easy to turn R Markdown files into HTML, PDF, and EPUB.