Move learning more to the end of model building

hadley 2016-07-27 17:04:38 -05:00
parent fb37246e62
commit 3104dbfca0
2 changed files with 30 additions and 26 deletions

@ -441,3 +441,30 @@ How do you decide how many parameters to use for the spline? You can either eith
1. It's a little frustrating that Sunday and Saturday are on separate ends
of the plot. Write a small function to set the manipulate the levels of the
factor so that the week starts on Monday.
## Learning more
We have only scratched the absolute surface of modelling, but you have hopefully gained some simple, but general purpose tools that you can use to improve your own data analyses. It's ok to start simple! As you've seen, even very simple models can make a dramatic difference in your ability to tease out interactions between variables.
These modelling chapters are even more opinionated than the rest of the book. I approach modelling from a somewhat different perspective to most others, and there is relatively little space devoted to it. Modelling really deserves a book on its own, so I'd highly recommend that you read at least one of these three books:
* *Statistical Modeling: A Fresh Approach* by Danny Kaplan,
<http://www.mosaic-web.org/go/StatisticalModeling/>. This book provides
a gentle introduction to modelling, where you build your intuition,
mathematical tools, and R skills in parallel. The book replaces a traditional
"introduction to statistics" course, providing a curriculum that is up-to-date
and relevant to data science.
* *An Introduction to Statistical Learning* by Gareth James, Daniela Witten,
Trevor Hastie, and Robert Tibshirani, <http://www-bcf.usc.edu/~gareth/ISL/>
(available online for free). This book presents a family of modern modelling
techniques collectively known as statistical learning. For an even deeper
understanding of the math behind the models, read the classic
*Elements of Statistical Learning* by Trevor Hastie, Robert Tibshirani, and
Jerome Friedman, <http://statweb.stanford.edu/~tibs/ElemStatLearn/> (also
available online for free).
* *Applied Predictive Modeling* by Max Kuhn and Kjell Johnson,
<http://appliedpredictivemodeling.com>. This book is a companion to the
__caret__ package, and provides practical tools for dealing with real-life
predictive modelling challenges.

@ -33,6 +33,8 @@ This book is not going to give you a deep understanding of the mathematical theo
on the powerful idea of random resamples. These will help you understand
how your model will behave on new datasets.
## Hypothesis generation vs. hypothesis confirmation
In this book, we are going to use models as a tool for exploration, completing the trifecta of EDA tools introduced in Part 1. This is not how models are usually taught, but they make for a particularly useful tool in this context. Every exploratory analysis will involve some transformation, modelling, and visualisation.
Models are more common taught as tools for doing inference, or for confirming that an hypothesis is true. Doing this correctly is not complicated, but it is hard. There is a pair of ideas that you must understand in order to do inference correctly:
@ -59,29 +61,4 @@ This is necessary because to confirm a hypothesis you must use data this is inde
This partitioning allows you to explore the training data, occassionally generating candidate hypotheses that you check with the query set. When you are confident you have the right model, you can check it once with the test data.
-(Note that tven when doing confirmatory modelling, you will still need to do EDA. If you don't do any EDA you will remain blind to the quality problems with your data.)
+(Note that even when doing confirmatory modelling, you will still need to do EDA. If you don't do any EDA you will remain blind to the quality problems with your data.)
### Other references
The modelling chapters are even more opinionated than the rest of the book. I approach modelling from a somewhat different perspective to most others, and there is relatively little space devoted to it. Modelling really deserves a book on its own, so I'd highly recommend that you read at least one of these three books:
* *Statistical Modeling: A Fresh Approach* by Danny Kaplan,
<http://www.mosaic-web.org/go/StatisticalModeling/>. This book provides
a gentle introduction to modelling, where you build your intuition,
mathematical tools, and R skills in parallel. The book replaces a traditional
"introduction to statistics" course, providing a curriculum that is up-to-date
and relevant to data science.
* *An Introduction to Statistical Learning* by Gareth James, Daniela Witten,
Trevor Hastie, and Robert Tibshirani, <http://www-bcf.usc.edu/~gareth/ISL/>
(available online for free). This book presents a family of modern modelling
techniques collectively known as statistical learning. For an even deeper
understanding of the math behind the models, read the classic
*Elements of Statistical Learning* by Trevor Hastie, Robert Tibshirani, and
Jerome Friedman, <http://statweb.stanford.edu/~tibs/ElemStatLearn/> (also
available online for free).
* *Applied Predictive Modeling* by Max Kuhn and Kjell Johnson,
<http://appliedpredictivemodeling.com>. This book is a companion to the
__caret__ package, and provides practical tools for dealing with real-life
predictive modelling challenges.