Reshape modelling section

hadley 2016-08-15 09:48:24 -05:00
parent 92d7665920
commit 3895c2e791
4 changed files with 7 additions and 16 deletions

@@ -27,7 +27,6 @@ rmd_files: [
"model-basics.Rmd",
"model-building.Rmd",
"model-many.Rmd",
"model-assess.Rmd",
"communicate.Rmd",
"reproducible-research.Rmd",

@@ -405,7 +405,6 @@ daily %>%
We see a strong pattern in the numbers of Saturday flights. This is reassuring, because we also saw that pattern in the raw data. It's a good sign when you get the same signal from different approaches.
How do you decide how many parameters to use for the spline? You can either pick the number by eye or use automated techniques, which you'll learn about in [model assessment]. For exploration, picking by eye to capture the most important patterns is fine.
### Exercises
@ -447,7 +446,7 @@ How do you decide how many parameters to use for the spline? You can either eith
of the plot. Write a small function to manipulate the levels of the
factor so that the week starts on Monday.
## Learning more
## Learning more about models
We have only scratched the surface of modelling, but you have hopefully gained some simple but general-purpose tools that you can use to improve your own data analyses. It's ok to start simple! As you've seen, even very simple models can make a dramatic difference in your ability to tease out interactions between variables.

@ -15,8 +15,6 @@ In this chapter you're going to learn three powerful ideas that help you to work
because once you have tidy data, you can apply all of the techniques that
you've learned about earlier in the book.
These ideas are particularly powerful in conjunction with the ideas of functional programming, so make sure you've read [iteration] before starting this chapter.
We'll start by diving into a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signal so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends.
The following sections will dive into more detail about the individual techniques:
@ -35,7 +33,7 @@ The following sections will dive into more detail about the individual technique
provided by broom, and see how they can be applied to other types of
data structure.
This chapter focusses on models generated from subsets of your data (an exhaustive partition defined by the values of one or more variables). This is a powerful technique for data exploration, and is often a key step on the way to a single complex model that combines the information from all subsets. In the next chapter, you'll learn about another set of techniques that generates multiple models: resampling. Resampling is a powerful tool to help you understand the inferential properties of a model.
This chapter is somewhat aspirational: if this book is your first introduction to R, this chapter is likely to be a struggle. It requires you to have deeply internalised ideas about modelling, data structures, and iteration. So don't worry if you don't get it --- just put this chapter aside for a few months, and come back when you want to stretch your brain.
### Prerequisites

@ -26,18 +26,11 @@ This book is not going to give you a deep understanding of the mathematical theo
understand complex datasets. This is a powerful technique, but to access
it you'll need to combine modelling and programming tools.
* In [model assessment], you'll learn more about the statistical side of
modelling. Ideally, you don't want a model that works only with the
data that you've observed, but one that also generalises to new situations. You'll
learn two powerful techniques, cross-validation and bootstrapping, built
on the powerful idea of random resamples. These will help you understand
how your model will behave on new datasets.
These topics are notable because of what they don't include: any tools for quantitatively assessing models. That is deliberate: precisely quantifying a model requires a couple of big ideas that we just don't have the space to cover here. For now, you'll rely on qualitative assessment and your natural scepticism. In [Learning more about models], we'll point you to other resources where you can learn more.
## Hypothesis generation vs. hypothesis confirmation
In this book, we are going to use models as a tool for exploration, completing the trifecta of EDA tools introduced in Part 1. This is not how models are usually taught, but they make for a particularly useful tool in this context. Every exploratory analysis will involve some transformation, modelling, and visualisation.
Models are more commonly taught as tools for doing inference, or for confirming that a hypothesis is true. Doing this correctly is not complicated, but it is hard. There is a pair of ideas that you must understand in order to do inference correctly:
In this book, we are going to use models as a tool for exploration, completing the trifecta of tools for EDA that were introduced in Part 1. This is not how models are usually taught, but as you will see, models are an important tool for exploration. Traditionally, the focus of modelling is on inference, or on confirming that a hypothesis is true. Doing this correctly is not complicated, but it is hard. There is a pair of ideas that you must understand in order to do inference correctly:
1. Each observation can either be used for exploration or confirmation,
not both.
@ -46,7 +39,9 @@ Models are more common taught as tools for doing inference, or for confirming th
but you can only use it once for confirmation. As soon as you use an
observation twice, you've switched from confirmation to exploration.
This is necessary because to confirm a hypothesis you must use data that is independent of the data that you used to generate the hypothesis. Otherwise you will be overoptimistic. There is absolutely nothing wrong with exploration, but you should never sell an exploratory analysis as a confirmatory analysis because it is fundamentally misleading. If you are serious about doing a confirmatory analysis, before you begin the analysis you should split your data up into three pieces:
This is necessary because to confirm a hypothesis you must use data independent of the data that you used to generate the hypothesis. Otherwise you will be overoptimistic. There is absolutely nothing wrong with exploration, but you should never sell an exploratory analysis as a confirmatory analysis because it is fundamentally misleading.
If you are serious about doing a confirmatory analysis, one approach is to split your data into three pieces before you begin the analysis:
1. 60% of your data goes into a __training__ (or exploration) set. You're
allowed to do anything you like with this data: visualise it and fit tons