From 949966eae22cbbfbbdd81ad5824d67c54639f96e Mon Sep 17 00:00:00 2001
From: seanpwilliams
Date: Mon, 29 Aug 2016 06:06:55 -0700
Subject: [PATCH 1/6] intro - minor syntax fixes (#334)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

the largest change was adding “\” to get the twitter handles to render properly
---
 intro.Rmd | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/intro.Rmd b/intro.Rmd
index 2116d36..c25827c 100644
--- a/intro.Rmd
+++ b/intro.Rmd
@@ -26,11 +26,11 @@ The last step of data science is __communication__, an absolutely critical part
 Surrounding all these tools is __programming__. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.
-You'll use these six tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play: you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.
+You'll use these six tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play; you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.
 ## The tidyverse
-The majority of the packages that you will learn in this book are part of the so-called tidyverse. All packages in the tidyverse share a common philosophy of data and R programming, which makes them fit together naturally.
Because they are designed with a unifying vision you should experience fewer problems when you combine multiple packages to solve real problems. The packages in the tidyverse are not perfect, but they fit together well, and over time that fit will continue to improve.
+The majority of the packages that you will learn in this book are part of the so-called tidyverse. All packages in the tidyverse share a common philosophy of data and R programming, which makes them fit together naturally. Because they are designed with a unifying vision, you should experience fewer problems when you combine multiple packages to solve real problems. The packages in the tidyverse are not perfect, but they fit together well, and over time that fit will continue to improve.
 There are many other excellent packages that are not part of the tidyverse, because they are designed with a different set of underlying principles. This doesn't make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages. As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data. But we hope that the tidyverse will continue to provide a solid foundation no matter how far you go in R.
@@ -52,8 +52,7 @@ The previous description of the tools of data science is organised roughly accor
 * Programming tools are not necessarily interesting in their own right, but do allow you to tackle considerably more challenging problems. We'll give you a selection of programming tools in the middle of the book, and
- then you'll see they can combine with the data science tools to tackle interesting
- modelling problems.
+ then you'll see they can combine with the data science tools to tackle interesting modelling problems.
 Within each chapter, we try and stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details.
Each section of the book is paired with exercises to help you practice what you've learned. While it's tempting to skip the exercises, there's no better way to learn than practicing on real problems.
@@ -109,7 +108,7 @@ To run the code in this book, you will need to install both R and the RStudio ID
 ### RStudio
-RStudio is an integrated development environment, or IDE, for R programming. When you get started there two key regions in the interface:
+RStudio is an integrated development environment, or IDE, for R programming. When you get started, there two key regions in the interface:
 ```{r echo = FALSE, out.width = "75%"}
 knitr::include_graphics("diagrams/rstudio-console.png")
@@ -159,7 +158,7 @@ Throughout the book we use a consistent set of conventions to refer to code:
 ## Getting help and learning more
-This book is not an island: there is no single resource that will allow you to master R. As you start to apply the techniques described in this book to your own data you will soon find questions that I do not answer. This section describes a few tips to help you get help, and to help you keep learning.
+This book is not an island; there is no single resource that will allow you to master R. As you start to apply the techniques described in this book to your own data you will soon find questions that I do not answer. This section describes a few tips to help you get help, and to help you keep learning.
 If you get stuck, start with Google. Typically adding "R" to a query is enough to restrict it to relevant results: if the search isn't useful, it often means that there aren't any R-specific results available. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web.
(If the error message isn't in English, run `Sys.setenv(LANGUAGE = "en")` and re-run the code; you're more likely to find help for English error messages.)
@@ -169,7 +168,7 @@ There are three things you need to include to make your example reproducible: re
 1. **Packages** should be loaded at the top of the script, so it's easy to see which ones the example needs. This is a good time to check that you're
- using the latest version of each package: it's possible you've discovered
+ using the latest version of each package; it's possible you've discovered
 a bug that's been fixed since you installed the package.
 1. The easiest way to include **data** in a question is to use `dput()` to
@@ -197,7 +196,7 @@ There are three things you need to include to make your example reproducible: re
 Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script in.
-You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way to is follow what Hadley, Garrett, and everyone else at RStudio are doing on the [RStudio blog](https://blog.rstudio.org). This is where we post announcements about new packages, new IDE features, and in-person courses. You might also want to follow Hadley ([@hadleywickham](https://twitter.com/hadleywickham)) or Garrett ([@statgarrett](https://twitter.com/statgarrett)) on Twitter, or follow [@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
+You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way to is follow what Hadley, Garrett, and everyone else at RStudio are doing on the [RStudio blog](https://blog.rstudio.org).
This is where we post announcements about new packages, new IDE features, and in-person courses. You might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)) or Garrett ([\@statgarrett](https://twitter.com/statgarrett)) on Twitter, or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
 To keep up with the R community more broadly, we recommend reading : it aggregates over 500 blogs about R from around the world. If you're an active Twitter user, follow the `#rstats` hashtag. Twitter is one of the key tools that Hadley uses to keep up with new developments in the community.

From 5f16562857e22b44baea20e389b1de7644b8f081 Mon Sep 17 00:00:00 2001
From: S'busiso Mkhondwane
Date: Tue, 30 Aug 2016 00:34:01 +0200
Subject: [PATCH 2/6] Update communicate-plots.Rmd (#327)

Typos and somewhere in the chapter you this line "You can use labels in the same way (a character vector the same length as breaks),..." I think you are missing a word "...(a character vector "use" the same length as breaks)"
---
 communicate-plots.Rmd | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/communicate-plots.Rmd b/communicate-plots.Rmd
index 0221264..7713b1e 100644
--- a/communicate-plots.Rmd
+++ b/communicate-plots.Rmd
@@ -78,7 +78,7 @@ ggplot(df, aes(x, y)) +
 ### Exercises
-1. Create one plot of the fuel economy data with customized `title`,
+1. Create one plot on the fuel economy data with customised `title`,
 `subtitle`, `caption`, `x`, `y`, and `colour` labels.
 1. The `geom_smooth()` is somewhat misleading because the `hwy` for
@@ -221,7 +221,7 @@ The only limit is your imagination (and your patience with positioning annotatio
 ### Exercises
-1. Use `geom_text()` with infinite positions to place text at of the
+1. Use `geom_text()` with infinite positions to place text at the
 four corners of the plot.
 1. Read the documentation for `annotate()`.
How can you use it to add a text
@@ -287,7 +287,7 @@ ggplot(mpg, aes(displ, hwy)) +
 scale_y_continuous(labels = NULL)
 ```
-You can also use `breaks` and `labels` to control the appearance of legends. Collectively axes and legends are called __guides__. Axes are used for x and y aesthetics; legends are used used for everything else.
+You can also use `breaks` and `labels` to control the appearance of legends. Collectively axes and legends are called __guides__. Axes are used for x and y aesthetics; legends are used for everything else.
 Another use of `breaks` is when you have relatively few data points and want to highlight exactly where the observations occur. For example, take this plot that shows when each US president started and ended their term.
@@ -484,7 +484,7 @@ In this particular case, you could have simply used faceting, but this technique
 ## Themes
-Finally, you can customize the non-data elements of your plot with a theme:
+Finally, you can customise the non-data elements of your plot with a theme:
 ```{r, message = FALSE}
 ggplot(mpg, aes(displ, hwy)) +
@@ -521,7 +521,7 @@ Generally, however, I think you should be assembling your final reports using R
 ### Figure sizing
-The biggest challenge of graphics in RMarkdown is getting your figures the right size and shape. There are five main options that control figure sizing: `fig.width`, `fig.height`, `fig.asp`, `out.width` and `out.height`. Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e., height, width, and aspect ratio: pick two of three).
+The biggest challenge of graphics in R Markdown is getting your figures the right size and shape. There are five main options that control figure sizing: `fig.width`, `fig.height`, `fig.asp`, `out.width` and `out.height`.
Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e., height, width, and aspect ratio: pick two of three).
 I only ever use three of the five options:

From e8b4980a58326e1ed06e2c9fc84d8e0fc35d7d7e Mon Sep 17 00:00:00 2001
From: seamus-mckinsey
Date: Mon, 29 Aug 2016 18:34:25 -0400
Subject: [PATCH 3/6] fixed a typo (#336)

fixed a typo
---
 model-building.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/model-building.Rmd b/model-building.Rmd
index 61b4db1..c144428 100644
--- a/model-building.Rmd
+++ b/model-building.Rmd
@@ -60,7 +60,7 @@ ggplot(diamonds, aes(carat, price)) +
 We can make it easier to see how the other attributes of a diamond affect its relative `price` by fitting a model to separate out the effect of `carat`. But first, lets make a couple of tweaks to the diamonds dataset to make it easier to work with:
-1. Focus on diamonds bigger smaller than 2.5 carats (99.7% of the data)
+1. Focus on diamonds smaller than 2.5 carats (99.7% of the data)
 1. Log-transform the carat and price variables.
 ```{r}

From 5ecefc3f2fd831d30064c08a30be85400482723a Mon Sep 17 00:00:00 2001
From: seamus-mckinsey
Date: Mon, 29 Aug 2016 18:35:41 -0400
Subject: [PATCH 4/6] another typo (#337)

should've bundled these two - will do next time!
---
 model-building.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/model-building.Rmd b/model-building.Rmd
index c144428..3e2870e 100644
--- a/model-building.Rmd
+++ b/model-building.Rmd
@@ -116,7 +116,7 @@ ggplot(diamonds2, aes(color, lresid)) + geom_boxplot()
 ggplot(diamonds2, aes(clarity, lresid)) + geom_boxplot()
 ```
-Now we see the relationship we expect: as the quality of the diamond increases, so to does it's relative pirce. To interpret the `y` axis, we need to think about what the residuals are telling us, and what scale they are on.
A residual of -1 indicates that `lprice` was 1 unit lower than a prediction based solely on its weight. $2^{-1}$ is 1/2, points with a value of -1 are half the expected price, and residuals with value 1 are twice the predicted price.
+Now we see the relationship we expect: as the quality of the diamond increases, so to does it's relative price. To interpret the `y` axis, we need to think about what the residuals are telling us, and what scale they are on. A residual of -1 indicates that `lprice` was 1 unit lower than a prediction based solely on its weight. $2^{-1}$ is 1/2, points with a value of -1 are half the expected price, and residuals with value 1 are twice the predicted price.
 ### A model complicated model

From e08db5b3fbf5902147c21a5600ee2351dcbac147 Mon Sep 17 00:00:00 2001
From: Brett Klamer
Date: Tue, 30 Aug 2016 08:55:16 -0400
Subject: [PATCH 5/6] Fixing typos in factors.Rmd (#306)

* Fixing typos in factors.Rmd

* Update factors.Rmd

If 'Those' is more appropriate.
---
 factors.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/factors.Rmd b/factors.Rmd
index 411a181..81bbec6 100644
--- a/factors.Rmd
+++ b/factors.Rmd
@@ -92,7 +92,7 @@ ggplot(gss_cat, aes(race)) +
 These levels represent valid values that simply did not occur in this dataset. Unfortunately, dplyr doesn't yet have a `drop` option, but it will in the future.
-When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operation are described in the sections below.
+When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.
 ### Exercise

From 02ad31606df2386f7ea3a685fca732af043cae59 Mon Sep 17 00:00:00 2001
From: Cooper Morris
Date: Tue, 30 Aug 2016 07:55:25 -0500
Subject: [PATCH 6/6] Fixed Figure Reference (#339)

Was rendering @{ref:dt-algebra} instead of figure number.
---
 datetimes.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/datetimes.Rmd b/datetimes.Rmd
index 910e120..a1a3e0b 100644
--- a/datetimes.Rmd
+++ b/datetimes.Rmd
@@ -481,7 +481,7 @@ To find out how many periods fall into an interval, you need to use integer divi
 How do you pick between duration, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
-Figure \@{ref:dt-algebra} summarises permitted arithmetic operations between the different data types.
+Figure \@(ref:dt-algebra) summarises permitted arithmetic operations between the different data types.
 ```{r dt-algebra, echo = FALSE, fig.cap = "The allowed arithmetic operations between pairs of date/time classes."}
 knitr::include_graphics("diagrams/datetimes-arithmetic.png")