Minor updates to intro

Hadley Wickham 2022-08-08 13:47:35 -05:00
parent 5ec12ac2f6
commit 027848d806
1 changed files with 3 additions and 3 deletions

@@ -88,7 +88,7 @@ This is the right place to start because you can't tackle big data unless you ha
The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care, you can typically use them to work with 1-2 Gb of data.
If you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table).
This book doesn't teach data.table because it has a very concise interface that offers fewer linguistic cues, which makes it harder to learn.
-However, if you're working with large data, the performance payoff is worth the extra effort required to learn it.
+However, if you're working with large data, the performance payoff is well worth the effort required to learn it.
If your data is bigger than this, carefully consider whether your big data problem is actually a small data problem in disguise.
While the complete data set might be big, often the data needed to answer a specific question is small.
@@ -100,7 +100,7 @@ Each individual problem might fit in memory, but you have millions of them.
For example, you might want to fit a model to each person in your dataset.
This would be trivial if you had just 10 or 100 people, but instead you have a million.
Fortunately, each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like [Hadoop](https://hadoop.apache.org/) or [Spark](https://spark.apache.org/)) that allows you to send different datasets to different computers for processing.
-Once you've figured out how to answer your question for a single subset using the tools described in this book, you can learn new tools like **sparklyr**, **rhipe**, and **ddr** to solve it for the full dataset.
+Once you've figured out how to answer your question for a single subset using the tools described in this book, you can learn new tools like **sparklyr** to solve it for the full dataset.
### Python, Julia, and friends
@@ -148,7 +148,7 @@ Download and install it from <http://www.rstudio.com/download>.
RStudio is updated a couple of times a year.
When a new version is available, RStudio will let you know.
It's a good idea to upgrade regularly so you can take advantage of the latest and greatest features.
-For this book, make sure you have at least RStudio 1.6.0.
+For this book, make sure you have at least RStudio 2022.02.0.
When you start RStudio, you'll see two key regions in the interface: the console pane, and the output pane.