Intro update (#1333)

* Update contributors

* Trim intro

* Update intro.qmd

* Update intro.qmd

---------

Co-authored-by: Mine Cetinkaya-Rundel <cetinkaya.mine@gmail.com>
Hadley Wickham 2023-03-01 22:51:48 -06:00 committed by GitHub
parent 63cbf8a90d
commit fc631a4509
2 changed files with 49 additions and 55 deletions

@ -1,15 +1,18 @@
login,n,name,blog
ALShum,1,Alex,www.ALShum.com
Abinashbunty,1,Abinash Satapathy,https://www.abinash.nl/
Adrianzo,1,A. s.,NA
AlanFeder,1,NA,NA
AlbertRapp,1,NA,
AlbertRapp,1,NA,NA
AnttiRask,1,Antti Rask,youcanbeapirate.com
BB1464,1,Oluwafemi OYEDELE,statisticalinference.netlify.app
BarkleyBG,1,Brian G. Barkley,BarkleyBG.netlify.com
BinxiePeterson,1,Bianca Peterson,NA
BirgerNi,1,Birger Niklas,NA
DDClark,1,David Clark,NA
DOH-RPS1303,1,Russell Shean,
DSGeoff,1,NA,NA
Divider85,3,NA,
EdwinTh,4,Edwin Thoen,thats-so-random.com
EricKit,1,Eric Kitaif,NA
GeroVanMi,1,Gerome Meyer,https://astralibra.ch
@ -17,9 +20,12 @@ GoldbergData,1,Josh Goldberg,https://twitter.com/GoldbergData
Iain-S,1,Iain,NA
JeffreyRStevens,2,Jeffrey Stevens,https://decisionslab.unl.edu/
JeldorPKU,1,蒋雨蒙,https://jeldorpku.github.io
KittJonathan,10,Jonathan Kitt,
MJMarshall,2,NA,NA
MarckK,1,Kara de la Marck,https://www.linkedin.com/in/karadelamarck
MattWittbrodt,1,Matt Wittbrodt,mattwittbrodt.com
MatthiasLiew,3,Matthias Liew,
NedJWestern,1,Ned Western,
Nowosad,6,Jakub Nowosad,https://nowosad.github.io
PursuitOfDataScience,14,Y. Yu,https://youzhi.netlify.app/
RIngyao,1,Jajo,NA
@ -29,7 +35,7 @@ ReeceGoding,1,NA,NA
RobinKohrs,1,Robin Kohrs,https://quarantino.netlify.app/
Robinlovelace,2,Robin,http://robinlovelace.net
RodAli,1,Rod Mazloomi,NA
RohanAlexander,1,Rohan Alexander,https://www.rohanalexander.com/
RohanAlexander,5,Rohan Alexander,https://www.rohanalexander.com/
RomeroBarata,1,Romero Morais,NA
ShanEllis,1,Shannon Ellis,shanellis.com
Shurakai,2,Christian Heinrich,NA
@ -38,6 +44,7 @@ a-rosenberg,1,NA,NA
a2800276,1,Tim Becker,NA
adam-gruer,1,Adam Gruer,adamgruer.rbind.io
adidoit,1,adi pradhan,http://adidoit.github.io
aephidayatuloh,1,Aep Hidyatuloh,
agila5,1,Andrea Gilardi,NA
ajay-d,1,Ajay Deonarine,http://deonarine.com/
aleloi,1,NA,NA
@ -63,8 +70,9 @@ bgreenwell,9,Brandon Greenwell,NA
bklamer,11,Brett Klamer,NA
boardtc,1,NA,NA
c-hoh,1,Christian,hohenfeld.is
caddycarine,1,Caddy,
camillevleonard,1,Camille V Leonard,https://www.camillevleonard.com/
canovasjm,1,NA,
canovasjm,1,NA,NA
cedricbatailler,1,Cedric Batailler,cedricbatailler.me
chrMongeau,1,Christian Mongeau,http://mongeau.net
coopermor,2,Cooper Morris,NA
@ -76,10 +84,12 @@ curtisalexander,1,Curtis Alexander,https://www.calex.org
cwarden,2,Christian G. Warden,http://xn.pinkhamster.net/
cwickham,1,Charlotte Wickham,http://cwick.co.nz
darrkj,1,Kenny Darrell,http://darrkj.github.io/blogs
davidrsch,4,David,
davidrubinger,1,David Rubinger,NA
derwinmcgeary,1,Derwin McGeary,http://derwinmcgeary.github.io
dgromer,2,Daniel Gromer,NA
djbirke,1,NA,NA
djnavarro,1,Danielle Navarro,https://djnavarro.net
dongzhuoer,5,Zhuoer Dong,https://dongzhuoer.github.io
dpastoor,2,Devin Pastoor,NA
duju211,13,Julian During,NA
@ -87,6 +97,7 @@ dylancashman,1,Dylan Cashman,https://www.eecs.tufts.edu/~dcashm01/
eddelbuettel,1,Dirk Eddelbuettel,http://dirk.eddelbuettel.com
elgabbas,1,Ahmed El-Gabbas,https://elgabbas.github.io
enryH,1,Henry Webel,NA
ercan7,1,Ercan Karadas,
ericwatt,1,Eric Watt,www.ericdwatt.com
erikerhardt,2,Erik Erhardt,StatAcumen.com
etiennebr,2,Etienne B. Racine,NA
@ -98,9 +109,10 @@ funkybluehen,1,NA,NA
gabrivera,1,NA,NA
gadenbuie,1,Garrick Aden-Buie,https://garrickadenbuie.com
garrettgman,103,Garrett Grolemund,NA
gl-eb,1,Gleb Ebert,glebsite.ch
gridgrad,1,bahadir cankardes,NA
gustavdelius,2,Gustav W Delius,NA
hadley,1085,Hadley Wickham,http://hadley.nz
hadley,1151,Hadley Wickham,http://hadley.nz
hao-trivago,2,Hao Chen,NA
harrismcgehee,7,Harris McGehee,https://gist.github.com/harrismcgehee
hendrikweisser,1,NA,NA
@ -112,27 +124,28 @@ jacobkap,1,Jacob Kaplan,http://crimedatatool.com/
jazzlw,1,Jazz Weisman,NA
jdblischak,1,John Blischak,https://jdblischak.com/
jdstorey,1,John D. Storey,http://jdstorey.github.io/
jeffboichuk,2,Jeff Boichuk,https://www.commerce.virginia.edu/faculty/boichuk
jefferis,1,Gregory Jefferis,http://www2.mrc-lmb.cam.ac.uk/group-leaders/h-to-m/gregory-jefferis/
jennybc,5,Jennifer (Jenny) Bryan,https://jennybryan.org
jenren,1,Jen Ren,NA
jeroenjanssens,1,Jeroen Janssens,http://jeroenjanssens.com
jeromecholewa,1,NA,
jeromecholewa,1,NA,NA
jilmun,3,Janet Wesner,jilmun.github.io
jimhester,2,Jim Hester,http://www.jimhester.com
jjchern,6,JJ Chen,NA
jkolacz,1,Jacek Kolacz,NA
joannejang,2,Joanne Jang,joannejang.com
johannes4998,1,NA,
johannes4998,1,NA,NA
johnsears,1,John Sears,NA
jonathanflint,1,NA,NA
jonmcalder,1,Jon Calder,http://joncalder.co.za
jonpage,3,Jonathan Page,economistry.com
jonthegeek,1,Jon Harmon,http://jonthegeek.com
jooyoungseo,2,JooYoung Seo,https://jooyoungseo.github.io
jpetuchovas,1,Justinas Petuchovas,NA
jrdnbradford,1,Jordan,www.linkedin.com/in/jrdnbradford
jrnold,4,Jeffrey Arnold,http://jrnold.me
jroberayalas,7,Jose Roberto Ayala Solares,jroberayalas.netlify.com
jtr13,1,Joyce Robbins,
juandering,1,NA,NA
jules32,1,Julia Stewart Lowndes,http://jules32.github.io
kaetschap,1,Sonja,NA
@ -145,7 +158,7 @@ kirillseva,2,Kirill Sevastyanenko,NA
koalabearski,1,NA,NA
krlmlr,1,Kirill Müller,NA
kucharsky,1,Rafał Kucharski,NA
kwstat,1,Kevin Wright,
kwstat,1,Kevin Wright,NA
landesbergn,1,Noah Landesberg,noahlandesberg.com
lawwu,1,Lawrence Wu,NA
lindbrook,1,NA,NA
@ -155,9 +168,10 @@ matanhakim,1,Matan Hakim,NA
maurolepore,2,Mauro Lepore,https://fgeo.netlify.com/
mbeveridge,7,Mark Beveridge,https://twitter.com/mbeveridge
mcewenkhundi,1,NA,NA
mcsnowface,6,"mcsnowface, PhD",
mfherman,1,Matt Herman,mattherman.info
michaelboerman,1,Michael Boerman,https://michaelboerman.com
mine-cetinkaya-rundel,66,Mine Cetinkaya-Rundel,https://stat.duke.edu/~mc301
mine-cetinkaya-rundel,95,Mine Cetinkaya-Rundel,https://stat.duke.edu/~mc301
mitsuoxv,5,Mitsuo Shiota,https://mitsuoxv.rbind.io/
mjhendrickson,1,Matthew Hendrickson,https://about.me/matthew.j.hendrickson
mmhamdy,1,Mohammed Hamdy,NA
@ -174,8 +188,10 @@ nirmalpatel,2,Nirmal Patel,http://playpowerlabs.com
nischalshrestha,1,Nischal Shrestha,http://nischalshrestha.me
njtierney,1,Nicholas Tierney,http://www.njtierney.com
olivier6088,1,NA,NA
p0bs,1,Robin Penfold,p0bs.com
pabloedug,1,Pablo E. Garcia,NA
padamson,1,Paul Adamson,padamson.github.io
penelopeysm,1,Penelope Y,
peterhurford,1,Peter Hurford,http://www.peterhurford.com
pkq,4,Patrick Kennedy,NA
pooyataher,1,Pooya Taherkhani,https://gitlab.com/pooyat
@ -201,9 +217,12 @@ sfirke,1,Sam Firke,samfirke.com
shoili,1,NA,shoili.github.io
sibusiso16,52,S'busiso Mkhondwane,NA
sonicdoe,11,Jakob Krigovsky,https://sonicdoe.com
stephan-koenig,3,Stephan Koenig,stephankoenig.me
stephenbalogun,6,Stephen Balogun,https://stephenbalogun.github.io/stbalogun/
stragu,4,Stéphane Guillou,https://stragu.github.io/
svenski,1,Sergiusz Bleja,NA
talgalili,1,Tal Galili,https://www.r-statistics.com
tgerarden,1,Todd Gerarden,http://toddgerarden.com
timbroderick,1,Tim Broderick,http://www.timbroderick.net
timwaterhouse,1,Tim Waterhouse,NA
tjmahr,1,TJ Mahr,tjmahr.com
@ -212,14 +231,16 @@ tomjamesprior,1,Tom Prior,NA
tteo,4,Terence Teo,tteo.github.io
twgardner2,1,NA,NA
ulyngs,4,Ulrik Lyngs,www.ulriklyngs.com
uribo,1,Shinya Uryu,https://uribo.hatenablog.com
vanderlindenma,1,Martin Van der Linden,NA
waltersom,1,Walter Somerville,NA
werkstattcodes,1,NA,http://werk.statt.codes
wibeasley,2,Will Beasley,http://scholar.google.com/citations?user=ffsJTC0AAAAJ&hl=en
yihui,4,Yihui Xie,https://yihui.name
yimingli,3,Yiming (Paul) Li,https://yimingli.net
yingxingwu,1,NA,
yutannihilation,1,Hiroaki Yutani,https://twitter.com/yutannihilation
yuyu-aung,1,Yu Yu Aung,NA
zachbogart,1,Zach Bogart,zachbogart.com
zeal626,1,NA,NA
zekiakyol,4,Zeki Akyol,zekiakyol.com
zekiakyol,16,Zeki Akyol,zekiakyol.com
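
The hunks above leave the `blog` column with two encodings for "no blog": an empty field and the literal string `NA`. As a minimal sketch, assuming the file is read with base R's `read.csv()` (the book's own tooling may differ), both can be treated as missing at import time. The three rows are copied from the hunks above; the object names `csv_text` and `contributors` are illustrative only.

```r
# Three rows copied from the diff, in the file's own column layout.
csv_text <- "login,n,name,blog
AlbertRapp,1,NA,NA
Divider85,3,NA,
EdwinTh,4,Edwin Thoen,thats-so-random.com"

# Treat both empty fields and the literal string NA as missing values.
contributors <- read.csv(
  text = csv_text,
  na.strings = c("", "NA"),
  stringsAsFactors = FALSE
)

# Count contributors with no blog listed, however it was encoded.
sum(is.na(contributors$blog))
```

With this setting, the two spellings of a missing blog collapse into a single `NA`, so downstream code does not need to check for empty strings separately.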

@ -7,7 +7,7 @@ source("_common.R")
```
Data science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge.
The goal of "R for Data Science" is to help you learn the most important tools in R that will allow you to do data science efficiently and reproducibly.
The goal of "R for Data Science" is to help you learn the most important tools in R that will allow you to do data science efficiently and reproducibly, and to have some fun along the way 😃.
After reading this book, you'll have the tools to tackle a wide variety of data science challenges using the best parts of R.
## What you will learn
@ -54,7 +54,7 @@ A good visualization will show you things you did not expect or raise new questi
A good visualization might also hint that you're asking the wrong question or that you need to collect different data.
Visualizations can surprise you, and they don't scale particularly well because they require a human to interpret them.
**Models** are complementary tools to visualization.
**Models** are complementary tools to visualization.
Once you have made your questions sufficiently precise, you can use a model to answer them.
Models are a fundamentally mathematical or computational tool, so they generally scale well.
Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains!
@ -92,32 +92,21 @@ That means this book can't cover every important topic.
### Modeling
To learn more about modeling, we highly recommend [Tidy Modeling with R](https://www.tmwr.org) by our colleagues Max Kuhn and Julia Silge.
Modeling is super important for data science, but it's a big topic and unfortunately we just don't have the space to give it the coverage it deserves here.
To learn more about modeling, we highly recommend [Tidy Modeling with R](https://www.tmwr.org) by our colleagues Max Kuhn and Julia Silge.
This book will teach you the tidymodels family of packages, which, as you might guess from the name, share many conventions with the tidyverse packages we use in this book.
### Big data
This book proudly and primarily focuses on small, in-memory datasets.
This is the right place to start because you can't tackle big data unless you have experience with small data.
The tools you learn in the majority of this book will easily handle hundreds of megabytes of data, and with a bit of care, you can typically use them to work with 1-2 Gb of data.
The tools you learn in the majority of this book will easily handle hundreds of megabytes of data, and with a bit of care, you can typically use them to work with a few gigabytes of data.
We'll also show you how to get data out of databases and parquet files, both of which are often used to store big data.
You won't necessarily be able to work with the entire dataset, but that's not a problem because you only need a subset or subsample to answer the question that you're interested in.
That being said, the book also touches on getting data out of databases and out of parquet files, both of which are commonly used solutions for storing big data.
However, if you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table).
This book doesn't teach data.table because it has a very concise interface that offers fewer linguistic cues, which makes it harder to learn.
However, the performance payoff is well worth the effort required to learn it if you're working with large data.
If your data is bigger than this, carefully consider whether your big data problem is actually a small data problem in disguise.
While the complete dataset might be big, often, the data needed to answer a specific question is small.
You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in.
The challenge here is finding the right small data, which often requires a lot of iteration.
Another possibility is that your big data problem is actually a large number of small data problems in disguise.
Each individual problem might fit in memory, but you have millions of them.
For example, you might want to fit a model to each person in your dataset.
This would be trivial if you had just 10 or 100 people; instead, you have a million.
Fortunately, each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like [Hadoop](https://hadoop.apache.org/) or [Spark](https://spark.apache.org/)) that allows you to send different datasets to different computers for processing.
Once you've figured out how to answer your question for a single subset using the tools described in this book, you can learn new tools like **sparklyr** to solve it for the full dataset.
If you're routinely working with larger data (10-100 Gb, say), we recommend learning more about [data.table](https://github.com/Rdatatable/data.table).
We don't teach it here because it uses a different interface from the tidyverse and requires you to learn some different conventions.
However, it is incredibly fast, and the performance payoff is worth investing some time in learning it if you're working with large data.
### Python, Julia, and friends
@ -125,22 +114,12 @@ In this book, you won't learn anything about Python, Julia, or any other program
This isn't because we think these tools are bad.
They're not!
And in practice, most data science teams use a mix of languages, often at least R and Python.
However, we strongly believe that it's best to master one tool at a time.
You will get better faster if you dive deep rather than spreading yourself thinly over many topics.
This doesn't mean you should only know one thing, just that you'll generally learn faster if you stick to one thing at a time.
You should strive to learn new things throughout your career, but make sure your understanding is solid before you move on to the next exciting thing.
We think R is a great place to start your data science journey because it is an environment designed from the ground up to support data science.
R is not just a programming language; it is also an interactive environment for doing data science.
To support interaction, R is a much more flexible language than many of its peers.
This flexibility has its downsides, but the big upside is how easy it is to have code that is structured like the problem you are trying to solve for specific parts of the data science process.
These mini languages help you think about problems as a data scientist while supporting fluent interaction between your brain and the computer.
But we strongly believe that it's best to master one tool at a time, and R is a great place to start.
## Prerequisites
We've made a few assumptions about what you already know to get the most out of this book.
You should be generally numerically literate, and it's helpful if you have some programming experience already.
You should be generally numerically literate, and it's helpful if you have some basic programming experience already.
If you've never programmed before, you might find [Hands on Programming with R](https://rstudio-education.github.io/hopr/) by Garrett to be a valuable adjunct to this book.
You need four things to run the code in this book: R, RStudio, a collection of R packages called the **tidyverse**, and a handful of other packages.
@ -149,21 +128,16 @@ They include reusable functions, documentation that describes how to use them, a
### R
To download R, go to CRAN, the **c**omprehensive **R** **a**rchive **n**etwork.
CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages.
Don't pick a mirror close to you; instead, use the cloud mirror, <https://cloud.r-project.org>, which automatically figures it out for you.
To download R, go to CRAN, the **c**omprehensive **R** **a**rchive **n**etwork, <https://cloud.r-project.org>.
A new major version of R comes out once a year, and there are 2-3 minor releases each year.
It's a good idea to update regularly.
Upgrading can be a bit of a hassle, especially for major versions requiring you to re-install all your packages, but putting it off only makes it worse.
You'll need at least R 4.1.0 for this book.
We recommend R 4.2.0 or later for this book.
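
One way to check an installation against this recommendation is sketched below. It uses only base R's `getRversion()`; the variable names `recommended` and `current` and the message wording are assumptions made for illustration, not part of the book.

```r
# Compare the running R version against the 4.2.0 recommendation.
recommended <- "4.2.0"
current <- getRversion()

if (current >= recommended) {
  message("R ", as.character(current), " meets the recommendation.")
} else {
  message(
    "R ", as.character(current), " is older than ", recommended,
    "; consider upgrading."
  )
}
```

`getRversion()` returns a version object rather than a plain string, so the comparison is made component by component (for example, 4.10.0 would sort after 4.2.0).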
### RStudio
RStudio is an integrated development environment, or IDE, for R programming.
Download and install it from <https://posit.co/download/rstudio-desktop/>.
RStudio is updated a couple of times a year.
When a new version is available, RStudio will let you know.
RStudio is an integrated development environment, or IDE, for R programming, which you can download from <https://posit.co/download/rstudio-desktop/>.
RStudio is updated a couple of times a year, and it will automatically let you know when a new version is out so there's no need to check back.
It's a good idea to upgrade regularly to take advantage of the latest and greatest features.
For this book, make sure you have at least RStudio 2022.02.0.
@ -189,7 +163,7 @@ You'll also need to install some R packages.
An R **package** is a collection of functions, data, and documentation that extends the capabilities of base R.
Using packages is key to the successful use of R.
The majority of the packages that you will learn in this book are part of the so-called tidyverse.
All packages in the tidyverse share a common philosophy of data and R programming and are designed to work together naturally.
All packages in the tidyverse share a common philosophy of data and R programming and are designed to work together.
You can install the complete tidyverse with a single line of code:
@ -201,7 +175,6 @@ install.packages("tidyverse")
On your computer, type that line of code in the console, and then press enter to run it.
R will download the packages from CRAN and install them on your computer.
If you have problems installing, make sure that you are connected to the internet and that <https://cloud.r-project.org/> isn't blocked by your firewall or proxy.
You will not be able to use the functions, objects, or help files in a package until you load it with `library()`.
Once you have installed a package, you can load it using the `library()` function:
@ -214,7 +187,7 @@ This tells you that tidyverse loads nine packages: dplyr, forcats, ggplot2, lubr
These are considered the **core** of the tidyverse because you'll use them in almost every analysis.
Packages in the tidyverse change fairly frequently.
You can check whether updates are available and optionally install them by running `tidyverse_update()`.
You can see if updates are available by running `tidyverse_update()`.
### Other packages