Merge branch 'main' of https://github.com/hadley/r4ds
commit f023b8a2ee
|
@ -19,7 +19,7 @@ But CSV files aren't very efficient: you have to do quite a lot of work to read
|
|||
In this chapter, you'll learn about a powerful alternative: the [parquet format](https://parquet.apache.org/), an open standards-based format widely used by big data systems.
|
||||
|
||||
We'll pair parquet files with [Apache Arrow](https://arrow.apache.org), a multi-language toolbox designed for efficient analysis and transport of large datasets.
|
||||
We'll use Apache Arrow via the the [arrow package](https://arrow.apache.org/docs/r/), which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax.
|
||||
We'll use Apache Arrow via the [arrow package](https://arrow.apache.org/docs/r/), which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax.
|
||||
As an additional benefit, arrow is extremely fast: you'll see some examples later in the chapter.
|
||||
|
||||
Both arrow and dbplyr provide dplyr backends, so you might wonder when to use each.
|
||||
|
|
|
@ -336,7 +336,7 @@ l <- list(
|
|||
```
|
||||
|
||||
The difference between `[` and `[[` is particularly important for lists because `[[` drills down into the list while `[` returns a new, smaller list.
|
||||
To help you remember the difference, take a look at the an unusual pepper shaker shown in @fig-pepper.
|
||||
To help you remember the difference, take a look at the unusual pepper shaker shown in @fig-pepper.
|
||||
If this pepper shaker is your list `pepper`, then, `pepper[1]` is a pepper shaker containing a single pepper packet.
|
||||
If we suppose this pepper shaker is a list called `pepper`, then `pepper[1]` is a pepper shaker containing a single pepper packet.
|
||||
`pepper[2]` would look the same, but would contain the second packet.
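The distinction in this hunk can be checked directly in base R. The sketch below is illustrative only: the contents of `pepper` are made up, and the variable names `one_packet_shaker` and `first_packet` are not from the book.

```r
# Hypothetical list standing in for the pepper shaker; each element is one "packet".
pepper <- list("packet 1", "packet 2", "packet 3")

# `[` keeps the container: the result is still a list, just a shorter one.
one_packet_shaker <- pepper[1]
is.list(one_packet_shaker)
length(one_packet_shaker)

# `[[` drills down and returns the element itself, here a character vector.
first_packet <- pepper[[1]]
is.character(first_packet)
```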
|
||||
|
|
|
@ -10,9 +10,9 @@ BarkleyBG,1,Brian G. Barkley,BarkleyBG.netlify.com
|
|||
BinxiePeterson,1,Bianca Peterson,NA
|
||||
BirgerNi,1,Birger Niklas,NA
|
||||
DDClark,1,David Clark,NA
|
||||
DOH-RPS1303,1,Russell Shean,
|
||||
DOH-RPS1303,1,Russell Shean,NA
|
||||
DSGeoff,1,NA,NA
|
||||
Divider85,3,NA,
|
||||
Divider85,3,NA,NA
|
||||
EdwinTh,4,Edwin Thoen,thats-so-random.com
|
||||
EricKit,1,Eric Kitaif,NA
|
||||
GeroVanMi,1,Gerome Meyer,https://astralibra.ch
|
||||
|
@ -20,12 +20,12 @@ GoldbergData,1,Josh Goldberg,https://twitter.com/GoldbergData
|
|||
Iain-S,1,Iain,NA
|
||||
JeffreyRStevens,2,Jeffrey Stevens,https://decisionslab.unl.edu/
|
||||
JeldorPKU,1,蒋雨蒙,https://jeldorpku.github.io
|
||||
KittJonathan,10,Jonathan Kitt,
|
||||
KittJonathan,10,Jonathan Kitt,NA
|
||||
MJMarshall,2,NA,NA
|
||||
MarckK,1,Kara de la Marck,https://www.linkedin.com/in/karadelamarck
|
||||
MattWittbrodt,1,Matt Wittbrodt,mattwittbrodt.com
|
||||
MatthiasLiew,3,Matthias Liew,
|
||||
NedJWestern,1,Ned Western,
|
||||
MatthiasLiew,3,Matthias Liew,NA
|
||||
NedJWestern,1,Ned Western,NA
|
||||
Nowosad,6,Jakub Nowosad,https://nowosad.github.io
|
||||
PursuitOfDataScience,14,Y. Yu,https://youzhi.netlify.app/
|
||||
RIngyao,1,Jajo,NA
|
||||
|
@ -44,7 +44,7 @@ a-rosenberg,1,NA,NA
|
|||
a2800276,1,Tim Becker,NA
|
||||
adam-gruer,1,Adam Gruer,adamgruer.rbind.io
|
||||
adidoit,1,adi pradhan,http://adidoit.github.io
|
||||
aephidayatuloh,1,Aep Hidyatuloh,
|
||||
aephidayatuloh,1,Aep Hidyatuloh,NA
|
||||
agila5,1,Andrea Gilardi,NA
|
||||
ajay-d,1,Ajay Deonarine,http://deonarine.com/
|
||||
aleloi,1,NA,NA
|
||||
|
@ -70,7 +70,7 @@ bgreenwell,9,Brandon Greenwell,NA
|
|||
bklamer,11,Brett Klamer,NA
|
||||
boardtc,1,NA,NA
|
||||
c-hoh,1,Christian,hohenfeld.is
|
||||
caddycarine,1,Caddy,
|
||||
caddycarine,1,Caddy,NA
|
||||
camillevleonard,1,Camille V Leonard,https://www.camillevleonard.com/
|
||||
canovasjm,1,NA,NA
|
||||
cedricbatailler,1,Cedric Batailler,cedricbatailler.me
|
||||
|
@ -84,7 +84,7 @@ curtisalexander,1,Curtis Alexander,https://www.calex.org
|
|||
cwarden,2,Christian G. Warden,http://xn.pinkhamster.net/
|
||||
cwickham,1,Charlotte Wickham,http://cwick.co.nz
|
||||
darrkj,1,Kenny Darrell,http://darrkj.github.io/blogs
|
||||
davidrsch,4,David,
|
||||
davidrsch,5,David,NA
|
||||
davidrubinger,1,David Rubinger,NA
|
||||
derwinmcgeary,1,Derwin McGeary,http://derwinmcgeary.github.io
|
||||
dgromer,2,Daniel Gromer,NA
|
||||
|
@ -97,7 +97,7 @@ dylancashman,1,Dylan Cashman,https://www.eecs.tufts.edu/~dcashm01/
|
|||
eddelbuettel,1,Dirk Eddelbuettel,http://dirk.eddelbuettel.com
|
||||
elgabbas,1,Ahmed El-Gabbas,https://elgabbas.github.io
|
||||
enryH,1,Henry Webel,NA
|
||||
ercan7,1,Ercan Karadas,
|
||||
ercan7,1,Ercan Karadas,NA
|
||||
ericwatt,1,Eric Watt,www.ericdwatt.com
|
||||
erikerhardt,2,Erik Erhardt,StatAcumen.com
|
||||
etiennebr,2,Etienne B. Racine,NA
|
||||
|
@ -112,7 +112,7 @@ garrettgman,103,Garrett Grolemund,NA
|
|||
gl-eb,1,Gleb Ebert,glebsite.ch
|
||||
gridgrad,1,bahadir cankardes,NA
|
||||
gustavdelius,2,Gustav W Delius,NA
|
||||
hadley,1151,Hadley Wickham,http://hadley.nz
|
||||
hadley,1166,Hadley Wickham,http://hadley.nz
|
||||
hao-trivago,2,Hao Chen,NA
|
||||
harrismcgehee,7,Harris McGehee,https://gist.github.com/harrismcgehee
|
||||
hendrikweisser,1,NA,NA
|
||||
|
@ -145,7 +145,7 @@ jpetuchovas,1,Justinas Petuchovas,NA
|
|||
jrdnbradford,1,Jordan,www.linkedin.com/in/jrdnbradford
|
||||
jrnold,4,Jeffrey Arnold,http://jrnold.me
|
||||
jroberayalas,7,Jose Roberto Ayala Solares,jroberayalas.netlify.com
|
||||
jtr13,1,Joyce Robbins,
|
||||
jtr13,1,Joyce Robbins,NA
|
||||
juandering,1,NA,NA
|
||||
jules32,1,Julia Stewart Lowndes,http://jules32.github.io
|
||||
kaetschap,1,Sonja,NA
|
||||
|
@ -168,10 +168,10 @@ matanhakim,1,Matan Hakim,NA
|
|||
maurolepore,2,Mauro Lepore,https://fgeo.netlify.com/
|
||||
mbeveridge,7,Mark Beveridge,https://twitter.com/mbeveridge
|
||||
mcewenkhundi,1,NA,NA
|
||||
mcsnowface,6,"mcsnowface, PhD",
|
||||
mcsnowface,6,"mcsnowface, PhD",NA
|
||||
mfherman,1,Matt Herman,mattherman.info
|
||||
michaelboerman,1,Michael Boerman,https://michaelboerman.com
|
||||
mine-cetinkaya-rundel,95,Mine Cetinkaya-Rundel,https://stat.duke.edu/~mc301
|
||||
mine-cetinkaya-rundel,119,Mine Cetinkaya-Rundel,https://stat.duke.edu/~mc301
|
||||
mitsuoxv,5,Mitsuo Shiota,https://mitsuoxv.rbind.io/
|
||||
mjhendrickson,1,Matthew Hendrickson,https://about.me/matthew.j.hendrickson
|
||||
mmhamdy,1,Mohammed Hamdy,NA
|
||||
|
@ -188,10 +188,11 @@ nirmalpatel,2,Nirmal Patel,http://playpowerlabs.com
|
|||
nischalshrestha,1,Nischal Shrestha,http://nischalshrestha.me
|
||||
njtierney,1,Nicholas Tierney,http://www.njtierney.com
|
||||
olivier6088,1,NA,NA
|
||||
oliviercailloux,1,Olivier Cailloux,https://www.lamsade.dauphine.fr/~ocailloux/
|
||||
p0bs,1,Robin Penfold,p0bs.com
|
||||
pabloedug,1,Pablo E. Garcia,NA
|
||||
padamson,1,Paul Adamson,padamson.github.io
|
||||
penelopeysm,1,Penelope Y,
|
||||
penelopeysm,1,Penelope Y,NA
|
||||
peterhurford,1,Peter Hurford,http://www.peterhurford.com
|
||||
pkq,4,Patrick Kennedy,NA
|
||||
pooyataher,1,Pooya Taherkhani,https://gitlab.com/pooyat
|
||||
|
@ -238,7 +239,7 @@ werkstattcodes,1,NA,http://werk.statt.codes
|
|||
wibeasley,2,Will Beasley,http://scholar.google.com/citations?user=ffsJTC0AAAAJ&hl=en
|
||||
yihui,4,Yihui Xie,https://yihui.name
|
||||
yimingli,3,Yiming (Paul) Li,https://yimingli.net
|
||||
yingxingwu,1,NA,
|
||||
yingxingwu,1,NA,NA
|
||||
yutannihilation,1,Hiroaki Yutani,https://twitter.com/yutannihilation
|
||||
yuyu-aung,1,Yu Yu Aung,NA
|
||||
zachbogart,1,Zach Bogart,zachbogart.com
|
||||
|
|
|
|
@ -195,7 +195,7 @@ billboard |>
|
|||
After the data, there are three key arguments:
|
||||
|
||||
- `cols` specifies which columns need to be pivoted, i.e. which columns aren't variables. This argument uses the same syntax as `select()` so here we could use `!c(artist, track, date.entered)` or `starts_with("wk")`.
|
||||
- `names_to` names of the variable stored in the column names, we named that variable `week`.
|
||||
- `names_to` names the variable stored in the column names, we named that variable `week`.
|
||||
- `values_to` names the variable stored in the cell values, we named that variable `rank`.
|
||||
|
||||
Note that in the code `"week"` and `"rank"` are quoted because those are new variables we're creating, they don't yet exist in the data when we run the `pivot_longer()` call.
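As a small sketch of how the three arguments fit together, assuming the tidyr package is installed: the `songs` table below is invented for illustration and is much smaller than `billboard`.

```r
library(tidyr)

# Made-up miniature of a billboard-style table: one column per week.
songs <- tibble::tibble(
  artist = c("A", "B"),
  track  = c("Song 1", "Song 2"),
  wk1    = c(10, 25),
  wk2    = c(8, NA)
)

longer <- songs |>
  pivot_longer(
    cols = starts_with("wk"),  # columns whose names hold values, not variables
    names_to = "week",         # new column receiving the old column names
    values_to = "rank"         # new column receiving the cell values
  )

longer
```

`"week"` and `"rank"` are quoted here for the reason given above: neither column exists in `songs` before the call.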
|
||||
|
@ -448,7 +448,7 @@ knitr::include_graphics("diagrams/tidy-data/names-and-values.png", dpi = 270)
|
|||
## Widening data
|
||||
|
||||
So far we've used `pivot_longer()` to solve the common class of problems where values have ended up in column names.
|
||||
Next we'll pivot (HA HA) to `pivot_wider()`, which which makes datasets **wider** by increasing columns and reducing rows and helps when one observation is spread across multiple rows.
|
||||
Next we'll pivot (HA HA) to `pivot_wider()`, which makes datasets **wider** by increasing columns and reducing rows and helps when one observation is spread across multiple rows.
|
||||
This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data.
|
||||
|
||||
We'll start by looking at `cms_patient_experience`, a dataset from the Centers of Medicare and Medicaid services that collects data about patient experiences:
|
||||
|
|
|
@ -225,7 +225,7 @@ flights |>
|
|||
|
||||
### Exercises
|
||||
|
||||
1. In a singe pipeline, find all flights that meet all of the following conditions:
|
||||
1. In a single pipeline, find all flights that meet all of the following conditions:
|
||||
|
||||
- Had an arrival delay of two or more hours
|
||||
- Flew to Houston (`IAH` or `HOU`)
|
||||
|
|
|
@ -795,4 +795,4 @@ Working with dates and times can seem harder than necessary, but hopefully this
|
|||
Even if your data never crosses a day light savings boundary or involves a leap year, the functions need to be able to handle it.
|
||||
|
||||
The next chapter gives a round up of missing values.
|
||||
You've seen them in a few places and have no doubt encounter in your own analysis, and it's how time to provide a grab bag of useful techniques for dealing with them.
|
||||
You've seen them in a few places and have no doubt encounter in your own analysis, and it's now time to provide a grab bag of useful techniques for dealing with them.
|
||||
|
|
|
@ -72,7 +72,7 @@ df |> mutate(
|
|||
You might be able to puzzle out that this rescales each column to have a range from 0 to 1.
|
||||
But did you spot the mistake?
|
||||
When Hadley wrote this code he made an error when copying-and-pasting and forgot to change an `a` to a `b`.
|
||||
Preventing this type of mistake of is one very good reason to learn how to write functions.
|
||||
Preventing this type of mistake is one very good reason to learn how to write functions.
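To make the point concrete, here is one way the repeated rescaling could be pulled into a function. This is a sketch only: the helper name `rescale01` and the sample data frame are assumptions made for illustration, not taken from the lines shown in this hunk.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1].
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)   # c(min, max), ignoring missing values
  (x - rng[1]) / (rng[2] - rng[1])
}

df <- data.frame(a = c(1, 5, 9), b = c(10, 20, 40))

# One call per column; there is no column name inside the formula to mistype.
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
```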
|
||||
|
||||
### Writing a function
|
||||
|
||||
|
@ -611,7 +611,7 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
|
|||
|
||||
```{r}
|
||||
#| eval: false
|
||||
weather |> standardize_time(sched_dep_time)
|
||||
flights |> standardize_time(sched_dep_time)
|
||||
```
|
||||
|
||||
2. For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: `distinct()`, `count()`, `group_by()`, `rename_with()`, `slice_min()`, `slice_sample()`.
|
||||
|
|
|
@ -11,7 +11,7 @@ You'll also learn how to manage cognitive resources to facilitate discoveries wh
|
|||
|
||||
This website is and will always be free, licensed under the [CC BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/us/) License.
|
||||
If you'd like a physical copy of the book, you can order the 1st edition on [Amazon](https://amzn.to/2aHLAQ1), or wait until mid-2023 for the 2nd edition.
|
||||
If appreciate reading the book for free and would like to give back please make a donation to [Kākāpō Recovery](https://www.doc.govt.nz/kakapo-donate): the [kākāpō](https://www.youtube.com/watch?v=9T1vfsHYiKY) (which appears on the cover of R4DS) is a critically endangered native NZ parrot; there are only 252 left.
|
||||
If you appreciate reading the book for free and would like to give back, please make a donation to [Kākāpō Recovery](https://www.doc.govt.nz/kakapo-donate): the [kākāpō](https://www.youtube.com/watch?v=9T1vfsHYiKY) (which appears on the cover of R4DS) is a critically endangered parrot native to New Zealand; there are only 248 left.
|
||||
|
||||
If you speak another language, you might be interested in the freely available translations of the 1st edition:
|
||||
|
||||
|
|
21
intro.qmd
|
@ -199,7 +199,7 @@ In other words, the complement to the tidyverse is not the messyverse but many o
|
|||
As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data.
|
||||
|
||||
We'll use many packages from outside the tidyverse in this book.
|
||||
For example, we use the following packages to that provide interesting data sets:
|
||||
For example, we'll use the following packages because they provide interesting data sets for us to work with in the process of learning R:
|
||||
|
||||
```{r}
|
||||
#| eval: false
|
||||
|
@ -252,20 +252,7 @@ Throughout the book, we use a consistent set of conventions to refer to code:
|
|||
## Acknowledgments
|
||||
|
||||
This book isn't just the product of Hadley, Mine, and Garrett but is the result of many conversations (in person and online) that we've had with many people in the R community.
|
||||
There are a few people we'd like to thank in particular because they have spent many hours answering our questions and helping us to better think about data science:
|
||||
|
||||
- Jenny Bryan and Lionel Henry for many helpful discussions around working with lists and list-columns.
|
||||
|
||||
- The three chapters on workflow were adapted (with permission) from <https://stat545.com/block002_hello-r-workspace-wd-project.html> by Jenny Bryan.
|
||||
|
||||
- Yihui Xie for his work on the [bookdown](https://github.com/rstudio/bookdown) package and for tirelessly responding to my feature requests.
|
||||
|
||||
- Bill Behrman for his thoughtful reading of the entire book and for trying it out with his data science class at Stanford.
|
||||
|
||||
- The #rstats Twitter community who reviewed all of the draft chapters and provided tons of helpful feedback.
|
||||
|
||||
This book was written in the open, and many people contributed pull requests to fix minor problems.
|
||||
Special thanks go to everyone who contributed via GitHub:
|
||||
We're incredibly grateful for all the conversations we've had with y'all; thank you so much!
|
||||
|
||||
```{r}
|
||||
#| eval: false
|
||||
|
@ -277,7 +264,7 @@ contribs_all_json <- gh::gh("/repos/:owner/:repo/contributors",
|
|||
repo = "r4ds",
|
||||
.limit = Inf
|
||||
)
|
||||
contribs_all <- tibble(
|
||||
contribs_all <- tibble(,
|
||||
login = contribs_all_json %>% map_chr("login"),
|
||||
n = contribs_all_json %>% map_int("contributions")
|
||||
)
|
||||
|
@ -319,7 +306,7 @@ contributors <- contributors %>%
|
|||
desc = ifelse(is.na(name), login, paste0(name, " (", login, ")"))
|
||||
)
|
||||
|
||||
cat("A big thank you to all ", nrow(contributors), " people who contributed specific improvements via GitHub pull requests (in alphabetical order by username): ", sep = "")
|
||||
cat("This book was written in the open, and many people contributed via pull requests. A special thanks to all ",nrow(contributors), " of you who contributed improvements via GitHub pull requests (in alphabetical order by username): ", sep = "")
|
||||
cat(paste0(contributors$desc, collapse = ", "))
|
||||
cat(".\n")
|
||||
```
|
||||
|
|
|
@ -118,7 +118,7 @@ In simple cases, as above, this will be a single existing function.
|
|||
This is a pretty special feature of R: we're passing one function (`median`, `mean`, `str_flatten`, ...) to another function (`across`).
|
||||
This is one of the features that makes R a functional programming language.
|
||||
|
||||
It's important to note that we're passing this function to `across()`, so `across()` can call it; we're calling it ourselves.
|
||||
It's important to note that we're passing this function to `across()`, so `across()` can call it; we're not calling it ourselves.
|
||||
That means the function name should never be followed by `()`.
|
||||
If you forget, you'll get an error:
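A minimal sketch of both forms, assuming dplyr is installed; the data frame `df` and the names `ok` and `bad` are made up for illustration, and `try()` is used only so the failing call does not stop the script.

```r
library(dplyr)

df <- tibble(a = c(1, 2, 3), b = c(4, 5, 60))

# Correct: pass the function itself, so across() can call it on each column.
ok <- df |> summarize(across(a:b, median))

# Incorrect: `median()` is called immediately, with no argument, before
# across() ever sees it. try() captures the error so the script keeps running.
bad <- try(df |> summarize(across(a:b, median())), silent = TRUE)
inherits(bad, "try-error")
```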
|
||||
|
||||
|
@ -538,7 +538,7 @@ list(
|
|||
)
|
||||
```
|
||||
|
||||
So we can use `map()` get a list of 12 data frames:
|
||||
So we can use `map()` to get a list of 12 data frames:
|
||||
|
||||
```{r}
|
||||
files <- map(paths, readxl::read_excel)
|
||||
|
|
|
@ -373,7 +373,7 @@ x <- 1:10
|
|||
cumsum(x)
|
||||
```
|
||||
|
||||
If you need more complex rolling or sliding aggregates, try the [slider](https://davisvaughan.github.io/slider/) package by Davis Vaughan.
|
||||
If you need more complex rolling or sliding aggregates, try the [slider](https://slider.r-lib.org/) package by Davis Vaughan.
|
||||
|
||||
### Exercises
|
||||
|
||||
|
|
|
@ -484,7 +484,7 @@ sentences |>
|
|||
str_view()
|
||||
```
|
||||
|
||||
If you want extract the matches for each group you can use `str_match()`.
|
||||
If you want to extract the matches for each group you can use `str_match()`.
|
||||
But `str_match()` returns a matrix, so it's not particularly easy to work with[^regexps-8]:
|
||||
|
||||
[^regexps-8]: Mostly because we never discuss matrices in this book!
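For reference, a small sketch of the matrix that `str_match()` returns, assuming stringr is installed; the input strings are invented for illustration.

```r
library(stringr)

x <- c("the grey cat", "a gray sky", "no match here")

# One row per input string; column 1 is the full match, column 2 the first group.
m <- str_match(x, "gr(e|a)y")
m
```

Strings with no match produce a row of `NA`s, so the result always has one row per input.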
|
||||
|
@ -554,7 +554,7 @@ str_match(x, "gr(?:e|a)y")
|
|||
## Pattern control
|
||||
|
||||
It's possible to exercise extra control over the details of the match by using a pattern object instead of just a string.
|
||||
This allows you control the so called regex flags and match various types of fixed strings, as described below.
|
||||
This allows you to control the so called regex flags and match various types of fixed strings, as described below.
|
||||
|
||||
### Regex flags {#sec-flags}
|
||||
|
||||
|
|
|
@ -226,7 +226,7 @@ df <- tribble(
|
|||
"Marvin", "nectarine",
|
||||
"Terence", "cantaloupe",
|
||||
"Terence", "papaya",
|
||||
"Terence", "madarin"
|
||||
"Terence", "mandarin"
|
||||
)
|
||||
df |>
|
||||
group_by(name) |>
|
||||
|
|
|
@ -70,7 +70,7 @@ Note, however, the situation is rather different in Europe where courts have fou
|
|||
### Personally identifiable information
|
||||
|
||||
Even if the data is public, you should be extremely careful about scraping personally identifiable information like names, email addresses, phone numbers, dates of birth, etc.
|
||||
Europe has particularly strict laws about the collection of storage of such data ([GDPR](https://gdpr-info.eu/)), and regardless of where you live you're likely to be entering an ethical quagmire.
|
||||
Europe has particularly strict laws about the collection or storage of such data ([GDPR](https://gdpr-info.eu/)), and regardless of where you live you're likely to be entering an ethical quagmire.
|
||||
For example, in 2016, a group of researchers scraped public profile information (e.g. usernames, age, gender, location, etc.) about 70,000 people on the dating site OkCupid and they publicly released these data without any attempts for anonymization.
|
||||
While the researchers felt that there was nothing wrong with this since the data were already public, this work was widely condemned due to ethics concerns around identifiability of users whose information was released in the dataset.
|
||||
If your work involves scraping personally identifiable information, we strongly recommend reading about the OkCupid study[^webscraping-4] as well as similar studies with questionable research ethics involving the acquisition and release of personally identifiable information.
|
||||
|
@ -81,7 +81,7 @@ If your work involves scraping personally identifiable information, we strongly
|
|||
|
||||
Finally, you also need to worry about copyright law.
|
||||
Copyright law is complicated, but it's worth taking a look at the [US law](https://www.law.cornell.edu/uscode/text/17/102) which describes exactly what's protected: "\[...\] original works of authorship fixed in any tangible medium of expression, \[...\]".
|
||||
It then goes on to describe specific categories that it applies like literary works, musical works, motions pictures and more.
|
||||
It then goes on to describe specific categories that it applies like literary works, musical works, motion pictures and more.
|
||||
Notably absent from copyright protection are data.
|
||||
This means that as long as you limit your scraping to facts, copyright protection does not apply.
|
||||
(But note that Europe has a separate "[sui generis](https://en.wikipedia.org/wiki/Database_right)" right that protects databases.)
|
||||
|
|