Can't cache xml2 duh

Hadley Wickham 2023-01-12 17:01:49 -06:00
parent 4ff400ff60
commit 28671ed8bd
1 changed files with 5 additions and 8 deletions

@@ -11,7 +11,7 @@ This vignette introduces you to the basics of web scraping with [rvest](https://
Web scraping is a very useful tool for extracting data from web pages.
Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from @sec-rectangling.
Where possible, you should use the API, because typically it will give you more reliable data.
Unfortunately, however, programming with web APIs is out of scope for this book.
Unfortunately, however, programming with web APIs is out of scope for this book.
Instead, we are teaching scraping, a technique that works whether or not a site provides an API.
In this chapter, we'll first discuss the ethics and legalities of scraping before we dive into the basics of HTML.
@@ -76,7 +76,7 @@ If your work involves scraping personally identifiable information, we strongly
### Copyright
Finally, you also need to worry about copyright law.
Copyright law is complicated, but it's worth taking a look at the [US law](https://www.law.cornell.edu/uscode/text/17/102) which describes exactly what's protected: "[...] original works of authorship fixed in any tangible medium of expression, [...]".
Copyright law is complicated, but it's worth taking a look at the [US law](https://www.law.cornell.edu/uscode/text/17/102) which describes exactly what's protected: "\[...\] original works of authorship fixed in any tangible medium of expression, \[...\]".
It then goes on to describe specific categories that it applies to, like literary works, musical works, motion pictures, and more.
Notably absent from copyright protection are data.
This means that as long as you limit your scraping to facts, copyright protection does not apply.
@@ -138,11 +138,9 @@ Most elements can have content in between their start and end tags.
This content can either be text or more elements.
For example, the following HTML contains a paragraph of text, with one word in bold.
```
<p>
Hi! My <b>name</b> is Hadley.
</p>
```
<p>
Hi! My <b>name</b> is Hadley.
</p>
The term **children** refers only to elements, so the `<p>` element above has one child, the `<b>` element.
The `<b>` element has no children, but it does have contents (the text "name").
@@ -471,7 +469,6 @@ knitr::include_graphics("screenshots/scraping-imdb.png", dpi = 300)
This data has a clear tabular structure so it's worth starting with `html_table()`:
```{r}
#| cache: true
url <- "https://www.imdb.com/chart/top"
html <- read_html(url)