Can't cache xml2 duh

Hadley Wickham 2023-01-12 17:01:49 -06:00
parent 4ff400ff60
commit 28671ed8bd
1 changed file with 5 additions and 8 deletions

@@ -76,7 +76,7 @@ If your work involves scraping personally identifiable information, we strongly
### Copyright
Finally, you also need to worry about copyright law.
Copyright law is complicated, but it's worth taking a look at the [US law](https://www.law.cornell.edu/uscode/text/17/102) which describes exactly what's protected: "[...] original works of authorship fixed in any tangible medium of expression, [...]".
Copyright law is complicated, but it's worth taking a look at the [US law](https://www.law.cornell.edu/uscode/text/17/102) which describes exactly what's protected: "\[...\] original works of authorship fixed in any tangible medium of expression, \[...\]".
It then goes on to describe specific categories that it applies to, like literary works, musical works, motion pictures, and more.
Notably absent from copyright protection are data.
This means that as long as you limit your scraping to facts, copyright protection does not apply.
@@ -138,11 +138,9 @@ Most elements can have content in between their start and end tags.
This content can either be text or more elements.
For example, the following HTML contains a paragraph of text, with one word in bold.
```
<p>
Hi! My <b>name</b> is Hadley.
</p>
```
<p>
Hi! My <b>name</b> is Hadley.
</p>
The **children** of a node refers only to elements, so the `<p>` element above has one child, the `<b>` element.
The `<b>` element has no children, but it does have contents (the text "name").
@@ -471,7 +469,6 @@ knitr::include_graphics("screenshots/scraping-imdb.png", dpi = 300)
This data has a clear tabular structure so it's worth starting with `html_table()`:
```{r}
#| cache: true
url <- "https://www.imdb.com/chart/top"
html <- read_html(url)