Can't cache xml2 duh
parent 4ff400ff60
commit 28671ed8bd
@@ -11,7 +11,7 @@ This vignette introduces you to the basics of web scraping with [rvest](https://
 Web scraping is a very useful tool for extracting data from web pages.
 Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from @sec-rectangling.
 Where possible, you should use the API, because typically it will give you more reliable data.
-Unfortunately, however, programming with web APIs is out of scope for this book.
+Unfortunately, however, programming with web APIs is out of scope for this book.
 Instead, we are teaching scraping, a technique that works whether or not a site provides an API.
 
 In this chapter, we'll first discuss the ethics and legalities of scraping before we dive into the basics of HTML.
@@ -76,7 +76,7 @@ If your work involves scraping personally identifiable information, we strongly
 ### Copyright
 
 Finally, you also need to worry about copyright law.
-Copyright law is complicated, but it's worth taking a look at the [US law](https://www.law.cornell.edu/uscode/text/17/102) which describes exactly what's protected: "[...] original works of authorship fixed in any tangible medium of expression, [...]".
+Copyright law is complicated, but it's worth taking a look at the [US law](https://www.law.cornell.edu/uscode/text/17/102) which describes exactly what's protected: "\[...\] original works of authorship fixed in any tangible medium of expression, \[...\]".
 It then goes on to describe specific categories that it applies to, like literary works, musical works, motion pictures and more.
 Notably absent from copyright protection are data.
 This means that as long as you limit your scraping to facts, copyright protection does not apply.
@@ -138,11 +138,9 @@ Most elements can have content in between their start and end tags.
 This content can either be text or more elements.
 For example, the following HTML contains a paragraph of text, with one word in bold.
 
-```
-<p>
-Hi! My <b>name</b> is Hadley.
-</p>
-```
+<p>
+Hi! My <b>name</b> is Hadley.
+</p>
 
 The **children** of a node refer only to elements, so the `<p>` element above has one child, the `<b>` element.
 The `<b>` element has no children, but it does have contents (the text "name").
@@ -471,7 +469,6 @@ knitr::include_graphics("screenshots/scraping-imdb.png", dpi = 300)
 This data has a clear tabular structure so it's worth starting with `html_table()`:
 
 ```{r}
-#| cache: true
 url <- "https://www.imdb.com/chart/top"
 html <- read_html(url)
 