Can't cache xml2 duh
parent
4ff400ff60
commit
28671ed8bd

@@ -11,7 +11,7 @@ This vignette introduces you to the basics of web scraping with [rvest](https://
 Web scraping is a very useful tool for extracting data from web pages.
 Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from @sec-rectangling.
 Where possible, you should use the API, because typically it will give you more reliable data.
-Unfortunately, however, programming with web APIs is out of scope for this book.
+Unfortunately, however, programming with web APIs is out of scope for this book.
 Instead, we are teaching scraping, a technique that works whether or not a site provides an API.
 
 In this chapter, we'll first discuss the ethics and legalities of scraping before we dive into the basics of HTML.
@@ -76,7 +76,7 @@ If your work involves scraping personally identifiable information, we strongly
 ### Copyright
 
 Finally, you also need to worry about copyright law.
-Copyright law is complicated, but it's worth taking a look at the [US law](https://www.law.cornell.edu/uscode/text/17/102) which describes exactly what's protected: "[...] original works of authorship fixed in any tangible medium of expression, [...]".
+Copyright law is complicated, but it's worth taking a look at the [US law](https://www.law.cornell.edu/uscode/text/17/102) which describes exactly what's protected: "\[...\] original works of authorship fixed in any tangible medium of expression, \[...\]".
 It then goes on to describe specific categories that it applies like literary works, musical works, motions pictures and more.
 Notably absent from copyright protection are data.
 This means that as long as you limit your scraping to facts, copyright protection does not apply.
@@ -138,11 +138,9 @@ Most elements can have content in between their start and end tags.
 This content can either be text or more elements.
 For example, the following HTML contains paragraph of text, with one word in bold.
 
-```
-<p>
-Hi! My <b>name</b> is Hadley.
-</p>
-```
+<p>
+Hi! My <b>name</b> is Hadley.
+</p>
 
 The **children** of a node refers only to elements, so the `<p>` element above has one child, the `<b>` element.
 The `<b>` element has no children, but it does have contents (the text "name").
@@ -471,7 +469,6 @@ knitr::include_graphics("screenshots/scraping-imdb.png", dpi = 300)
 This data has a clear tabular structure so it's worth starting with `html_table()`:
 
 ```{r}
-#| cache: true
 url <- "https://www.imdb.com/chart/top"
 html <- read_html(url)
 
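
A note on the last hunk: the commit message does not say why the chunk can't be cached. A likely reason (an assumption, not stated in the commit) is that `read_html()` returns an xml2 document backed by an external pointer into memory managed by libxml2, and such pointers are not restored when an object is written to disk and read back, which is what a knitr cache does. If caching were still wanted, one possible workaround is to cache the page as plain text and re-parse it. The sketch below is illustrative only: the inline HTML string and the names `path`, `restored`, and `bold_text` are hypothetical and not part of the commit.

```r
library(xml2)

# Parse a small literal document so the example needs no network access.
html <- read_html("<p>Hi! My <b>name</b> is Hadley.</p>")

# Store the document as a plain character string. Text survives
# saveRDS()/readRDS(), whereas the external pointer inside `html` would not.
path <- tempfile(fileext = ".rds")
saveRDS(as.character(html), path)

# Re-parse the stored text to get a usable document back.
restored <- read_html(readRDS(path))

# Confirm the restored document can be queried as usual.
bold_text <- xml_text(xml_find_first(restored, "//b"))
```

The same idea would apply to a real page: keep the cached value as text (or a data frame extracted from it) and do the parsing in an uncached chunk.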