Can't cache xml2 duh
		| @@ -11,7 +11,7 @@ This vignette introduces you to the basics of web scraping with [rvest](https:// | |||||||
| Web scraping is a very useful tool for extracting data from web pages. | Web scraping is a very useful tool for extracting data from web pages. | ||||||
| Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from @sec-rectangling. | Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from @sec-rectangling. | ||||||
| Where possible, you should use the API, because typically it will give you more reliable data. | Where possible, you should use the API, because typically it will give you more reliable data. | ||||||
| Unfortunately, however, programming with web APIs is out of scope for this book.  | Unfortunately, however, programming with web APIs is out of scope for this book. | ||||||
| Instead, we are teaching scraping, a technique that works whether or not a site provides an API. | Instead, we are teaching scraping, a technique that works whether or not a site provides an API. | ||||||
|  |  | ||||||
| In this chapter, we'll first discuss the ethics and legalities of scraping before we dive into the basics of HTML. | In this chapter, we'll first discuss the ethics and legalities of scraping before we dive into the basics of HTML. | ||||||
| @@ -76,7 +76,7 @@ If your work involves scraping personally identifiable information, we strongly | |||||||
| ### Copyright | ### Copyright | ||||||
|  |  | ||||||
| Finally, you also need to worry about copyright law. | Finally, you also need to worry about copyright law. | ||||||
| Copyright law is complicated, but it's worth taking a look at the [US law](https://www.law.cornell.edu/uscode/text/17/102) which describes exactly what's protected: "[...] original works of authorship fixed in any tangible medium of expression, [...]". | Copyright law is complicated, but it's worth taking a look at the [US law](https://www.law.cornell.edu/uscode/text/17/102) which describes exactly what's protected: "\[...\] original works of authorship fixed in any tangible medium of expression, \[...\]". | ||||||
| It then goes on to describe specific categories that it applies like literary works, musical works, motions pictures and more. | It then goes on to describe specific categories that it applies like literary works, musical works, motions pictures and more. | ||||||
| Notably absent from copyright protection are data. | Notably absent from copyright protection are data. | ||||||
| This means that as long as you limit your scraping to facts, copyright protection does not apply. | This means that as long as you limit your scraping to facts, copyright protection does not apply. | ||||||
| @@ -138,11 +138,9 @@ Most elements can have content in between their start and end tags. | |||||||
| This content can either be text or more elements. | This content can either be text or more elements. | ||||||
| For example, the following HTML contains paragraph of text, with one word in bold. | For example, the following HTML contains paragraph of text, with one word in bold. | ||||||
|  |  | ||||||
| ``` |     <p> | ||||||
| <p> |       Hi! My <b>name</b> is Hadley. | ||||||
|   Hi! My <b>name</b> is Hadley. |     </p> | ||||||
| </p> |  | ||||||
| ``` |  | ||||||
|  |  | ||||||
| The **children** of a node refers only to elements, so the `<p>` element above has one child, the `<b>` element. | The **children** of a node refers only to elements, so the `<p>` element above has one child, the `<b>` element. | ||||||
| The `<b>` element has no children, but it does have contents (the text "name"). | The `<b>` element has no children, but it does have contents (the text "name"). | ||||||
| @@ -471,7 +469,6 @@ knitr::include_graphics("screenshots/scraping-imdb.png", dpi = 300) | |||||||
| This data has a clear tabular structure so it's worth starting with `html_table()`: | This data has a clear tabular structure so it's worth starting with `html_table()`: | ||||||
|  |  | ||||||
| ```{r} | ```{r} | ||||||
| #| cache: true |  | ||||||
| url <- "https://www.imdb.com/chart/top" | url <- "https://www.imdb.com/chart/top" | ||||||
| html <- read_html(url) | html <- read_html(url) | ||||||
|  |  | ||||||