diff --git a/webscraping.qmd b/webscraping.qmd
index 5ec6ce3..f13b88b 100644
--- a/webscraping.qmd
+++ b/webscraping.qmd
@@ -11,7 +11,7 @@ This vignette introduces you to the basics of web scraping with [rvest](https://
 Web scraping is a very useful tool for extracting data from web pages.
 Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from @sec-rectangling.
 Where possible, you should use the API, because typically it will give you more reliable data.
-Unfortunately, however, programming with web APIs is out of scope for this book.
+Unfortunately, however, programming with web APIs is out of scope for this book.
 Instead, we are teaching scraping, a technique that works whether or not a site provides an API.
 
 In this chapter, we'll first discuss the ethics and legalities of scraping before we dive into the basics of HTML.
@@ -76,7 +76,7 @@ If your work involves scraping personally identifiable information, we strongly
 ### Copyright
 
 Finally, you also need to worry about copyright law.
-Copyright law is complicated, but it's worth taking a look at the [US law](https://www.law.cornell.edu/uscode/text/17/102) which describes exactly what's protected: "[...] original works of authorship fixed in any tangible medium of expression, [...]".
+Copyright law is complicated, but it's worth taking a look at the [US law](https://www.law.cornell.edu/uscode/text/17/102) which describes exactly what's protected: "\[...\] original works of authorship fixed in any tangible medium of expression, \[...\]".
 It then goes on to describe specific categories that it applies like literary works, musical works, motions pictures and more.
 Notably absent from copyright protection are data.
 This means that as long as you limit your scraping to facts, copyright protection does not apply.
@@ -138,11 +138,9 @@ Most elements can have content in between their start and end tags.
 This content can either be text or more elements.
 For example, the following HTML contains paragraph of text, with one word in bold.
 
-```
-<p>
-  Hi! My <b>name</b> is Hadley.
-</p>
-```
+<p>
+  Hi! My <b>name</b> is Hadley.
+</p>
 
 The **children** of a node refers only to elements, so the `<p>` element above has one child, the `<b>` element.
 The `<b>` element has no children, but it does have contents (the text "name").
@@ -471,7 +469,6 @@ knitr::include_graphics("screenshots/scraping-imdb.png", dpi = 300)
 This data has a clear tabular structure so it's worth starting with `html_table()`:
 
 ```{r}
-#| cache: true
 url <- "https://www.imdb.com/chart/top"
 html <- read_html(url)