More minor page count tweaks & fixes

And re-convert with latest htmlbook
Hadley Wickham
2023-01-26 10:36:07 -06:00
parent d9afa135fc
commit aa9d72a7c6
38 changed files with 838 additions and 1093 deletions

@@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-webscraping">
<h1><span id="sec-scraping" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Web scraping</span></span></h1><p>This vignette introduces you to the basics of web scraping with <a href="https://rvest.tidyverse.org">rvest</a>. Web scraping is a very useful tool for extracting data from web pages. Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a>. Where possible, you should use the API, because typically it will give you more reliable data. Unfortunately, however, programming with web APIs is out of scope for this book. Instead, we are teaching scraping, a technique that works whether or not a site provides an API.</p><p>In this chapter, we'll first discuss the ethics and legalities of scraping before we dive into the basics of HTML. You'll then learn the basics of CSS selectors to locate specific elements on the page, and how to use rvest functions to get data from text and attributes out of HTML and into R. We'll then discuss some techniques to figure out what CSS selector you need for the page you're scraping, before finishing up with a couple of case studies, and a brief discussion of dynamic websites.</p>
<section id="prerequisites" data-type="sect2">
<section id="webscraping-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we'll focus on tools provided by rvest. rvest is a member of the tidyverse, but is not a core member so you'll need to load it explicitly. We'll also load the full tidyverse since we'll find it generally useful working with the data we've scraped.</p>
@@ -240,7 +240,7 @@ html |&gt;
<p><code><a href="https://rvest.tidyverse.org/reference/html_attr.html">html_attr()</a></code> always returns a string, so if you're extracting numbers or dates, you'll need to do some post-processing.</p>
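<p>As a minimal sketch of that post-processing, the example below converts attribute strings to integers and dates with base R. The <code>data-count</code> and <code>data-updated</code> attributes and their values are made up for illustration; they are not taken from any real page.</p>

```r
library(rvest)

# Hypothetical HTML: attribute names and values are invented for this sketch.
html <- minimal_html("
  <p data-count='42' data-updated='2023-01-26'>First</p>
  <p data-count='7' data-updated='2022-12-31'>Second</p>
")

items <- html |> html_elements("p")

# html_attr() gives back character vectors, even for numeric-looking values.
counts_chr <- items |> html_attr("data-count")

# Convert explicitly to the types you need.
counts <- as.integer(counts_chr)
updated <- as.Date(items |> html_attr("data-updated"))
```

<p>For messier strings (thousands separators, currency symbols), a parser such as <code>readr::parse_number()</code> is usually a better fit than <code>as.integer()</code>.</p>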
</section>
<section id="tables" data-type="sect2">
<section id="webscraping-tables" data-type="sect2">
<h2>
Tables</h2>
<p>If you're lucky, your data will already be stored in an HTML table, and it'll be a matter of just reading it from that table. It's usually straightforward to recognize a table in your browser: it'll have a rectangular structure of rows and columns, and you can copy and paste it into a tool like Excel.</p>
@@ -248,22 +248,10 @@ Tables</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">html &lt;- minimal_html("
&lt;table class='mytable'&gt;
&lt;tr&gt;
&lt;th&gt;x&lt;/th&gt;
&lt;th&gt;y&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.5&lt;/td&gt;
&lt;td&gt;2.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4.9&lt;/td&gt;
&lt;td&gt;1.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7.2&lt;/td&gt;
&lt;td&gt;8.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;th&gt;x&lt;/th&gt; &lt;th&gt;y&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1.5&lt;/td&gt; &lt;td&gt;2.7&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4.9&lt;/td&gt; &lt;td&gt;1.3&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;7.2&lt;/td&gt; &lt;td&gt;8.1&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
")</pre>
</div>
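<p>A short sketch of how a table like this can be read into R, assuming the compact one-row-per-line form of the table shown in this change: select the table element by its class, then pass it to <code>html_table()</code>. The name <code>tbl</code> is introduced here only for illustration.</p>

```r
library(rvest)

# Same small table as above, written one row per line.
html <- minimal_html("
  <table class='mytable'>
    <tr><th>x</th>   <th>y</th></tr>
    <tr><td>1.5</td> <td>2.7</td></tr>
    <tr><td>4.9</td> <td>1.3</td></tr>
    <tr><td>7.2</td> <td>8.1</td></tr>
  </table>
")

# Pick out the table by class, then convert it to a tibble.
# By default html_table() also converts column types, so x and y become numeric.
tbl <- html |>
  html_element(".mytable") |>
  html_table()

tbl
```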
@@ -374,7 +362,6 @@ section |&gt; html_element(".director") |&gt; html_text2()
IMDB top films</h2>
<p>For our next task we'll tackle something a little trickier, extracting the top 250 movies from the Internet Movie Database (IMDb). At the time we wrote this chapter, the page looked like <a href="#fig-scraping-imdb" data-type="xref">#fig-scraping-imdb</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">knitr::include_graphics("screenshots/scraping-imdb.png", dpi = 300)</pre>
<div class="cell-output-display">
<figure id="fig-scraping-imdb"><p><img src="screenshots/scraping-imdb.png" alt="The screenshot shows a table with columns &quot;Rank and Title&quot;, &quot;IMDb Rating&quot;, and &quot;Your Rating&quot;. 9 movies out of the top 250 are shown. The top 5 are the Shawshank Redemption, The Godfather, The Dark Knight, The Godfather: Part II, and 12 Angry Men." width="418"/></p>
@@ -392,14 +379,14 @@ table &lt;- html |&gt;
html_table()
table
#&gt; # A tibble: 250 × 5
#&gt; `` `Rank &amp; Title` `IMDb Rating` `Your Rating` ``
#&gt; &lt;lgl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt; &lt;lgl&gt;
#&gt; 1 NA "1.\n The Shawshank Redemptio… 9.2 "12345678910… NA
#&gt; 2 NA "2.\n The Godfather\n … 9.2 "12345678910… NA
#&gt; 3 NA "3.\n The Dark Knight\n … 9 "12345678910… NA
#&gt; 4 NA "4.\n The Godfather: Part II\… 9 "12345678910… NA
#&gt; 5 NA "5.\n 12 Angry Men\n (… 9 "12345678910… NA
#&gt; 6 NA "6.\n Schindler's List\n … 8.9 "12345678910… NA
#&gt; `` `Rank &amp; Title` `IMDb Rating` `Your Rating` ``
#&gt; &lt;lgl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt; &lt;lgl&gt;
#&gt; 1 NA "1.\n The Shawshank Redempt… 9.2 "12345678910\n… NA
#&gt; 2 NA "2.\n The Godfather\n … 9.2 "12345678910\n… NA
#&gt; 3 NA "3.\n The Dark Knight\n … 9 "12345678910\n… NA
#&gt; 4 NA "4.\n The Godfather: Part I… 9 "12345678910\n… NA
#&gt; 5 NA "5.\n 12 Angry Men\n … 9 "12345678910\n… NA
#&gt; 6 NA "6.\n Schindler's List\n … 8.9 "12345678910\n… NA
#&gt; # … with 244 more rows</pre>
</div>
<p>This includes a few empty columns, but overall does a good job of capturing the information from the table. However, we need to do some more processing to make it easier to use. First, we'll rename the columns to be easier to work with, and remove the extraneous whitespace in rank and title. We will do this with <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> (instead of <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>) to do the renaming and selecting of just these two columns in one step. Then, we'll apply <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code> (from <a href="#sec-extract-variables" data-type="xref">#sec-extract-variables</a>) to pull out the title, year, and rank into their own variables.</p>
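<p>The steps just described can be sketched on a small stand-in for the scraped table. The two rows below are typed by hand to imitate the <code>html_table()</code> output (the exact whitespace is an assumption), and the column name <code>rank_title_year</code> is chosen only for this illustration. It assumes tidyr 1.3.0 or later, where <code>separate_wider_regex()</code> is available.</p>

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hand-typed stand-in for the scraped table; whitespace is assumed.
table <- tibble(
  `Rank & Title` = c(
    "1.\n      The Shawshank Redemption\n        (1994)",
    "2.\n      The Godfather\n        (1972)"
  ),
  `IMDb Rating` = c(9.2, 9.2)
)

ratings <- table |>
  # select() renames and keeps only these two columns in one step.
  select(rank_title_year = `Rank & Title`, rating = `IMDb Rating`) |>
  # Collapse each newline plus its indentation into a single space.
  mutate(rank_title_year = str_replace_all(rank_title_year, "\n +", " ")) |>
  # Named patterns become columns; unnamed patterns are matched and dropped.
  separate_wider_regex(
    rank_title_year,
    patterns = c(
      rank = "\\d+", "\\. ",
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  )

ratings
```

<p>Note that <code>rank</code> and <code>year</code> come out as character columns; convert them afterwards if you need numbers.</p>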
@@ -438,12 +425,12 @@ ratings
html_elements("td strong") |&gt;
head() |&gt;
html_attr("title")
#&gt; [1] "9.2 based on 2,684,096 user ratings"
#&gt; [2] "9.2 based on 1,861,107 user ratings"
#&gt; [3] "9.0 based on 2,657,484 user ratings"
#&gt; [4] "9.0 based on 1,273,669 user ratings"
#&gt; [5] "9.0 based on 792,941 user ratings"
#&gt; [6] "8.9 based on 1,357,901 user ratings"</pre>
#&gt; [1] "9.2 based on 2,691,480 user ratings"
#&gt; [2] "9.2 based on 1,867,146 user ratings"
#&gt; [3] "9.0 based on 2,665,189 user ratings"
#&gt; [4] "9.0 based on 1,276,943 user ratings"
#&gt; [5] "9.0 based on 795,129 user ratings"
#&gt; [6] "8.9 based on 1,361,148 user ratings"</pre>
</div>
<p>We can combine this with the tabular data and again apply <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code> to extract out the bit of data we care about:</p>
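<p>As a rough sketch of that extraction, the example below starts from two of the <code>title</code> attribute strings shown above instead of the live page. The column name <code>rating_n</code> and the result name <code>ratings_n</code> are assumptions made for this illustration.</p>

```r
library(dplyr)
library(tidyr)
library(readr)

# Stand-in data: two titles with the attribute strings shown in the output above.
ratings <- tibble(
  title = c("The Shawshank Redemption", "The Godfather"),
  rating_n = c(
    "9.2 based on 2,691,480 user ratings",
    "9.2 based on 1,867,146 user ratings"
  )
)

ratings_n <- ratings |>
  # Only the named pattern (number) is kept as a column.
  separate_wider_regex(
    rating_n,
    patterns = c(
      "[0-9.]+ based on ",
      number = "[0-9,]+",
      " user ratings"
    )
  ) |>
  # parse_number() drops the thousands separators and returns a double.
  mutate(number = parse_number(number))

ratings_n
```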
<div class="cell">
@@ -465,12 +452,12 @@ ratings
#&gt; # A tibble: 250 × 5
#&gt; rank title year rating number
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 The Shawshank Redemption 1994 9.2 2684096
#&gt; 2 2 The Godfather 1972 9.2 1861107
#&gt; 3 3 The Dark Knight 2008 9 2657484
#&gt; 4 4 The Godfather: Part II 1974 9 1273669
#&gt; 5 5 12 Angry Men 1957 9 792941
#&gt; 6 6 Schindler's List 1993 8.9 1357901
#&gt; 1 1 The Shawshank Redemption 1994 9.2 2691480
#&gt; 2 2 The Godfather 1972 9.2 1867146
#&gt; 3 3 The Dark Knight 2008 9 2665189
#&gt; 4 4 The Godfather: Part II 1974 9 1276943
#&gt; 5 5 12 Angry Men 1957 9 795129
#&gt; 6 6 Schindler's List 1993 8.9 1361148
#&gt; # … with 244 more rows</pre>
</div>
</section>
@@ -483,7 +470,7 @@ Dynamic sites</h1>
<p>It's still possible to scrape these types of sites, but rvest needs to use a more expensive process: fully simulating the web browser including running all JavaScript. This functionality is not available at the time of writing, but it's something we're actively working on and should be available by the time you read this. It uses the <a href="https://rstudio.github.io/chromote/index.html">chromote package</a> which actually runs the Chrome browser in the background, and gives you additional tools to interact with the site, like a human typing text and clicking buttons. Check out the rvest website for more details.</p>
</section>
<section id="summary" data-type="sect1">
<section id="webscraping-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you've learned about the why, the why not, and the how of scraping data from web pages. First, you've learned about the basics of HTML and using CSS selectors to refer to specific elements, then you've learned about using the rvest package to get data out of HTML into R. We then demonstrated web scraping with two case studies: a simpler scenario on scraping data on Star Wars films from the rvest package website and a more complex scenario on scraping the top 250 films from IMDb.</p>