More minor page count tweaks & fixes

And re-convert with latest htmlbook
2023-01-26 10:36:07 -06:00
parent d9afa135fc
commit aa9d72a7c6
38 changed files with 838 additions and 1093 deletions
--- a/oreilly/webscraping.html
+++ b/oreilly/webscraping.html
@@ -1,6 +1,6 @@
 <section data-type="chapter" id="chp-webscraping">
 <h1><span id="sec-scraping" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Web scraping</span></span></h1><p>This chapter introduces you to the basics of web scraping with <a href="https://rvest.tidyverse.org">rvest</a>. Web scraping is a very useful tool for extracting data from web pages. Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a>. Where possible, you should use the API, because typically it will give you more reliable data. Unfortunately, however, programming with web APIs is out of scope for this book. Instead, we are teaching scraping, a technique that works whether or not a site provides an API.</p><p>In this chapter, we’ll first discuss the ethics and legalities of scraping before we dive into the basics of HTML. You’ll then learn the basics of CSS selectors to locate specific elements on the page, and how to use rvest functions to get data from text and attributes out of HTML and into R. We’ll then discuss some techniques to figure out what CSS selector you need for the page you’re scraping, before finishing up with a couple of case studies, and a brief discussion of dynamic websites.</p>
-<section id="prerequisites" data-type="sect2">
+<section id="webscraping-prerequisites" data-type="sect2">
 <h2>
 Prerequisites</h2>
 <p>In this chapter, we’ll focus on tools provided by rvest. rvest is a member of the tidyverse, but is not a core member so you’ll need to load it explicitly. We’ll also load the full tidyverse since we’ll find it generally useful working with the data we’ve scraped.</p>
@@ -240,7 +240,7 @@ html |&gt;
 <p><code><a href="https://rvest.tidyverse.org/reference/html_attr.html">html_attr()</a></code> always returns a string, so if you’re extracting numbers or dates, you’ll need to do some post-processing.</p>
 </section>

-<section id="tables" data-type="sect2">
+<section id="webscraping-tables" data-type="sect2">
 <h2>
 Tables</h2>
 <p>If you’re lucky, your data will already be stored in an HTML table, and it’ll be a matter of just reading it from that table. It’s usually straightforward to recognize a table in your browser: it’ll have a rectangular structure of rows and columns, and you can copy and paste it into a tool like Excel.</p>
@@ -248,22 +248,10 @@ Tables</h2>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">html &lt;- minimal_html("
  &lt;table class='mytable'&gt;
-    &lt;tr&gt;
-      &lt;th&gt;x&lt;/th&gt;
-      &lt;th&gt;y&lt;/th&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td&gt;1.5&lt;/td&gt;
-      &lt;td&gt;2.7&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td&gt;4.9&lt;/td&gt;
-      &lt;td&gt;1.3&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td&gt;7.2&lt;/td&gt;
-      &lt;td&gt;8.1&lt;/td&gt;
-    &lt;/tr&gt;
+    &lt;tr&gt;&lt;th&gt;x&lt;/th&gt;   &lt;th&gt;y&lt;/th&gt;&lt;/tr&gt;
+    &lt;tr&gt;&lt;td&gt;1.5&lt;/td&gt; &lt;td&gt;2.7&lt;/td&gt;&lt;/tr&gt;
+    &lt;tr&gt;&lt;td&gt;4.9&lt;/td&gt; &lt;td&gt;1.3&lt;/td&gt;&lt;/tr&gt;
+    &lt;tr&gt;&lt;td&gt;7.2&lt;/td&gt; &lt;td&gt;8.1&lt;/td&gt;&lt;/tr&gt;
  &lt;/table&gt;
  ")</pre>
 </div>
@@ -374,7 +362,6 @@ section |&gt; html_element(".director") |&gt; html_text2()
 IMDB top films</h2>
 <p>For our next task we’ll tackle something a little trickier: extracting the top 250 movies from the Internet Movie Database (IMDb). At the time we wrote this chapter, the page looked like <a href="#fig-scraping-imdb" data-type="xref">#fig-scraping-imdb</a>.</p>
 <div class="cell">
-<pre data-type="programlisting" data-code-language="r">knitr::include_graphics("screenshots/scraping-imdb.png", dpi = 300)</pre>
 <div class="cell-output-display">

 <figure id="fig-scraping-imdb"><p><img src="screenshots/scraping-imdb.png" alt="The screenshot shows a table with columns &quot;Rank and Title&quot;, &quot;IMDb Rating&quot;, and &quot;Your Rating&quot;. 9 movies out of the top 250 are shown. The top 5 are the Shawshank Redemption, The Godfather, The Dark Knight, The Godfather: Part II, and 12 Angry Men." width="418"/></p>
@@ -392,14 +379,14 @@ table &lt;- html |&gt;
  html_table()
 table
 #&gt; # A tibble: 250 × 5
-#&gt;   ``    `Rank &amp; Title`                      `IMDb Rating` `Your Rating` ``   
-#&gt;   &lt;lgl&gt; &lt;chr&gt;                                       &lt;dbl&gt; &lt;chr&gt;         &lt;lgl&gt;
-#&gt; 1 NA    "1.\n      The Shawshank Redemptio…           9.2 "12345678910… NA   
-#&gt; 2 NA    "2.\n      The Godfather\n        …           9.2 "12345678910… NA   
-#&gt; 3 NA    "3.\n      The Dark Knight\n      …           9   "12345678910… NA   
-#&gt; 4 NA    "4.\n      The Godfather: Part II\…           9   "12345678910… NA   
-#&gt; 5 NA    "5.\n      12 Angry Men\n        (…           9   "12345678910… NA   
-#&gt; 6 NA    "6.\n      Schindler's List\n     …           8.9 "12345678910… NA   
+#&gt;   ``    `Rank &amp; Title`                    `IMDb Rating` `Your Rating`   ``   
+#&gt;   &lt;lgl&gt; &lt;chr&gt;                                     &lt;dbl&gt; &lt;chr&gt;           &lt;lgl&gt;
+#&gt; 1 NA    "1.\n      The Shawshank Redempt…           9.2 "12345678910\n… NA   
+#&gt; 2 NA    "2.\n      The Godfather\n      …           9.2 "12345678910\n… NA   
+#&gt; 3 NA    "3.\n      The Dark Knight\n    …           9   "12345678910\n… NA   
+#&gt; 4 NA    "4.\n      The Godfather: Part I…           9   "12345678910\n… NA   
+#&gt; 5 NA    "5.\n      12 Angry Men\n       …           9   "12345678910\n… NA   
+#&gt; 6 NA    "6.\n      Schindler's List\n   …           8.9 "12345678910\n… NA   
 #&gt; # … with 244 more rows</pre>
 </div>
 <p>This includes a few empty columns, but overall does a good job of capturing the information from the table. However, we need to do some more processing to make it easier to use. First, we’ll rename the columns to be easier to work with, and remove the extraneous whitespace in rank and title. We will do this with <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> (instead of <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>) to do the renaming and selecting of just these two columns in one step. Then, we’ll apply <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code> (from <a href="#sec-extract-variables" data-type="xref">#sec-extract-variables</a>) to pull out the title, year, and rank into their own variables.</p>
@@ -438,12 +425,12 @@ ratings
  html_elements("td strong") |&gt; 
  head() |&gt; 
  html_attr("title")
-#&gt; [1] "9.2 based on 2,684,096 user ratings"
-#&gt; [2] "9.2 based on 1,861,107 user ratings"
-#&gt; [3] "9.0 based on 2,657,484 user ratings"
-#&gt; [4] "9.0 based on 1,273,669 user ratings"
-#&gt; [5] "9.0 based on 792,941 user ratings"  
-#&gt; [6] "8.9 based on 1,357,901 user ratings"</pre>
+#&gt; [1] "9.2 based on 2,691,480 user ratings"
+#&gt; [2] "9.2 based on 1,867,146 user ratings"
+#&gt; [3] "9.0 based on 2,665,189 user ratings"
+#&gt; [4] "9.0 based on 1,276,943 user ratings"
+#&gt; [5] "9.0 based on 795,129 user ratings"  
+#&gt; [6] "8.9 based on 1,361,148 user ratings"</pre>
 </div>
 <p>We can combine this with the tabular data and again apply <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code> to extract out the bit of data we care about:</p>
 <div class="cell">
@@ -465,12 +452,12 @@ ratings
 #&gt; # A tibble: 250 × 5
 #&gt;   rank  title                    year  rating  number
 #&gt;   &lt;chr&gt; &lt;chr&gt;                    &lt;chr&gt;  &lt;dbl&gt;   &lt;dbl&gt;
-#&gt; 1 1     The Shawshank Redemption 1994     9.2 2684096
-#&gt; 2 2     The Godfather            1972     9.2 1861107
-#&gt; 3 3     The Dark Knight          2008     9   2657484
-#&gt; 4 4     The Godfather: Part II   1974     9   1273669
-#&gt; 5 5     12 Angry Men             1957     9    792941
-#&gt; 6 6     Schindler's List         1993     8.9 1357901
+#&gt; 1 1     The Shawshank Redemption 1994     9.2 2691480
+#&gt; 2 2     The Godfather            1972     9.2 1867146
+#&gt; 3 3     The Dark Knight          2008     9   2665189
+#&gt; 4 4     The Godfather: Part II   1974     9   1276943
+#&gt; 5 5     12 Angry Men             1957     9    795129
+#&gt; 6 6     Schindler's List         1993     8.9 1361148
 #&gt; # … with 244 more rows</pre>
 </div>
 </section>
@@ -483,7 +470,7 @@ Dynamic sites</h1>
 <p>It’s still possible to scrape these types of sites, but rvest needs to use a more expensive process: fully simulating the web browser, including running all JavaScript. This functionality is not available at the time of writing, but it’s something we’re actively working on and should be available by the time you read this. It uses the <a href="https://rstudio.github.io/chromote/index.html">chromote package</a> which actually runs the Chrome browser in the background, and gives you additional tools to interact with the site, like a human typing text and clicking buttons. Check out the rvest website for more details.</p>
 </section>

-<section id="summary" data-type="sect1">
+<section id="webscraping-summary" data-type="sect1">
 <h1>
 Summary</h1>
 <p>In this chapter, you’ve learned about the why, the why not, and the how of scraping data from web pages. First, you’ve learned about the basics of HTML and using CSS selectors to refer to specific elements, then you’ve learned about using the rvest package to get data out of HTML into R. We then demonstrated web scraping with two case studies: a simpler scenario scraping data on Star Wars films from the rvest package website, and a more complex scenario scraping the top 250 films from IMDb.</p>