Re-render book for O'Reilly

Hadley Wickham
2023-01-12 17:22:57 -06:00
parent 28671ed8bd
commit 360d65ae47
113 changed files with 4957 additions and 2997 deletions

@@ -12,12 +12,12 @@ Introduction</h1>
— Hadley Wickham</p>
</blockquote>
<p>In this chapter, you will learn a consistent way to organize your data in R using a system called <strong>tidy data</strong>. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.</p>
<p>In this chapter, you’ll first learn the definition of tidy data and see it applied to simple toy dataset. Then we’ll dive into the main tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data, without changing any of the values. We’ll finish up with a discussion of usefully untidy data, and how you can create it if needed.</p>
<p>In this chapter, you’ll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we’ll dive into the primary tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data without changing any of the values. We’ll finish with a discussion of usefully untidy data and how you can create it if needed.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.</p>
<p>In this chapter, we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
</div>
@@ -28,7 +28,7 @@ Prerequisites</h2>
<section id="sec-tidy-data" data-type="sect1">
<h1>
Tidy data</h1>
<p>You can represent the same underlying data in multiple ways. The example below shows the same data organised in four different ways. Each dataset shows the same values of four variables: <em>country</em>, <em>year</em>, <em>population</em>, and <em>cases</em> of TB (tuberculosis), but each dataset organizes the values in a different way.</p>
<p>You can represent the same underlying data in multiple ways. The example below shows the same data organized in four different ways. Each dataset shows the same values of four variables: <em>country</em>, <em>year</em>, <em>population</em>, and <em>cases</em> of TB (tuberculosis), but each dataset organizes the values in a different way.</p>
<!-- TODO redraw as tables -->
<div class="cell">
@@ -83,7 +83,7 @@ table4b # population
<p>These are all representations of the same underlying data, but they are not equally easy to use. One of them, <code>table1</code>, will be much easier to work with inside the tidyverse because it’s tidy.</p>
<p>There are three interrelated rules that make a dataset tidy:</p>
<ol type="1"><li>Each variable is a column; each column is a variable.</li>
<li>Each observation is row; each row is an observation.</li>
<li>Each observation is a row; each row is an observation.</li>
<li>Each value is a cell; each cell is a single value.</li>
</ol><p><a href="#fig-tidy-structure" data-type="xref">#fig-tidy-structure</a> shows the rules visually.</p>
<div class="cell">
@@ -96,8 +96,8 @@ table4b # population
</div>
<p>Why ensure that your data is tidy? There are two main advantages:</p>
<ol type="1"><li><p>There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.</p></li>
<li><p>There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in <a href="#sec-mutate" data-type="xref">#sec-mutate</a> and <a href="#sec-summarize" data-type="xref">#sec-summarize</a>, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.</p></li>
</ol><p>dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a couple of small examples showing how you might work with <code>table1</code>.</p>
<li><p>There’s a specific advantage to placing variables in columns because it allows R’s vectorized nature to shine. As you learned in <a href="#sec-mutate" data-type="xref">#sec-mutate</a> and <a href="#sec-summarize" data-type="xref">#sec-summarize</a>, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.</p></li>
</ol><p>dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a few small examples showing how you might work with <code>table1</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Compute rate per 10,000
table1 |&gt;
@@ -124,12 +124,12 @@ table1 |&gt;
#&gt; 2 2000 296920
# Visualise changes over time
ggplot(table1, aes(year, cases)) +
ggplot(table1, aes(x = year, y = cases)) +
geom_line(aes(group = country), color = "grey50") +
geom_point(aes(color = country, shape = country)) +
scale_x_continuous(breaks = c(1999, 2000))</pre>
<div class="cell-output-display">
<p><img src="data-tidy_files/figure-html/unnamed-chunk-5-1.png" alt="This figure shows the numbers of cases in 1999 and 2000 for Afghanistan, Brazil, and China, with year on the x-axis and number of cases on the y-axis. Each point on the plot represents the number of cases in a given country in a given year. The points for each country are differentiated from others by color and shape and connected with a line, resulting in three, non-parallel, non-intersecting lines. The numbers of cases in China are highest for both 1999 and 2000, with values above 200,000 for both years. The number of cases in Brazil is approximately 40,000 in 1999 and approximately 75,000 in 2000. The numbers of cases in Afghanistan are lowest for both 1999 and 2000, with values that appear to be very close to 0 on this scale." width="480"/></p>
<p><img src="data-tidy_files/figure-html/unnamed-chunk-5-1.png" alt="This figure shows the number of cases in 1999 and 2000 for Afghanistan, Brazil, and China, with year on the x-axis and number of cases on the y-axis. Each point on the plot represents the number of cases in a given country in a given year. The points for each country are differentiated from others by color and shape and connected with a line, resulting in three, non-parallel, non-intersecting lines. The numbers of cases in China are highest for both 1999 and 2000, with values above 200,000 for both years. The number of cases in Brazil is approximately 40,000 in 1999 and approximately 75,000 in 2000. The numbers of cases in Afghanistan are lowest for both 1999 and 2000, with values that appear to be very close to 0 on this scale." width="480"/></p>
</div>
</div>
@@ -166,15 +166,15 @@ Data in column names</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">billboard
#&gt; # A tibble: 317 × 79
#&gt; artist track date.ent…¹ wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
#&gt; 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
#&gt; 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
#&gt; 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
#&gt; 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
#&gt; 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
#&gt; # … with 311 more rows, 68 more variables: wk9 &lt;dbl&gt;, wk10 &lt;dbl&gt;,
#&gt; artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
#&gt; 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
#&gt; 3 3 Doors… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
#&gt; 4 3 Doors… Loser 2000-10-21 76 76 72 69 67 65 55 59
#&gt; 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
#&gt; 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
#&gt; # … with 311 more rows, and 68 more variables: wk9 &lt;dbl&gt;, wk10 &lt;dbl&gt;,
#&gt; # wk11 &lt;dbl&gt;, wk12 &lt;dbl&gt;, wk13 &lt;dbl&gt;, wk14 &lt;dbl&gt;, wk15 &lt;dbl&gt;, wk16 &lt;dbl&gt;,
#&gt; # wk17 &lt;dbl&gt;, wk18 &lt;dbl&gt;, wk19 &lt;dbl&gt;, wk20 &lt;dbl&gt;, wk21 &lt;dbl&gt;, wk22 &lt;dbl&gt;,
#&gt; # wk23 &lt;dbl&gt;, wk24 &lt;dbl&gt;, wk25 &lt;dbl&gt;, wk26 &lt;dbl&gt;, wk27 &lt;dbl&gt;, wk28 &lt;dbl&gt;,
@@ -261,7 +261,7 @@ billboard_tidy
<p>Now we’re in a good position to look at how song ranks vary over time by drawing a plot. The code is shown below and the result is <a href="#fig-billboard-ranks" data-type="xref">#fig-billboard-ranks</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">billboard_tidy |&gt;
ggplot(aes(week, rank, group = track)) +
ggplot(aes(x = week, y = rank, group = track)) +
geom_line(alpha = 1/3) +
scale_y_reverse()</pre>
<div class="cell-output-display">
@@ -339,21 +339,21 @@ Many variables in column names</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">who2
#&gt; # A tibble: 7,240 × 58
#&gt; country year sp_m_014 sp_m_1…¹ sp_m_…² sp_m_…³ sp_m_…⁴ sp_m_…⁵ sp_m_65
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 1980 NA NA NA NA NA NA NA
#&gt; 2 Afghanistan 1981 NA NA NA NA NA NA NA
#&gt; 3 Afghanistan 1982 NA NA NA NA NA NA NA
#&gt; 4 Afghanistan 1983 NA NA NA NA NA NA NA
#&gt; 5 Afghanistan 1984 NA NA NA NA NA NA NA
#&gt; 6 Afghanistan 1985 NA NA NA NA NA NA NA
#&gt; # … with 7,234 more rows, 49 more variables: sp_f_014 &lt;dbl&gt;,
#&gt; # sp_f_1524 &lt;dbl&gt;, sp_f_2534 &lt;dbl&gt;, sp_f_3544 &lt;dbl&gt;, sp_f_4554 &lt;dbl&gt;,
#&gt; # sp_f_5564 &lt;dbl&gt;, sp_f_65 &lt;dbl&gt;, sn_m_014 &lt;dbl&gt;, sn_m_1524 &lt;dbl&gt;,
#&gt; # sn_m_2534 &lt;dbl&gt;, sn_m_3544 &lt;dbl&gt;, sn_m_4554 &lt;dbl&gt;, sn_m_5564 &lt;dbl&gt;,
#&gt; # sn_m_65 &lt;dbl&gt;, sn_f_014 &lt;dbl&gt;, sn_f_1524 &lt;dbl&gt;, sn_f_2534 &lt;dbl&gt;,
#&gt; # sn_f_3544 &lt;dbl&gt;, sn_f_4554 &lt;dbl&gt;, sn_f_5564 &lt;dbl&gt;, sn_f_65 &lt;dbl&gt;,
#&gt; # ep_m_014 &lt;dbl&gt;, ep_m_1524 &lt;dbl&gt;, ep_m_2534 &lt;dbl&gt;, ep_m_3544 &lt;dbl&gt;,</pre>
#&gt; country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanist… 1980 NA NA NA NA NA NA
#&gt; 2 Afghanist… 1981 NA NA NA NA NA NA
#&gt; 3 Afghanist… 1982 NA NA NA NA NA NA
#&gt; 4 Afghanist… 1983 NA NA NA NA NA NA
#&gt; 5 Afghanist… 1984 NA NA NA NA NA NA
#&gt; 6 Afghanist… 1985 NA NA NA NA NA NA
#&gt; # … with 7,234 more rows, and 50 more variables: sp_m_65 &lt;dbl&gt;,
#&gt; # sp_f_014 &lt;dbl&gt;, sp_f_1524 &lt;dbl&gt;, sp_f_2534 &lt;dbl&gt;, sp_f_3544 &lt;dbl&gt;,
#&gt; # sp_f_4554 &lt;dbl&gt;, sp_f_5564 &lt;dbl&gt;, sp_f_65 &lt;dbl&gt;, sn_m_014 &lt;dbl&gt;,
#&gt; # sn_m_1524 &lt;dbl&gt;, sn_m_2534 &lt;dbl&gt;, sn_m_3544 &lt;dbl&gt;, sn_m_4554 &lt;dbl&gt;,
#&gt; # sn_m_5564 &lt;dbl&gt;, sn_m_65 &lt;dbl&gt;, sn_f_014 &lt;dbl&gt;, sn_f_1524 &lt;dbl&gt;,
#&gt; # sn_f_2534 &lt;dbl&gt;, sn_f_3544 &lt;dbl&gt;, sn_f_4554 &lt;dbl&gt;, sn_f_5564 &lt;dbl&gt;,
#&gt; # sn_f_65 &lt;dbl&gt;, ep_m_014 &lt;dbl&gt;, ep_m_1524 &lt;dbl&gt;, ep_m_2534 &lt;dbl&gt;, …</pre>
</div>
<p>This dataset records tuberculosis data collected by the WHO. There are two columns that are already variables and are easy to interpret: <code>country</code> and <code>year</code>. They are followed by 56 columns like <code>sp_m_014</code>, <code>ep_m_4554</code>, and <code>rel_m_3544</code>. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by <code>_</code>. The first piece, <code>sp</code>/<code>sn</code>/<code>rel</code>/<code>ep</code>, describes the method used for the <code>diagnosis</code>; the second piece, <code>m</code>/<code>f</code>, is the <code>gender</code>; and the third piece, <code>014</code>/<code>1524</code>/<code>2534</code>/<code>3544</code>/<code>4554</code>/<code>5564</code>/<code>65</code>, is the <code>age</code> range.</p>
<p>So in this case we have six variables: two variables are already columns, three variables are contained in the column names, and one variable is in the cell values. This requires two changes to our call to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: <code>names_to</code> gets a vector of new variable names and <code>names_sep</code> describes how to split each original column name into pieces:</p>
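<p>As a minimal sketch of how <code>names_to</code> and <code>names_sep</code> work together, the example below uses a small made-up tibble, <code>toy</code>, whose column names follow the same <code>method_gender_age</code> pattern. The data and the names <code>toy</code>, <code>toy_long</code>, and <code>count</code> are assumptions for illustration only and are not part of <code>who2</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical toy data: column names encode diagnosis, gender, and age
toy &lt;- tribble(
  ~country, ~year, ~sp_m_014, ~sp_f_014, ~ep_m_1524,
  "A",      2000,  1,         2,         3,
  "B",      2000,  4,         5,         6
)

toy_long &lt;- toy |&gt;
  pivot_longer(
    cols = !(country:year),                      # pivot everything except the id columns
    names_to = c("diagnosis", "gender", "age"),  # one new column per piece of the name
    names_sep = "_",                             # split the old column names on "_"
    values_to = "count"                          # cell values go into a single column
  )

toy_long</pre>
</div>
<p>Each of the three pivoted columns contributes one row per country-year pair, so the two input rows become six rows with the columns <code>country</code>, <code>year</code>, <code>diagnosis</code>, <code>gender</code>, <code>age</code>, and <code>count</code>.</p>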
@@ -446,15 +446,15 @@ Widening data</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cms_patient_experience
#&gt; # A tibble: 500 × 5
#&gt; org_pac_id org_nm measure_cd measure_title prf_r…¹
#&gt; org_pac_id org_nm measure_cd measure_title prf_rate
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1 CAHPS for MIPS … 63
#&gt; 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2 CAHPS for MIPS … 87
#&gt; 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3 CAHPS for MIPS … 86
#&gt; 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5 CAHPS for MIPS … 57
#&gt; 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8 CAHPS for MIPS … 85
#&gt; 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS … 24
#&gt; # … with 494 more rows, and abbreviated variable name ¹prf_rate</pre>
#&gt; 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1 CAHPS for MIPS… 63
#&gt; 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2 CAHPS for MIPS… 87
#&gt; 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3 CAHPS for MIPS… 86
#&gt; 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5 CAHPS for MIPS… 57
#&gt; 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8 CAHPS for MIPS… 85
#&gt; 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS… 24
#&gt; # … with 494 more rows</pre>
</div>
<p>An observation is an organization, but each organization is spread across six rows, with one row for each variable, or measure. We can see the complete set of values for <code>measure_cd</code> and <code>measure_title</code> by using <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>:</p>
<div class="cell">
@@ -479,17 +479,16 @@ Widening data</h2>
values_from = prf_rate
)
#&gt; # A tibble: 500 × 9
#&gt; org_pac_id org_nm measu…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶ CAHPS…⁷
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CAR… CAHPS … 63 NA NA NA NA NA
#&gt; 2 0446157747 USC CAR… CAHPS … NA 87 NA NA NA NA
#&gt; 3 0446157747 USC CAR… CAHPS … NA NA 86 NA NA NA
#&gt; 4 0446157747 USC CAR… CAHPS … NA NA NA 57 NA NA
#&gt; 5 0446157747 USC CAR… CAHPS … NA NA NA NA 85 NA
#&gt; 6 0446157747 USC CAR… CAHPS … NA NA NA NA NA 24
#&gt; # … with 494 more rows, and abbreviated variable names ¹measure_title,
#&gt; # ²​CAHPS_GRP_1, ³CAHPS_GRP_2, ⁴CAHPS_GRP_3, ⁵CAHPS_GRP_5, ⁶CAHPS_GRP_8,
#&gt; # ⁷CAHPS_GRP_12</pre>
#&gt; org_pac_id org_nm measure_title CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDI… CAHPS for MI… 63 NA NA
#&gt; 2 0446157747 USC CARE MEDI… CAHPS for MI… NA 87 NA
#&gt; 3 0446157747 USC CARE MEDI… CAHPS for MI… NA NA 86
#&gt; 4 0446157747 USC CARE MEDI… CAHPS for MI… NA NA NA
#&gt; 5 0446157747 USC CARE MEDI… CAHPS for MI… NA NA NA
#&gt; 6 0446157747 USC CARE MEDI… CAHPS for MI… NA NA NA
#&gt; # … with 494 more rows, and 3 more variables: CAHPS_GRP_5 &lt;dbl&gt;,
#&gt; # CAHPS_GRP_8 &lt;dbl&gt;, CAHPS_GRP_12 &lt;dbl&gt;</pre>
</div>
<p>The output doesn’t look quite right; we still seem to have multiple rows for each organization. That’s because, by default, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> will attempt to preserve all the existing columns, including <code>measure_title</code>, which has six distinct values for each organization. To fix this problem we need to tell <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which columns identify each row; in this case those are the variables starting with <code>"org"</code>:</p>
<div class="cell">
@@ -500,16 +499,16 @@ Widening data</h2>
values_from = prf_rate
)
#&gt; # A tibble: 95 × 8
#&gt; org_pac_id org_nm CAHPS…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDICA… 63 87 86 57 85 24
#&gt; 2 0446162697 ASSOCIATION OF … 59 85 83 63 88 22
#&gt; 3 0547164295 BEAVER MEDICAL … 49 NA 75 44 73 12
#&gt; 4 0749333730 CAPE PHYSICIANS… 67 84 85 65 82 24
#&gt; 5 0840104360 ALLIANCE PHYSIC… 66 87 87 64 87 28
#&gt; 6 0840109864 REX HOSPITAL INC 73 87 84 67 91 30
#&gt; # … with 89 more rows, and abbreviated variable names ¹CAHPS_GRP_1,
#&gt; # ²CAHPS_GRP_2, ³CAHPS_GRP_3, ⁴CAHPS_GRP_5, ⁵CAHPS_GRP_8, ⁶CAHPS_GRP_12</pre>
#&gt; org_pac_id org_nm CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDICA… 63 87 86 57
#&gt; 2 0446162697 ASSOCIATION OF … 59 85 83 63
#&gt; 3 0547164295 BEAVER MEDICAL … 49 NA 75 44
#&gt; 4 0749333730 CAPE PHYSICIANS… 67 84 85 65
#&gt; 5 0840104360 ALLIANCE PHYSIC… 66 87 87 64
#&gt; 6 0840109864 REX HOSPITAL INC 73 87 84 67
#&gt; # … with 89 more rows, and 2 more variables: CAHPS_GRP_8 &lt;dbl&gt;,
#&gt; # CAHPS_GRP_12 &lt;dbl&gt;</pre>
</div>
<p>This gives us the output that we’re looking for.</p>
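<p>The effect of <code>id_cols</code> can also be seen on a much smaller example. The sketch below uses a made-up tibble, <code>toy</code>, with two organizations and two measures each; the data and all names in it are assumptions for illustration and are not taken from <code>cms_patient_experience</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical toy data: one row per organization and measure
toy &lt;- tribble(
  ~org_id, ~org_nm, ~measure_cd, ~measure_title, ~rate,
  "1",     "Org A", "M1",        "Measure one",  63,
  "1",     "Org A", "M2",        "Measure two",  87,
  "2",     "Org B", "M1",        "Measure one",  59,
  "2",     "Org B", "M2",        "Measure two",  85
)

toy_wide &lt;- toy |&gt;
  pivot_wider(
    id_cols = starts_with("org"),  # only these columns identify a row
    names_from = measure_cd,       # new column names come from the measure codes
    values_from = rate             # new cell values come from the rates
  )

toy_wide</pre>
</div>
<p>Because <code>measure_title</code> is left out of <code>id_cols</code>, it is dropped and each organization collapses to a single row with the columns <code>org_id</code>, <code>org_nm</code>, <code>M1</code>, and <code>M2</code>.</p>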
</section>
@@ -826,7 +825,7 @@ Pragmatic computation</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cms_patient_care |&gt;
filter(type == "observed") |&gt;
ggplot(aes(score)) +
ggplot(aes(x = score)) +
geom_histogram(binwidth = 2) +
facet_wrap(vars(measure_abbr))
#&gt; Warning: Removed 1 rows containing non-finite values (`stat_bin()`).</pre>
@@ -842,7 +841,7 @@ Pragmatic computation</h2>
names_from = measure_abbr,
values_from = score
) |&gt;
ggplot(aes(dyspnea_screening, dyspena_treatment)) +
ggplot(aes(x = dyspnea_screening, y = dyspena_treatment)) +
geom_point() +
coord_equal()</pre>
</div>