---
title: "Data tidy"
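subtitle: "Data analysis practice of regional water environment pollution"
author: "Su Ming, Wang Weidong<br>College of Resources and Environment, University of Chinese Academy of Sciences<br>Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences"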
date: today
lang: zh
format:
  revealjs:
    theme: dark
    slide-number: true
    chalkboard:
      buttons: true
    preview-links: auto
    lang: zh
    toc: true
    toc-depth: 1
    toc-title: Outline
    logo: ./_extensions/inst/img/ucaslogo.png
    css: ./_extensions/inst/css/revealjs.css
    pointer:
      key: "p"
      color: "#32cd32"
      pointerSize: 18
revealjs-plugins:
  - pointer
filters:
  - d2
---
```{r}
#| echo: false
knitr::opts_chunk$set(echo = TRUE)
source("../../coding/_common.R")
library(tidyverse)
```
## tidy data
```{r}
#| label: fig-tidy-structure
#| echo: false
#| fig-cap: |
#|   The following three rules make a dataset tidy: variables are columns,
#|   observations are rows, and values are cells.
#| fig-alt: |
#|   Three panels, each representing a tidy data frame. The first panel
#|   shows that each variable is a column. The second panel shows that each
#|   observation is a row. The third panel shows that each value is
#|   a cell.
knitr::include_graphics("../../image/tidy-1.png", dpi = 270)
```
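## Example: the same data, untidy and tidy

A minimal sketch of the three rules, assuming the `table2` dataset that ships with tidyr. There, `cases` and `population` share a single `count` column, so each observation is spread over two rows. Widening with `pivot_wider()` (covered later in these slides) gives one column per variable and one row per country and year.

```{r}
library(tidyr)

# table2 stores two variables (cases, population) in one `count` column
table2

# Give each variable its own column: one row per country and year
tidy_counts <- table2 |>
  pivot_wider(names_from = type, values_from = count)
tidy_counts
```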
## Simple calculations

```{r}
# Compute rate per 10,000
table1 |>
  mutate(rate = cases / population * 10000)
```
## Simple calculations

```{r}
# Compute total cases per year
table1 |>
  group_by(year) |>
  summarize(total_cases = sum(cases))
```
## Visualization

```{r}
#| fig-width: 5
#| fig-alt: |
#|   This figure shows the number of cases in 1999 and 2000 for
#|   Afghanistan, Brazil, and China, with year on the x-axis and number
#|   of cases on the y-axis. Each point on the plot represents the number
#|   of cases in a given country in a given year. The points for each
#|   country are differentiated from others by color and shape and connected
#|   with a line, resulting in three, non-parallel, non-intersecting lines.
#|   The numbers of cases in China are highest for both 1999 and 2000, with
#|   values above 200,000 for both years. The number of cases in Brazil is
#|   approximately 40,000 in 1999 and approximately 75,000 in 2000. The
#|   numbers of cases in Afghanistan are lowest for both 1999 and 2000, with
#|   values that appear to be very close to 0 on this scale.
# Visualize changes over time
ggplot(table1, aes(x = year, y = cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000
```
## Inspect the data
```{r}
billboard
```
## Reshaping data

```{r}
# Pivot the wk1, wk2, ... columns into week/rank pairs, dropping weeks with no rank
billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )
```
## Reshaping data

```{r}
billboard_longer <- billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  ) |>
  mutate(
    # Convert "wk1", "wk2", ... to the numbers 1, 2, ...
    week = parse_number(week)
  )
billboard_longer
```
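## Example: `parse_number()`

To make the `mutate()` step above concrete, here is a small sketch of `readr::parse_number()` applied to week labels. The `weeks` vector is made up for illustration and only mimics the `wk*` column names in `billboard`.

```{r}
library(readr)

# Hypothetical labels shaped like the billboard column names
weeks <- c("wk1", "wk2", "wk10", "wk76")

# parse_number() drops the non-numeric prefix and returns doubles
week_numbers <- parse_number(weeks)
week_numbers
```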
## Visualization

```{r}
#| label: fig-billboard-ranks
#| fig-cap: |
#|   A line plot showing how the rank of a song changes over time.
#| fig-alt: |
#|   A line plot with week on the x-axis and rank on the y-axis, where
#|   each line represents a song. Most songs appear to start at a high rank,
#|   rapidly accelerate to a low rank, and then decay again. There are
#|   surprisingly few tracks in the region when week is >20 and rank is
#|   >50.
billboard_longer |>
  ggplot(aes(x = week, y = rank, group = track)) +
  geom_line(alpha = 0.25) +
  scale_y_reverse()
```
## Exercise

```{r}
df <- tribble(
  ~id, ~bp1, ~bp2,
  "A",  100,  120,
  "B",  140,  115,
  "C",  120,  125
)
df |>
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
```
## Pivot diagram

```{r}
#| label: fig-pivot-variables
#| echo: false
#| fig-cap: |
#|   Columns that are already variables need to be repeated, once for
#|   each column that is pivoted.
#| fig-alt: |
#|   A diagram showing how `pivot_longer()` transforms a simple
#|   dataset, using color to highlight how the values in the `id` column
#|   ("A", "B", "C") are each repeated twice in the output because there are
#|   two columns being pivoted ("bp1" and "bp2").
knitr::include_graphics("../../image/tidy-data/variables.png", dpi = 270)
```
## Inspect the data
```{r}
who2
```
## Reshaping data

```{r}
# Each column name encodes diagnosis, gender, and age, separated by "_"
who2 |>
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"),
    names_sep = "_",
    values_to = "count"
  )
```
## Pivot diagram

```{r}
#| label: fig-pivot-multiple-names
#| echo: false
#| fig-cap: |
#|   Pivoting columns with multiple pieces of information in the names
#|   means that each column name now fills in values in multiple output
#|   columns.
#| fig-alt: |
#|   A diagram that uses color to illustrate how supplying `names_sep`
#|   and multiple `names_to` creates multiple variables in the output.
#|   The input has variable names "x_1" and "y_2" which are split up
#|   by "_" to create name and number columns in the output. This is
#|   similar to the case with a single `names_to`, but what would have been a
#|   single output variable is now separated into multiple variables.
knitr::include_graphics("../../image/tidy-data/multiple-names.png", dpi = 270)
```
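## Example: splitting names with `names_sep`

The diagram can be reproduced with a toy table. This is a sketch only: `df_names` and its values are invented, and the column names `x_1` and `y_2` follow the figure description.

```{r}
library(tidyr)
library(tibble)

# Hypothetical input: each column name carries two pieces of information
df_names <- tribble(
  ~id, ~x_1, ~y_2,
  "A",    1,    2,
  "B",    3,    4
)

# Split each column name at "_" into a `name` part and a `number` part
long_names <- df_names |>
  pivot_longer(
    cols = !id,
    names_to = c("name", "number"),
    names_sep = "_",
    values_to = "value"
  )
long_names
```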
## Inspect the data
```{r}
household
```
## Reshaping data

```{r}
# ".value" keeps the first part of each column name as an output column name
household |>
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  )
```
## Pivot diagram

```{r}
#| label: fig-pivot-names-and-values
#| echo: false
#| fig-cap: |
#|   Pivoting with `names_to = c(".value", "num")` splits the column names
#|   into two components: the first part determines the output column
#|   name (`x` or `y`), and the second part determines the value of the
#|   `num` column.
#| fig-alt: |
#|   A diagram that uses color to illustrate how the special ".value"
#|   sentinel works. The input has names "x_1", "x_2", "y_1", and "y_2",
#|   and we want to use the first component ("x", "y") as a variable name
#|   and the second ("1", "2") as the value for a new "num" column.
knitr::include_graphics("../../image/tidy-data/names-and-values.png", dpi = 270)
```
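## Example: the `.value` sentinel

A toy version of the diagram, as a sketch: `df_values` and its numbers are invented, and the column names follow the figure description. With `".value"` in `names_to`, the first name component becomes an output column (`x`, `y`) instead of a value.

```{r}
library(tidyr)
library(tibble)

# Hypothetical input with names of the form <variable>_<num>
df_values <- tribble(
  ~id, ~x_1, ~x_2, ~y_1, ~y_2,
  "A",    1,    2,    3,    4,
  "B",    5,    6,    7,    8
)

# "x" and "y" become columns; "1" and "2" become values of `num`
long_values <- df_values |>
  pivot_longer(
    cols = !id,
    names_to = c(".value", "num"),
    names_sep = "_"
  )
long_values
```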
## Inspect the data

```{r}
cms_patient_experience
# List each measure code with its title
cms_patient_experience |>
  distinct(measure_cd, measure_title)
```
## Reshaping data (wider)

```{r}
cms_patient_experience |>
  pivot_wider(
    names_from = measure_cd,
    values_from = prf_rate
  )
```
## Reshaping data (wider)

```{r}
# Identify each output row by the org_* columns only
cms_patient_experience |>
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )
```
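## Example: why `id_cols` matters

By default, `pivot_wider()` treats every column not named in `names_from` or `values_from` as an identifier, so a column that varies with the measure keeps the rows from collapsing. The sketch below shows this on an invented table; `scores` and its columns are hypothetical stand-ins for the `cms_patient_experience` layout.

```{r}
library(tidyr)
library(tibble)

# Hypothetical data: `title` varies together with `measure`
scores <- tribble(
  ~org, ~measure, ~title,      ~rate,
  "o1", "m1",     "Measure 1",    10,
  "o1", "m2",     "Measure 2",    20
)

# Default: `org` and `title` both identify rows, so two rows remain
wide_default <- scores |>
  pivot_wider(names_from = measure, values_from = rate)

# Explicit id_cols: one row per `org`, and `title` is dropped
wide_ids <- scores |>
  pivot_wider(id_cols = org, names_from = measure, values_from = rate)

wide_default
wide_ids
```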
## Exercise

```{r}
df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115,
  "A",        "bp2",    120,
  "A",        "bp3",    105
)
```
## Exercise

```{r}
df |>
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
```
## Exercise

```{r}
# "A" has two bp1 measurements, so id and measurement do not identify a single value
df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "A",        "bp1",    102,
  "A",        "bp2",    120,
  "B",        "bp1",    140,
  "B",        "bp2",    115
)
```
## Exercise

```{r}
df |>
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
```
## Exercise

```{r}
# Find the id/measurement combinations that occur more than once
df |>
  group_by(id, measurement) |>
  summarize(n = n(), .groups = "drop") |>
  filter(n > 1)
```
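## Example: resolving duplicates with `values_fn`

Once the duplicated combinations are known, one option is to tell `pivot_wider()` how to summarize them through its `values_fn` argument. The sketch below averages the duplicates. Averaging is an assumption made for illustration; whether it is appropriate depends on what the repeated measurements mean. `df_dup` repeats the exercise data under a new name.

```{r}
library(tidyr)
library(tibble)

# Same values as the exercise: "A" has two bp1 measurements
df_dup <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "A",        "bp1",    102,
  "A",        "bp2",    120,
  "B",        "bp1",    140,
  "B",        "bp2",    115
)

# Assumption: duplicated measurements are combined by taking their mean
wide_mean <- df_dup |>
  pivot_wider(
    names_from = measurement,
    values_from = value,
    values_fn = mean
  )
wide_mean
```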
## Discussion welcome! {.center}
`r rmdify::slideend(wechat = FALSE, type = "public", tel = FALSE, thislink = "https://drwater.rcees.ac.cn/course/public/RWEP/@PUB/SD/")`