---
title: "Data tidy"
subtitle: 《区域水环境污染数据分析实践》<br>Data analysis practice of regional water environment pollution
author: 苏命、王为东<br>中国科学院大学资源与环境学院<br>中国科学院生态环境研究中心
date: today
lang: zh
format:
  revealjs:
    theme: dark
    slide-number: true
    chalkboard:
      buttons: true
    preview-links: auto
    lang: zh
    toc: true
    toc-depth: 1
    toc-title: Outline
    logo: ./_extensions/inst/img/ucaslogo.png
    css: ./_extensions/inst/css/revealjs.css
    pointer:
      key: "p"
      color: "#32cd32"
      pointerSize: 18
revealjs-plugins:
  - pointer
filters:
  - d2
---

```{r}
#| echo: false
knitr::opts_chunk$set(echo = TRUE)
source("../../coding/_common.R")
library(tidyverse)
```

## tidy data

```{r}
#| label: fig-tidy-structure
#| echo: false
#| fig-cap: |
#|   The following three rules make a dataset tidy: variables are columns,
#|   observations are rows, and values are cells.
#| fig-alt: |
#|   Three panels, each representing a tidy data frame. The first panel
#|   shows that each variable is a column. The second panel shows that each
#|   observation is a row. The third panel shows that each value is
#|   a cell.

knitr::include_graphics("../../image/tidy-1.png", dpi = 270)
```
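
## tidy data: a small example

As a rough illustration of the three rules, the sketch below builds a hypothetical untidy table in which the year variable is hidden in the column names, then reshapes it so that each variable is a column. The tibble `untidy`, the column name `concentration`, and all values are made up for this example; `pivot_longer()` is covered in detail on later slides.

```{r}
library(tidyr)
library(tibble)

# Hypothetical example data: the years are stored as column names,
# so the "year" variable is not a column of its own
untidy <- tribble(
  ~site, ~`2021`, ~`2022`,
  "S1",      1.2,     1.5,
  "S2",      0.8,     0.9
)

# Tidy version: one row per site-year observation
tidy <- untidy |>
  pivot_longer(
    cols = !site,
    names_to = "year",
    values_to = "concentration"
  )
tidy
```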

## Simple calculations

```{r}
# Compute rate per 10,000
table1 |>
  mutate(rate = cases / population * 10000)
```

## Simple calculations

```{r}
# Compute total cases per year
table1 |>
  group_by(year) |>
  summarize(total_cases = sum(cases))
```

## Visualization

```{r}
#| fig-width: 5
#| fig-alt: |
#|   This figure shows the number of cases in 1999 and 2000 for
#|   Afghanistan, Brazil, and China, with year on the x-axis and number
#|   of cases on the y-axis. Each point on the plot represents the number
#|   of cases in a given country in a given year. The points for each
#|   country are differentiated from others by color and shape and connected
#|   with a line, resulting in three, non-parallel, non-intersecting lines.
#|   The numbers of cases in China are highest for both 1999 and 2000, with
#|   values above 200,000 for both years. The number of cases in Brazil is
#|   approximately 40,000 in 1999 and approximately 75,000 in 2000. The
#|   numbers of cases in Afghanistan are lowest for both 1999 and 2000, with
#|   values that appear to be very close to 0 on this scale.

# Visualize changes over time
ggplot(table1, aes(x = year, y = cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000
```

## Inspecting the data

```{r}
billboard
```

## Reshaping data

```{r}
# Pivot the wk* columns into week/rank pairs, dropping missing ranks
billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )
```

## Reshaping data

```{r}
billboard_longer <- billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  ) |>
  mutate(
    # Convert "wk1", "wk2", ... to the numbers 1, 2, ...
    week = parse_number(week)
  )
billboard_longer
```

## Visualization

```{r}
#| label: fig-billboard-ranks
#| fig-cap: |
#|   A line plot showing how the rank of a song changes over time.
#| fig-alt: |
#|   A line plot with week on the x-axis and rank on the y-axis, where
#|   each line represents a song. Most songs appear to start at a high rank,
#|   rapidly accelerate to a low rank, and then decay again. There are
#|   surprisingly few tracks in the region when week is >20 and rank is
#|   >50.

billboard_longer |>
  ggplot(aes(x = week, y = rank, group = track)) +
  geom_line(alpha = 0.25) +
  scale_y_reverse()
```

## Exercise

```{r}
df <- tribble(
  ~id, ~bp1, ~bp2,
  "A",  100,  120,
  "B",  140,  115,
  "C",  120,  125
)
df |>
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
```

## Pivot diagram

```{r}
#| label: fig-pivot-variables
#| echo: false
#| fig-cap: |
#|   Columns that are already variables need to be repeated, once for
#|   each column that is pivoted.
#| fig-alt: |
#|   A diagram showing how `pivot_longer()` transforms a simple
#|   dataset, using color to highlight how the values in the `id` column
#|   ("A", "B", "C") are each repeated twice in the output because there are
#|   two columns being pivoted ("bp1" and "bp2").

knitr::include_graphics("../../image/tidy-data/variables.png", dpi = 270)
```

## Inspecting the data

```{r}
who2
```

## Reshaping data

```{r}
# Split each column name at "_" into diagnosis, gender, and age
who2 |>
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"),
    names_sep = "_",
    values_to = "count"
  )
```

## Pivot diagram

```{r}
#| label: fig-pivot-multiple-names
#| echo: false
#| fig-cap: |
#|   Pivoting columns with multiple pieces of information in the names
#|   means that each column name now fills in values in multiple output
#|   columns.
#| fig-alt: |
#|   A diagram that uses color to illustrate how supplying `names_sep`
#|   and multiple `names_to` creates multiple variables in the output.
#|   The input has variable names "x_1" and "y_2" which are split up
#|   by "_" to create name and number columns in the output. This is
#|   similar to the case with a single `names_to`, but what would have been a
#|   single output variable is now separated into multiple variables.

knitr::include_graphics("../../image/tidy-data/multiple-names.png", dpi = 270)
```
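
## Pivot diagram: a small example

The diagram above can be approximated in code. This is a minimal sketch: the tibble `df_names`, its values, and the output column names `name` and `number` are hypothetical and chosen only to mirror the "x_1" and "y_2" column names mentioned in the figure description.

```{r}
library(tidyr)
library(tibble)

# Hypothetical input: each column name carries two pieces of
# information separated by "_"
df_names <- tribble(
  ~id, ~x_1, ~y_2,
  "A",    1,    2,
  "B",    3,    4,
  "C",    5,    6
)

# names_sep splits every column name into the two names_to columns
df_names_long <- df_names |>
  pivot_longer(
    cols = !id,
    names_to = c("name", "number"),
    names_sep = "_",
    values_to = "value"
  )
df_names_long
```

Note that `number` comes out as a character column; it can be converted with `parse_number()` or `as.integer()` if a numeric type is needed.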

## Inspecting the data

```{r}
household
```

## Reshaping data

```{r}
# ".value" turns the first name component into output column names;
# the second component becomes the value of the child column
household |>
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  )
```

## Pivot diagram

```{r}
#| label: fig-pivot-names-and-values
#| echo: false
#| fig-cap: |
#|   Pivoting with `names_to = c(".value", "num")` splits the column names
#|   into two components: the first part determines the output column
#|   name (`x` or `y`), and the second part determines the value of the
#|   `num` column.
#| fig-alt: |
#|   A diagram that uses color to illustrate how the special ".value"
#|   sentinel works. The input has names "x_1", "x_2", "y_1", and "y_2",
#|   and we want to use the first component ("x", "y") as a variable name
#|   and the second ("1", "2") as the value for a new "num" column.

knitr::include_graphics("../../image/tidy-data/names-and-values.png", dpi = 270)
```
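
## Pivot diagram: a small example

The behaviour of `".value"` described in the diagram can be sketched with a toy table. The tibble `df_values` and its numbers are hypothetical; only the column names "x_1", "x_2", "y_1", and "y_2" are taken from the figure description.

```{r}
library(tidyr)
library(tibble)

# Hypothetical input: the first name component (x, y) is a variable
# name, the second (1, 2) is a value for the new num column
df_values <- tribble(
  ~id, ~x_1, ~x_2, ~y_1, ~y_2,
  "A",    1,    2,    3,    4,
  "B",    5,    6,    7,    8
)

# No values_to is needed: ".value" supplies the output column names
df_values_long <- df_values |>
  pivot_longer(
    cols = !id,
    names_to = c(".value", "num"),
    names_sep = "_"
  )
df_values_long
```

Each `id` now has one row per value of `num`, with `x` and `y` kept as separate columns.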

## Inspecting the data

```{r}
cms_patient_experience
# List the distinct measure codes and their titles
cms_patient_experience |>
  distinct(measure_cd, measure_title)
```

## Reshaping data (wider)

```{r}
# Without id_cols, measure_title still helps identify rows,
# so each organization keeps several rows
cms_patient_experience |>
  pivot_wider(
    names_from = measure_cd,
    values_from = prf_rate
  )
```

## Reshaping data (wider)

```{r}
# Identify each output row by the org* columns only
cms_patient_experience |>
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )
```

## Exercise

```{r}
df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115,
  "A",        "bp2",    120,
  "A",        "bp3",    105
)
```

## Exercise

```{r}
df |>
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
```

## Exercise

```{r}
# id "A" has two bp1 measurements
df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "A",        "bp1",    102,
  "A",        "bp2",    120,
  "B",        "bp1",    140,
  "B",        "bp2",    115
)
```

## Exercise

```{r}
df |>
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
```

## Exercise

```{r}
# Find the id/measurement combinations that occur more than once
df |>
  group_by(id, measurement) |>
  summarize(n = n(), .groups = "drop") |>
  filter(n > 1)
```
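
## Exercise: one possible fix

Once the duplicated combinations are known, one option is to tell `pivot_wider()` how to combine them through its `values_fn` argument. The sketch below averages the duplicates. Averaging is an assumption made for illustration; depending on the data, keeping the first value or correcting the source records may be more appropriate. The name `df_dup` is hypothetical and simply repeats the exercise data so the chunk is self-contained.

```{r}
library(tidyr)
library(tibble)

# Same values as the exercise data: id "A" has two bp1 measurements
df_dup <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "A",        "bp1",    102,
  "A",        "bp2",    120,
  "B",        "bp1",    140,
  "B",        "bp2",    115
)

# values_fn = mean collapses duplicated id/measurement pairs
# into a single number per cell
df_dup_wide <- df_dup |>
  pivot_wider(
    names_from = measurement,
    values_from = value,
    values_fn = mean
  )
df_dup_wide
```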

## Discussion welcome! {.center}

`r rmdify::slideend(wechat = FALSE, type = "public", tel = FALSE, thislink = "https://drwater.rcees.ac.cn/course/public/RWEP/@PUB/SD/")`