---
title: "Data tidy"
subtitle: 《区域水环境污染数据分析实践》<br>Data analysis practice of regional water environment pollution
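author: 苏命、王为东<br>College of Resources and Environment, University of Chinese Academy of Sciences<br>Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences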
date: today
lang: zh
format:
  revealjs:
    theme: dark
    slide-number: true
    chalkboard:
      buttons: true
    preview-links: auto
    lang: zh
    toc: true
    toc-depth: 1
    toc-title: Outline
    logo: ./_extensions/inst/img/ucaslogo.png
    css: ./_extensions/inst/css/revealjs.css
    pointer:
      key: "p"
      color: "#32cd32"
      pointerSize: 18
revealjs-plugins:
  - pointer
filters:
  - d2
---

```{r}
#| echo: false
knitr::opts_chunk$set(echo = TRUE)
source("../../coding/_common.R")
library(tidyverse)
```


## tidy data

```{r}
#| label: fig-tidy-structure
#| echo: false
#| fig-cap: | 
#|   The following three rules make a dataset tidy: variables are columns,
#|   observations are rows, and values are cells.
#| fig-alt: | 
#|   Three panels, each representing a tidy data frame. The first panel
#|   shows that each variable is a column. The second panel shows that each
#|   observation is a row. The third panel shows that each value is
#|   a cell.

knitr::include_graphics("../../image/tidy-1.png", dpi = 270)
```
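
## Tidy vs. untidy: a small sketch

As a rough illustration of the three rules above, the sketch below takes tidyr's built-in `table2`, where `cases` and `population` share a single `count` column, and spreads it so that each variable gets its own column. The object name `tidy_from_table2` is made up for this slide.

```{r}
library(tidyr)

# In table2 each observation (one country in one year) occupies two rows:
# one for type "cases" and one for type "population".
tidy_from_table2 <- table2 |>
  pivot_wider(names_from = type, values_from = count)

# Now there is one row per country-year and one column per variable.
tidy_from_table2
```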

## Simple calculations

```{r}
# Compute rate per 10,000
table1 |>
  mutate(rate = cases / population * 10000)


```

## Simple calculations

```{r}
# Compute total cases per year
table1 |> 
  group_by(year) |> 
  summarize(total_cases = sum(cases))

```

## Visualization

```{r}
#| fig-width: 5
#| fig-alt: |
#|   This figure shows the number of cases in 1999 and 2000 for 
#|   Afghanistan, Brazil, and China, with year on the x-axis and number 
#|   of cases on the y-axis. Each point on the plot represents the number 
#|   of cases in a given country in a given year. The points for each
#|   country are differentiated from others by color and shape and connected
#|   with a line, resulting in three, non-parallel, non-intersecting lines.
#|   The numbers of cases in China are highest for both 1999 and 2000, with
#|   values above 200,000 for both years. The number of cases in Brazil is
#|   approximately 40,000 in 1999 and approximately 75,000 in 2000. The
#|   numbers of cases in Afghanistan are lowest for both 1999 and 2000, with
#|   values that appear to be very close to 0 on this scale.


# Visualize changes over time
ggplot(table1, aes(x = year, y = cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000
```

## Inspecting the data

```{r}
billboard
```


## Reshaping the data

```{r}
billboard |> 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank",
    values_drop_na = TRUE
  )
```

## Reshaping the data

```{r}
billboard_longer <- billboard |> 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank",
    values_drop_na = TRUE
  ) |> 
  mutate(
    week = parse_number(week)
  )
billboard_longer
```
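
## A smaller pivot_longer() sketch

To see what `values_drop_na = TRUE` and `parse_number()` do without scrolling through `billboard`, here is a minimal sketch on a made-up three-song table. The object `songs`, its track names, and its ranks are invented for illustration only.

```{r}
library(tidyr)
library(dplyr)
library(readr)

# Hypothetical data: song "b" has no rank in week 3, song "c" none after week 1.
songs <- tribble(
  ~track, ~wk1, ~wk2, ~wk3,
  "a",      10,    8,    5,
  "b",      40,   45,   NA,
  "c",      90,   NA,   NA
)

songs_longer <- songs |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE            # drop weeks that have no rank
  ) |>
  mutate(week = parse_number(week))  # "wk2" becomes the number 2

songs_longer
```

Without `values_drop_na = TRUE` the result would keep all nine track-week combinations, three of them with `NA` ranks.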

## Visualization

```{r}
#| label: fig-billboard-ranks
#| fig-cap: |
#|   A line plot showing how the rank of a song changes over time.
#| fig-alt: |
#|   A line plot with week on the x-axis and rank on the y-axis, where
#|   each line represents a song. Most songs appear to start at a high rank,
#|   rapidly accelerate to a low rank, and then decay again. There are
#|   surprisingly few tracks in the region when week is >20 and rank is
#|   >50.

billboard_longer |> 
  ggplot(aes(x = week, y = rank, group = track)) + 
  geom_line(alpha = 0.25) + 
  scale_y_reverse()
```


## Exercise

```{r}
df <- tribble(
  ~id,  ~bp1, ~bp2,
   "A",  100,  120,
   "B",  140,  115,
   "C",  120,  125
)
df |> 
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
```

## Pivoting diagram

```{r}
#| label: fig-pivot-variables
#| echo: false
#| fig-cap: | 
#|   Columns that are already variables need to be repeated, once for
#|   each column that is pivoted.
#| fig-alt: | 
#|   A diagram showing how `pivot_longer()` transforms a simple
#|   dataset, using color to highlight how the values in the `id` column
#|   ("A", "B", "C") are each repeated twice in the output because there are
#|   two columns being pivoted ("bp1" and "bp2").

knitr::include_graphics("../../image/tidy-data/variables.png", dpi = 270)
```

## Inspecting the data

```{r}
who2
```

## Reshaping the data

```{r}
who2 |> 
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"), 
    names_sep = "_",
    values_to = "count"
  )
```

## Pivoting diagram

```{r}
#| label: fig-pivot-multiple-names
#| echo: false
#| fig-cap: |
#|   Pivoting columns with multiple pieces of information in the names 
#|   means that each column name now fills in values in multiple output 
#|   columns.
#| fig-alt: |
#|   A diagram that uses color to illustrate how supplying `names_sep` 
#|   and multiple `names_to` creates multiple variables in the output.
#|   The input has variable names "x_1" and "y_2" which are split up
#|   by "_" to create name and number columns in the output. This is
#|   similar to the case with a single `names_to`, but what would have been a
#|   single output variable is now separated into multiple variables.

knitr::include_graphics("../../image/tidy-data/multiple-names.png", dpi = 270)
```
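
## A smaller names_sep sketch

The same idea can be tried on a table small enough to read in full. This is only a sketch: `tb` and its counts are invented, and its column names merely imitate the diagnosis_gender_age pattern used in `who2`.

```{r}
library(tidyr)
library(dplyr)

# Hypothetical counts; each column name packs three pieces of information.
tb <- tribble(
  ~country, ~year, ~sp_m_014, ~sp_f_014, ~rel_m_1524,
  "X",       2000,         3,         5,           7,
  "Y",       2000,         2,        NA,           4
)

tb_longer <- tb |>
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"),  # one output column per piece
    names_sep = "_",                             # split the names at "_"
    values_to = "count"
  )

tb_longer
```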

## Inspecting the data

```{r}
household
```

## Reshaping the data

```{r}
household |> 
  pivot_longer(
    cols = !family, 
    names_to = c(".value", "child"), 
    names_sep = "_", 
    values_drop_na = TRUE
  )
```

## Pivoting diagram

```{r}
#| label: fig-pivot-names-and-values
#| echo: false
#| fig-cap: |
#|   Pivoting with `names_to = c(".value", "num")` splits the column names
#|   into two components: the first part determines the output column
#|   name (`x` or `y`), and the second part determines the value of the
#|   `num` column.
#| fig-alt: |
#|   A diagram that uses color to illustrate how the special ".value"
#|   sentinel works. The input has names "x_1", "x_2", "y_1", and "y_2",
#|   and we want to use the first component ("x", "y") as a variable name
#|   and the second ("1", "2") as the value for a new "num" column.

knitr::include_graphics("../../image/tidy-data/names-and-values.png", dpi = 270)
```
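
## A smaller ".value" sketch

The diagram can be reproduced in code. The sketch below assumes a made-up table `df_xy` with the column names from the figure ("x_1", "x_2", "y_1", "y_2"); the numbers are arbitrary.

```{r}
library(tidyr)

# Hypothetical input: the first name component is a variable (x or y),
# the second is an index that should become a value in a "num" column.
df_xy <- tribble(
  ~id, ~x_1, ~x_2, ~y_1, ~y_2,
  "A",    1,    2,    3,    4,
  "B",    5,    6,    7,    8
)

df_xy_longer <- df_xy |>
  pivot_longer(
    cols = !id,
    names_to = c(".value", "num"),  # ".value" keeps "x"/"y" as column names
    names_sep = "_"
  )

df_xy_longer
```

Note that no `values_to` is needed here: the output value columns are named by the `.value` component.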

## Inspecting the data

```{r}
cms_patient_experience
cms_patient_experience |> 
  distinct(measure_cd, measure_title)
```

## Reshaping the data (wider)

```{r}
cms_patient_experience |> 
  pivot_wider(
    names_from = measure_cd,
    values_from = prf_rate
  )
```

## Reshaping the data (wider)

```{r}
cms_patient_experience |> 
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )
```
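
## Why id_cols matters: a small sketch

To make the effect of `id_cols` easier to see, the sketch below uses a made-up table `scores` (all names and numbers are invented). It has the same kind of problem as the first attempt above: a descriptive column (`title`) that varies together with the code column.

```{r}
library(tidyr)

# Hypothetical data: two organisations, two measures each.
scores <- tribble(
  ~org, ~code, ~title,       ~rate,
  "P1", "m1",  "Measure 1",     60,
  "P1", "m2",  "Measure 2",     70,
  "P2", "m1",  "Measure 1",     80,
  "P2", "m2",  "Measure 2",     90
)

# By default every remaining column (org and title) identifies a row,
# so the output still has four rows, padded with NA.
wide_default <- scores |>
  pivot_wider(names_from = code, values_from = rate)

# Declaring org as the only identifier gives one row per organisation.
wide_by_org <- scores |>
  pivot_wider(id_cols = org, names_from = code, values_from = rate)

wide_default
wide_by_org
```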

## Exercise

```{r}
df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115, 
  "A",        "bp2",    120,
  "A",        "bp3",    105
)
```

## Exercise

```{r}
df |> 
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
```


## Exercise

```{r}
df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "A",        "bp1",    102,
  "A",        "bp2",    120,
  "B",        "bp1",    140, 
  "B",        "bp2",    115
)
```

## Exercise

```{r}
df |>
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
```

## Exercise

```{r}
df |> 
  group_by(id, measurement) |> 
  summarize(n = n(), .groups = "drop") |> 
  filter(n > 1)
```
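
## One possible way to handle duplicates

Once the duplicated `id`/`measurement` pairs are found, something has to be decided about them. One option, shown here only as a sketch, is to summarise them while pivoting through the `values_fn` argument of `pivot_wider()`. Averaging is an assumption made for this example; whether a mean, the first value, or a data correction is appropriate depends on the data. The object is named `bp` here so that it does not overwrite `df`.

```{r}
library(tidyr)

# Same values as the exercise data: id "A" has two bp1 readings.
bp <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "A",        "bp1",    102,
  "A",        "bp2",    120,
  "B",        "bp1",    140,
  "B",        "bp2",    115
)

bp_wide <- bp |>
  pivot_wider(
    names_from = measurement,
    values_from = value,
    values_fn = mean   # collapse duplicated cells to their mean
  )

bp_wide
```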

## Questions and discussion welcome! {.center}


`r rmdify::slideend(wechat = FALSE, type = "public", tel = FALSE, thislink = "https://drwater.rcees.ac.cn/course/public/RWEP/@PUB/SD/")`
Prepare for lesson 7 2024-03-21 22:30:54 +08:00