RWEP/SD/20240326_4_datatidy/index.qmd

---
title: "Data tidy"
subtitle: 《区域水环境污染数据分析实践》<br>Data analysis practice of regional water environment pollution
author: 苏命、王为东<br>中国科学院大学资源与环境学院<br>中国科学院生态环境研究中心
date: today
lang: zh
format:
  revealjs:
    theme: dark
    slide-number: true
    chalkboard:
      buttons: true
    preview-links: auto
    lang: zh
    toc: true
    toc-depth: 1
    toc-title: 大纲
    logo: ./_extensions/inst/img/ucaslogo.png
    css: ./_extensions/inst/css/revealjs.css
    pointer:
      key: "p"
      color: "#32cd32"
      pointerSize: 18
revealjs-plugins:
  - pointer
filters:
  - d2
---

```{r}
#| echo: false
knitr::opts_chunk$set(echo = TRUE)
source("../../coding/_common.R")
library(tidyverse)
```


## tidy data

```{r}
#| label: fig-tidy-structure
#| echo: false
#| fig-cap: |
#|   The following three rules make a dataset tidy: variables are columns,
#|   observations are rows, and values are cells.
#| fig-alt: |
#|   Three panels, each representing a tidy data frame. The first panel
#|   shows that each variable is a column. The second panel shows that each
#|   observation is a row. The third panel shows that each value is
#|   a cell.

knitr::include_graphics("../../image/tidy-1.png", dpi = 270)
```

##  简单计算

```{r}
# Compute rate per 10,000
table1 |>
  mutate(rate = cases / population * 10000)


```

##  简单计算

```{r}
# Compute total cases per year
table1 |>
  group_by(year) |>
  summarize(total_cases = sum(cases))

```

##  可视化

```{r}
#| fig-width: 5
#| fig-alt: |
#|   This figure shows the number of cases in 1999 and 2000 for
#|   Afghanistan, Brazil, and China, with year on the x-axis and number
#|   of cases on the y-axis. Each point on the plot represents the number
#|   of cases in a given country in a given year. The points for each
#|   country are differentiated from others by color and shape and connected
#|   with a line, resulting in three, non-parallel, non-intersecting lines.
#|   The numbers of cases in China are highest for both 1999 and 2000, with
#|   values above 200,000 for both years. The number of cases in Brazil is
#|   approximately 40,000 in 1999 and approximately 75,000 in 2000. The
#|   numbers of cases in Afghanistan are lowest for both 1999 and 2000, with
#|   values that appear to be very close to 0 on this scale.


# Visualize changes over time
ggplot(table1, aes(x = year, y = cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000
```

## 查看数据

```{r}
billboard
```


## 数据变形

```{r}
billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )
```

## 数据变形

```{r}
billboard_longer <- billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  ) |>
  mutate(
    week = parse_number(week)
  )
billboard_longer
```

## 可视化

```{r}
#| label: fig-billboard-ranks
#| fig-cap: |
#|   A line plot showing how the rank of a song changes over time.
#| fig-alt: |
#|   A line plot with week on the x-axis and rank on the y-axis, where
#|   each line represents a song. Most songs appear to start at a high rank,
#|   rapidly accelerate to a low rank, and then decay again. There are
#|   surprisingly few tracks in the region when week is >20 and rank is
#|   >50.

billboard_longer |>
  ggplot(aes(x = week, y = rank, group = track)) +
  geom_line(alpha = 0.25) +
  scale_y_reverse()
```


## 练习

```{r}
df <- tribble(
  ~id,  ~bp1, ~bp2,
   "A",  100,  120,
   "B",  140,  115,
   "C",  120,  125
)
df |>
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
```

## 变形示意图

```{r}
#| label: fig-pivot-variables
#| echo: false
#| fig-cap: |
#|   Columns that are already variables need to be repeated, once for
#|   each column that is pivoted.
#| fig-alt: |
#|   A diagram showing how `pivot_longer()` transforms a simple
#|   dataset, using color to highlight how the values in the `id` column
#|   ("A", "B", "C") are each repeated twice in the output because there are
#|   two columns being pivoted ("bp1" and "bp2").

knitr::include_graphics("../../image/tidy-data/variables.png", dpi = 270)
```

## 查看数据

```{r}
who2
```

## 数据变形

```{r}
who2 |>
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"),
    names_sep = "_",
    values_to = "count"
  )
```

## 变形示意图

```{r}
#| label: fig-pivot-multiple-names
#| echo: false
#| fig-cap: |
#|   Pivoting columns with multiple pieces of information in the names
#|   means that each column name now fills in values in multiple output
#|   columns.
#| fig-alt: |
#|   A diagram that uses color to illustrate how supplying `names_sep`
#|   and multiple `names_to` creates multiple variables in the output.
#|   The input has variable names "x_1" and "y_2" which are split up
#|   by "_" to create name and number columns in the output. This is
#|   is similar case with a single `names_to`, but what would have been a
#|   single output variable is now separated into multiple variables.

knitr::include_graphics("../../image/tidy-data/multiple-names.png", dpi = 270)
```

## 查看数据

```{r}
household
```

## 数据变形

```{r}
household |>
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  )
```

## 变形示意图

```{r}
#| label: fig-pivot-names-and-values
#| echo: false
#| fig-cap: |
#|   Pivoting with `names_to = c(".value", "num")` splits the column names
#|   into two components: the first part determines the output column
#|   name (`x` or `y`), and the second part determines the value of the
#|   `num` column.
#| fig-alt: |
#|   A diagram that uses color to illustrate how the special ".value"
#|   sentinel works. The input has names "x_1", "x_2", "y_1", and "y_2",
#|   and we want to use the first component ("x", "y") as a variable name
#|   and the second ("1", "2") as the value for a new "num" column.

knitr::include_graphics("../../image/tidy-data/names-and-values.png", dpi = 270)
```

## 查看数据

```{r}
cms_patient_experience
cms_patient_experience |>
  distinct(measure_cd, measure_title)
```

## 数据变形（变宽）

```{r}
cms_patient_experience |>
  pivot_wider(
    names_from = measure_cd,
    values_from = prf_rate
  )
```

## 数据变形（变宽）

```{r}
cms_patient_experience |>
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )
```

## 练习

```{r}
df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115,
  "A",        "bp2",    120,
  "A",        "bp3",    105
)
```

## 练习

```{r}
df |>
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
```


## 练习

```{r}
df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "A",        "bp1",    102,
  "A",        "bp2",    120,
  "B",        "bp1",    140,
  "B",        "bp2",    115
)
```

## 练习

```{r}
df |>
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
```

## 练习

```{r}
df |>
  group_by(id, measurement) |>
  summarize(n = n(), .groups = "drop") |>
  filter(n > 1)
```

## 欢迎讨论！{.center}


`r rmdify::slideend(wechat = FALSE, type = "public", tel = FALSE, thislink = "https://drwater.rcees.ac.cn/course/public/RWEP/@PUB/SD/")`