---
title: "Data tidy"
subtitle: |
  《区域水环境污染数据分析实践》
  Data analysis practice of regional water environment pollution
author: |
  苏命、王为东
  中国科学院大学资源与环境学院
  中国科学院生态环境研究中心
date: today
lang: zh
format:
  revealjs:
    theme: dark
    slide-number: true
    chalkboard:
      buttons: true
    preview-links: auto
    lang: zh
    toc: true
    toc-depth: 1
    toc-title: Outline
    logo: ./_extensions/inst/img/ucaslogo.png
    css: ./_extensions/inst/css/revealjs.css
    pointer:
      key: "p"
      color: "#32cd32"
      pointerSize: 18
revealjs-plugins:
  - pointer
filters:
  - d2
---

```{r}
#| echo: false
knitr::opts_chunk$set(echo = TRUE)
source("../../coding/_common.R")
library(tidyverse)
```

## Tidy data

```{r}
#| label: fig-tidy-structure
#| echo: false
#| fig-cap: |
#|   The following three rules make a dataset tidy: variables are columns,
#|   observations are rows, and values are cells.
#| fig-alt: |
#|   Three panels, each representing a tidy data frame. The first panel
#|   shows that each variable is a column. The second panel shows that each
#|   observation is a row. The third panel shows that each value is
#|   a cell.
knitr::include_graphics("../../image/tidy-1.png", dpi = 270)
```

## Simple calculations

```{r}
# Compute rate per 10,000
table1 |>
  mutate(rate = cases / population * 10000)
```

## Simple calculations

```{r}
# Compute total cases per year
table1 |>
  group_by(year) |>
  summarize(total_cases = sum(cases))
```

## Visualization

```{r}
#| fig-width: 5
#| fig-alt: |
#|   This figure shows the number of cases in 1999 and 2000 for
#|   Afghanistan, Brazil, and China, with year on the x-axis and number
#|   of cases on the y-axis. Each point on the plot represents the number
#|   of cases in a given country in a given year. The points for each
#|   country are differentiated from others by color and shape and connected
#|   with a line, resulting in three, non-parallel, non-intersecting lines.
#|   The numbers of cases in China are highest for both 1999 and 2000, with
#|   values above 200,000 for both years. The number of cases in Brazil is
#|   approximately 40,000 in 1999 and approximately 75,000 in 2000. The
#|   numbers of cases in Afghanistan are lowest for both 1999 and 2000, with
#|   values that appear to be very close to 0 on this scale.
# Visualize changes over time
ggplot(table1, aes(x = year, y = cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000
```

## Inspecting the data

```{r}
billboard
```

## Reshaping the data

```{r}
billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )
```

## Reshaping the data

```{r}
billboard_longer <- billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  ) |>
  mutate(
    week = parse_number(week)
  )
billboard_longer
```

## Visualization

```{r}
#| label: fig-billboard-ranks
#| fig-cap: |
#|   A line plot showing how the rank of a song changes over time.
#| fig-alt: |
#|   A line plot with week on the x-axis and rank on the y-axis, where
#|   each line represents a song. Most songs appear to start at a high rank,
#|   rapidly accelerate to a low rank, and then decay again. There are
#|   surprisingly few tracks in the region when week is >20 and rank is
#|   >50.
billboard_longer |>
  ggplot(aes(x = week, y = rank, group = track)) +
  geom_line(alpha = 0.25) +
  scale_y_reverse()
```

## Exercise

```{r}
df <- tribble(
  ~id, ~bp1, ~bp2,
  "A", 100, 120,
  "B", 140, 115,
  "C", 120, 125
)

df |>
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
```

## Pivoting diagram

```{r}
#| label: fig-pivot-variables
#| echo: false
#| fig-cap: |
#|   Columns that are already variables need to be repeated, once for
#|   each column that is pivoted.
#| fig-alt: |
#|   A diagram showing how `pivot_longer()` transforms a simple
#|   dataset, using color to highlight how the values in the `id` column
#|   ("A", "B", "C") are each repeated twice in the output because there are
#|   two columns being pivoted ("bp1" and "bp2").
knitr::include_graphics("../../image/tidy-data/variables.png", dpi = 270)
```

## Inspecting the data

```{r}
who2
```

## Reshaping the data

```{r}
who2 |>
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"),
    names_sep = "_",
    values_to = "count"
  )
```

## Pivoting diagram

```{r}
#| label: fig-pivot-multiple-names
#| echo: false
#| fig-cap: |
#|   Pivoting columns with multiple pieces of information in the names
#|   means that each column name now fills in values in multiple output
#|   columns.
#| fig-alt: |
#|   A diagram that uses color to illustrate how supplying `names_sep`
#|   and multiple `names_to` creates multiple variables in the output.
#|   The input has variable names "x_1" and "y_2" which are split up
#|   by "_" to create name and number columns in the output. This is
#|   similar to the case with a single `names_to`, but what would have been a
#|   single output variable is now separated into multiple variables.
knitr::include_graphics("../../image/tidy-data/multiple-names.png", dpi = 270)
```

## Inspecting the data

```{r}
household
```

## Reshaping the data

```{r}
household |>
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  )
```

## Pivoting diagram

```{r}
#| label: fig-pivot-names-and-values
#| echo: false
#| fig-cap: |
#|   Pivoting with `names_to = c(".value", "num")` splits the column names
#|   into two components: the first part determines the output column
#|   name (`x` or `y`), and the second part determines the value of the
#|   `num` column.
#| fig-alt: |
#|   A diagram that uses color to illustrate how the special ".value"
#|   sentinel works. The input has names "x_1", "x_2", "y_1", and "y_2",
#|   and we want to use the first component ("x", "y") as a variable name
#|   and the second ("1", "2") as the value for a new "num" column.
knitr::include_graphics("../../image/tidy-data/names-and-values.png", dpi = 270)
```

## Inspecting the data

```{r}
cms_patient_experience

cms_patient_experience |>
  distinct(measure_cd, measure_title)
```

## Reshaping the data (wider)

```{r}
cms_patient_experience |>
  pivot_wider(
    names_from = measure_cd,
    values_from = prf_rate
  )
```

## Reshaping the data (wider)

```{r}
cms_patient_experience |>
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )
```

## Exercise

```{r}
df <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1", 100,
  "B", "bp1", 140,
  "B", "bp2", 115,
  "A", "bp2", 120,
  "A", "bp3", 105
)
```

## Exercise

```{r}
df |>
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
```

## Exercise

```{r}
df <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1", 100,
  "A", "bp1", 102,
  "A", "bp2", 120,
  "B", "bp1", 140,
  "B", "bp2", 115
)
```

## Exercise

```{r}
df |>
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
```

## Exercise

```{r}
df |>
  group_by(id, measurement) |>
  summarize(n = n(), .groups = "drop") |>
  filter(n > 1)
```

## Questions and discussion welcome! {.center}

`r rmdify::slideend(wechat = FALSE, type = "public", tel = FALSE, thislink = "https://drwater.rcees.ac.cn/course/public/RWEP/@PUB/SD/")`
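
## Backup: resolving duplicates

The duplicate check in the last exercise finds the `id`/`measurement` pairs that occur more than once. As a possible follow-up, the sketch below collapses those duplicates while widening. It assumes that averaging repeated readings is acceptable for this data; `values_fn = mean` is only one choice, and `df_wide` is a name introduced here for illustration.

```{r}
library(tidyverse)

# Same toy data as the exercise: "A" has two bp1 readings
df <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1", 100,
  "A", "bp1", 102,
  "A", "bp2", 120,
  "B", "bp1", 140,
  "B", "bp2", 115
)

# Assumption: repeated readings may be averaged.
# values_fn is applied to each id/measurement cell before widening.
df_wide <- df |>
  pivot_wider(
    names_from = measurement,
    values_from = value,
    values_fn = mean
  )

df_wide
```

With this choice, the two `bp1` readings for `A` (100 and 102) collapse to 101, and the result has ordinary numeric columns instead of list-columns.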