--- title: "Data Transform" format: dwsd-revealjs: logo: _extensions/drwater/dwsd/inst/ucaslogo.png --- ```{r} #| include: false #| cache: false lang <- "cn" require(tidyverse) require(learnr) ``` ```{r} #| echo: false source("../../coding/_common.R") library(nycflights13) library(tidyverse) ``` ## `tidyverse`风格数据分析总体流程 ![](../../image/data-science/transform.png) ## [dplyr cheatsheet](../../image/cheatsheet/data-transformation.pdf) ```{r} #| echo: false dwfun::ggsavep("../../image/cheatsheet/data-transformation.svg", loadit = TRUE) ``` ## 查看数据 ```{r} flights ``` ## 选择列 ```{r} #| results: false flights |> select(year, month, day) ``` ## 选择列 ```{r} #| results: false flights |> select(year:day) ``` ## 选择列 ```{r} flights |> select(3:5) ``` ## 选择列 ```{r} flights |> select(!year:day) ``` ## 选择列 ```{r} flights |> select(-(year:day)) ``` ## 选择列 ```{r} flights |> select(where(is.character)) ``` ## 选择列 ```{r} flights |> select(!where(is.character)) |> select(contains("_")) ``` ## 选择列 ```{r} flights |> select(tail_num = tailnum) ``` ## 选择列 ```{r} flights |> select(air_time, everything()) ``` ## 重命名 ```{r} flights |> rename(tail_num = tailnum) ``` ## 重命名 ```{r} flights |> rename(年份 = 1) |> rename(月份 = 2) ``` ## 重命名 ```{r} flights |> select(1:4) |> head(n = 3) # 重命名 flights |> select(1:4) |> head(n = 3) |> rename_all(~ c("c1", "c2", "c3", "c4")) ``` ## 重命名 ```{r} flights |> select(1:4) |> head(n = 3) # 重命名 flights |> select(1:4) |> head(n = 3) |> rename_all(toupper) ``` ## 重命名 ```{r} flights |> select(1:4) |> head(n = 3) # 重命名 flights |> select(1:4) |> head(n = 3) |> rename_all(~ paste0(toupper(.), "_NEW")) ``` ## 练习 将含有下划线的列名中的下划线去掉。 ```{r} flights |> select(1:4) |> head(n = 3) ``` ## 练习 将`airqualitydf`中列名的单位信息去除(前5列)。 ```{r} airqualitydf <- readxl::read_xlsx("../../data/airquality.xlsx", sheet = 2) airqualitydf |> select(1:5) ``` ## `filter` ```{r} flights |> filter(dep_delay > 120) ``` ## filter 练习 Flights that departed on January 1. ```{r} #| echo: false flights |> filter(month == 1 & day == 1) ``` ## filter 练习 Select flights that departed in January or February ```{r} #| echo: false flights |> filter(month %in% c(1, 2)) ``` ## filter 练习 ```{r} jan1 <- flights |> filter(month == 1 & day == 1) ``` ## filter ```{r} #| error: true #| eval: false flights |> filter(month = 1) ``` ## filter ```{r} flights |> filter(month == 1 | 2) ``` ## 排序 ```{r} flights |> arrange(year, month, day, dep_time) ``` ## 排序 ```{r} flights |> arrange(desc(dep_delay)) ``` ## slice ```{r} flights |> head(n = 5) flights |> slice(1:5) ``` ## slice ```{r} flights |> slice_max(dep_delay, n = 5) ``` ## slice ```{r} flights |> slice_min(dep_delay, prop = 0.005) ``` ## 排序练习 根据`origin`、`dest`、`air_time`倒序排序。 ```{r} #| echo: false flights |> arrange(origin, dest, desc(air_time)) |> select(origin, dest, air_time, everything()) ``` ## 去重 ```{r} # Remove duplicate rows, if any flights |> distinct() ``` ## 去重 ```{r} # Find all unique origin and destination pairs flights |> distinct(origin, dest) ``` ## 去重 ```{r} flights |> distinct(origin, dest, .keep_all = TRUE) ``` ## 欢迎讨论!{.center} ```{r} #| results: 'asis' #| echo: false rmdify::slideend( wechat = FALSE, type = "public", tel = FALSE, thislink = "../" ) ```