---
title: "Big Data Analysis Tools"
subtitle: 《区域水环境污染数据分析实践》<br>Data analysis practice of regional water environment pollution
author: 苏命、王为东<br>College of Resources and Environment, University of Chinese Academy of Sciences<br>Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences
date: today
lang: zh
format:
  revealjs:
    theme: dark
    slide-number: true
    chalkboard:
      buttons: true
    preview-links: auto
    lang: zh
    toc: true
    toc-depth: 1
    toc-title: Outline
    logo: ./_extensions/inst/img/ucaslogo.png
    css: ./_extensions/inst/css/revealjs.css
    pointer:
      key: "p"
      color: "#32cd32"
      pointerSize: 18
revealjs-plugins:
  - pointer
filters:
  - d2
---

```{r}
#| echo: false
knitr::opts_chunk$set(echo = TRUE)
source("../../coding/_common.R")
library(nycflights13)
library(tidyverse)
```

# Regular expressions

## Matching digits and letters
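
### Matching digits:

- `\d`: matches any single digit character.
- `\d+`: matches one or more digit characters.
- `[0-9]`: matches any single digit from 0 to 9.

### Matching word characters:

- `\w`: matches any single letter, digit, or underscore.
- `\w+`: matches one or more letters, digits, or underscores.

## Matching whitespace and specific characters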
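
### Matching whitespace:

- `\s`: matches any single whitespace character, including spaces, tabs, and newlines.
- `\s+`: matches one or more whitespace characters.

### Matching specific characters:

- `[abc]`: matches any one of the characters a, b, or c.
- `[a-z]`: matches any lowercase letter.
- `[A-Z]`: matches any uppercase letter.
- `[0-9]`: matches any digit.

## Repetition counts and anchors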
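
### Matching a repetition count:

- `{n}`: matches the preceding element exactly n times.
- `{n,}`: matches the preceding element at least n times.
- `{n,m}`: matches the preceding element at least n times but no more than m times.

### Matching boundaries:

- `^`: matches the start of the string.
- `$`: matches the end of the string.

## Special characters and quantifiers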
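
### Matching special characters:

- `\`: escapes a special character so that it is matched literally.
- `.`: matches any single character (by default, any character except a newline).
- `|`: alternation ("or"); matches any one of two or more expressions.

### Quantifiers:

- `*`: matches the preceding element zero or more times.
- `+`: matches the preceding element one or more times.
- `?`: matches the preceding element zero or one time.

## Grouping and predefined character classes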
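
### Grouping and capturing:

- `()`: groups a sequence of patterns into a single unit, so that quantifiers and other special characters apply to the whole group.

### Predefined character classes:

- `\d`: any digit, equivalent to `[0-9]`.
- `\w`: any letter, digit, or underscore, equivalent to `[a-zA-Z0-9_]` for ASCII text.
- `\s`: any whitespace character, equivalent to `[ \t\n\r\f\v]`.

## Examples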

```{r}
library(babynames)
(x <- c("apple", "apppple", "abc123def"))
x[str_detect(x, "[0-9]")]     # contains a digit
x[str_detect(x, "abc[0-9]+")] # "abc" followed by one or more digits
x[str_detect(x, "pp")]        # contains "pp"
x[str_detect(x, "p{4}")]      # contains four consecutive "p"
x[str_detect(x, "ap*")]       # "a" followed by zero or more "p"
x[str_detect(x, "app*")]      # "ap" followed by zero or more "p"
x[str_detect(x, "a..le")]     # "a", any two characters, then "le"
```
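
The same patterns can also be tried outside R. The block below is a minimal sketch using `grep -E` (POSIX extended regular expressions), assuming a POSIX shell with `grep` available. The first three strings are the ones from the R example; the four names in the last command are made up for illustration. POSIX extended regular expressions do not define `\d`, so `[0-9]` is used instead.

```bash
# Sample strings taken from the R example above.
words="$(printf '%s\n' apple apppple abc123def)"

# [0-9] matches any digit; grep -E prints every line that contains a match.
with_digit="$(printf '%s\n' "${words}" | grep -E '[0-9]')"

# p{4} needs four consecutive "p" characters.
four_p="$(printf '%s\n' "${words}" | grep -E 'p{4}')"

# ^ and $ anchor the match: "a", any two characters, "le", and nothing else.
a_le="$(printf '%s\n' "${words}" | grep -E '^a..le$')"

# Alternation: names that contain "ar" OR end in "ry"; -c counts matching lines.
name_count="$(printf '%s\n' Mary Carl Henry Anna | grep -cE 'ar|ry$')"

echo "digit: ${with_digit}; pppp: ${four_p}; a..le: ${a_le}; names: ${name_count}"
```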

## Exercise

Find the rows of `babynames` whose name contains "ar".

```{r}
#| echo: false
babynames |>
  filter(str_detect(name, "ar"))
```

## Exercise

Find the rows of `babynames` whose name contains "ar" or ends with "ry".

```{r}
#| echo: false
babynames |>
  filter(str_detect(name, "ar|ry$"))
```

# Linux basics and development tools

## SSH - secure remote connection

```bash
# Connect to a remote server
ssh username@remote.server.com

# Connect on a specific port
ssh -p 2222 username@remote.server.com

# Run a command on the remote host
ssh username@server "ls -l /tmp"
```

**Use cases**:

- Managing cloud servers remotely
- Running automation scripts

## SSH tools on Windows - PuTTY
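
![PuTTY interface](https://www.putty.org/putty.png)

```text
Key features:
1. Saved session profiles (IP / port / authentication settings)
2. SSH / Telnet / Serial connections
3. Public-key authentication (together with Pageant)
4. Port forwarding (SSH tunnels)
```

**Use cases**:

- Connecting to Linux servers for administration
- Building SSH tunnels to reach intranet resources

## SCP - secure file transfer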

```bash
# Copy a local file to the remote host
scp file.txt username@remote:/path/to/dest

# Copy a remote file to the current local directory
scp username@remote:/path/file.txt .

# Copy a directory recursively
scp -r dir/ username@remote:/path/
```

**Use cases**:

- Deploying website files to a production environment
- Backing up remote log files

## SCP tools on Windows - WinSCP
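
![WinSCP interface](https://winscp.net/eng/data/media/screenshots/commander.png)

```text
Key features:
- Graphical SFTP/SCP client
- Integrates with PuTTY
- Drag-and-drop support
- Saved connections
- Batch scripting
```

**Use cases**:

- Managing server files visually
- Synchronizing code between the local machine and a server

## Bash alternatives on Windows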

**1. Git Bash**

```text
Bundles common Linux commands:
ls, grep, ssh, scp, awk, and more
```

**2. WSL (Windows Subsystem for Linux)**

```bash
# A full Linux environment
sudo apt install python3
```

**3. Cygwin**

```text
POSIX-compatible environment
Installer: cygwin.com/setup-x86_64.exe
```

## Recommended terminal tools for Windows

| Tool | Features | Best suited for |
|------|------|----------|
| **Windows Terminal** | Multiple tabs, color support | Everyday development |
| **MobaXterm** | Built-in X11 server, plugins | Remote development |
| **Tabby** | Cross-platform, many themes | Users on several platforms |
| **ConEmu** | Highly customizable | Advanced users |

## Cross-platform development tooling
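
**Best practices**:

1. Use VS Code as the code editor on every platform
   - together with the Remote-SSH extension
2. Use DBeaver or DataGrip as the database tool
3. Use a Git GUI for version control (GitKraken or Fork)

```text
Suggestions for keeping environments consistent:
- Develop inside WSL2
- Use the same .ssh/config file on every machine
- Share the same IDE configuration
```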
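
As an illustration of the shared `.ssh/config` suggestion, the following sketch writes a single host entry to a temporary file. The alias `labserver` and the key path `~/.ssh/id_ed25519` are hypothetical; the host name, user, and port reuse the placeholders from the SSH slide. In real use the entry would live in `~/.ssh/config`, after which `ssh labserver` and `scp file.txt labserver:/path/to/dest` would pick up the same settings on every machine.

```bash
# Write a sample entry to a temporary file instead of the real ~/.ssh/config.
config_file="$(mktemp)"
cat >"${config_file}" <<'EOF'
# Hypothetical alias; replace HostName, User, Port and IdentityFile with your own.
Host labserver
    HostName remote.server.com
    User username
    Port 2222
    IdentityFile ~/.ssh/id_ed25519
EOF

# Count the host entries that were written.
host_entries="$(grep -c '^Host ' "${config_file}")"
echo "host entries: ${host_entries}"
```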

## grep - text search

```bash
# Basic search
grep "error" logfile.txt

# Search a directory recursively
grep -r "function" src/

# Show line numbers
grep -n "TODO" *.py

# Invert the match (print the lines that do not match)
grep -v "success" results.log
```
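
To make the options concrete, here is a small self-contained sketch that builds a throwaway log file and applies `-c`, `-i`, `-n`, and `-v` to it. The log lines are invented for illustration only.

```bash
# Build a small hypothetical log file so the commands can be run anywhere.
log="$(mktemp)"
cat >"${log}" <<'EOF'
2024-01-01 10:00:01 INFO  job started
2024-01-01 10:00:05 ERROR disk quota exceeded
2024-01-01 10:00:09 INFO  retrying
2024-01-01 10:00:12 error connection reset
2024-01-01 10:00:15 INFO  job finished: success
EOF

# -c counts matching lines; -i ignores case, so "ERROR" and "error" both count.
error_count="$(grep -ci 'error' "${log}")"

# -n prefixes each match with its line number; cut keeps only that number.
first_error_line="$(grep -ni 'error' "${log}" | head -n 1 | cut -d: -f1)"

# -v inverts the match: count the lines that do NOT mention "success".
not_success="$(grep -vc 'success' "${log}")"

echo "errors=${error_count} first=${first_error_line} not_success=${not_success}"
rm -f "${log}"
```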
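
**Use cases**:

- Finding error messages in log files
- Searching a code base for specific patterns

## sed - stream editor

```bash
# Replace text (the result goes to stdout; the file itself is unchanged)
sed 's/old/new/g' file.txt

# Delete empty lines
sed '/^$/d' file.txt

# Edit the file in place (GNU sed syntax; BSD/macOS sed needs: sed -i '' ...)
sed -i 's/python/python3/g' script.sh
```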
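
In-place editing differs between GNU sed and BSD/macOS sed. One way around this, sketched below on a made-up file, is to attach a backup suffix directly to `-i`, a form that both variants accept.

```bash
# Hypothetical input file; the content is made up for illustration.
script="$(mktemp)"
printf '%s\n' '#!/bin/sh' '' 'python analyse.py' '' 'python plot.py' >"${script}"

# -i.bak edits in place and keeps a backup copy next to the file.
# The first expression deletes empty lines, the second replaces the interpreter name.
sed -i.bak -e '/^$/d' -e 's/python/python3/g' "${script}"

line_count="$(wc -l <"${script}" | tr -d ' ')"
python3_count="$(grep -c '^python3 ' "${script}")"
echo "lines=${line_count} python3=${python3_count}"
rm -f "${script}" "${script}.bak"
```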

**Use cases**:

- Replacing a string in many files at once
- Cleaning up irregular formatting in data files

## awk - text processing

```bash
# Print specific columns (fields are split on whitespace by default)
awk '{print $1,$3}' data.txt

# Filter rows by a condition
awk '$3 > 100 {print $0}' sales.txt

# Use a custom field separator
awk -F',' '{print $2}' users.csv
```
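
The status-code use case can be sketched as follows. The log layout (method, path, status, bytes) is a simplified, hypothetical format chosen so that the status code is field 3; real server logs usually have more fields.

```bash
# Hypothetical access log in a simplified "method path status bytes" layout.
access_log="$(mktemp)"
cat >"${access_log}" <<'EOF'
GET /index.html 200 512
GET /missing 404 128
POST /api/data 200 2048
GET /index.html 200 512
GET /admin 403 64
EOF

# count[$3]++ tallies field 3 (the status code) in an associative array;
# the END block prints one "code count" pair per distinct status.
# The order of "for (code in count)" is unspecified, so the result is sorted.
status_summary="$(awk '{count[$3]++} END {for (code in count) print code, count[code]}' "${access_log}" | sort)"

# A condition before the action filters rows: sum the bytes of status-200 requests.
ok_bytes="$(awk '$3 == 200 {total += $4} END {print total}' "${access_log}")"

echo "${status_summary}"
echo "bytes for 200 responses: ${ok_bytes}"
rm -f "${access_log}"
```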

**Use cases**:

- Counting status codes in server logs
- Processing CSV data

## find & xargs - finding and processing files

```bash
# Find and delete
find . -name "*.tmp" -delete

# Find and process
find /var/log -name "*.log" | xargs ls -lh

# A more complex combination
find src/ -type f -name "*.js" | xargs grep -l "deprecated"
```
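
Plain `find | xargs` splits file names on whitespace, which can misbehave when a path contains spaces. A common precaution, sketched here on a hypothetical project tree in a temporary directory, is to pair `find -print0` with `xargs -0`, and to preview a `-delete` with `-print` first.

```bash
# Hypothetical project tree created in a temporary directory.
proj="$(mktemp -d)"
mkdir -p "${proj}/src/old code"
echo 'call(); // deprecated' >"${proj}/src/old code/legacy api.js"
echo 'call();' >"${proj}/src/main.js"
echo 'scratch' >"${proj}/src/cache.tmp"

# -print0 / -0 separate file names with NUL bytes, so a name containing
# spaces is passed to grep as a single argument.
hits="$(find "${proj}/src" -type f -name '*.js' -print0 | xargs -0 grep -l 'deprecated')"

# Preview what would be removed with -print, then delete.
find "${proj}/src" -name '*.tmp' -print
find "${proj}/src" -name '*.tmp' -delete
remaining_tmp="$(find "${proj}/src" -name '*.tmp' | wc -l | tr -d ' ')"

echo "deprecated in: ${hits#"${proj}/"}"
rm -rf "${proj}"
```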

**Use cases**:

- Cleaning up old temporary files
- Processing project files in bulk

## Code editors

- **VS Code**: a modern, lightweight editor
  - Rich extension ecosystem
  - Built-in Git support
- **Vim**: an efficient terminal editor

  ```bash
  vim file.txt
  ```

- **RStudio**: an integrated environment for R
- **JupyterLab**: an interactive notebook environment

## Git version control

```bash
# Basic workflow
git clone https://repo.url
git add .
git commit -m "message"
git push

# Branch management
git checkout -b new-feature   # create a feature branch and switch to it
git merge main                # merge the latest main into the current branch
```
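
The branch commands can be tried end to end in a throwaway repository. This is a sketch only, assuming `git` is installed; the user name, e-mail address, file name, and commit messages are hypothetical. It develops on `new-feature` and then merges that branch back into `main`, which is the opposite direction from `git merge main` above.

```bash
# Create a throwaway repository in a temporary directory.
repo="$(mktemp -d)"
cd "${repo}"
git init --quiet
git symbolic-ref HEAD refs/heads/main   # name the first branch "main" on any Git version
# Hypothetical identity, set only for this repository.
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "v1" >notes.txt
git add notes.txt
git commit --quiet -m "Add notes"

# Develop on a feature branch.
git checkout --quiet -b new-feature
echo "v2" >>notes.txt
git commit --quiet -am "Extend notes"

# Switch back and merge the feature branch into main.
git checkout --quiet main
git merge --quiet new-feature

commit_count="$(git rev-list --count main)"
echo "commits on main: ${commit_count}"
cd - >/dev/null
rm -rf "${repo}"
```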

**Use cases**:

- Collaborative development in a team
- Rolling back versions and tracking down problems

## MySQL database

```sql
-- Basic query
SELECT * FROM users WHERE age > 18;

-- Create a table
CREATE TABLE products (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  price DECIMAL(10, 2)  -- needed by the UPDATE below
);

-- Insert and update data
INSERT INTO products (id, name) VALUES (1, 'Laptop');
UPDATE products SET price = 999 WHERE id = 1;
```

**Use cases**:

- Data storage for web applications
- Data analysis and report generation

# Obtaining public data

## Case study: national air quality data

```{bash}
#| eval: false
#!/bin/bash
logfn="${HOME}/service/log/nationalairquality/nationalairquality.log"
workdirfn="${HOME}/service/nationalairquality/"
mkdir -p "$(dirname "${logfn}")"
touch "${logfn}"

echo "$(date '+%Y-%m-%d %H:%M:%S'): downloading air quality data" >>"${logfn}"

declare -a citynames

# City names are passed to the API unchanged, so they stay in Chinese.
citynames=(北京市 石家庄市 秦皇岛市)

# File name pattern: year, month, day, hour.
jsonfn="${workdirfn}/nationalairquality_$(date '+%Y%m%d%H').json"
# NOTE: this second assignment overrides the one above, so the file is
# written to the current working directory.
jsonfn="nationalairquality_$(date '+%Y%m%d%H').json"
# [[ -f "${jsonfn}" ]] && rm "${jsonfn}"
echo "[" >"${jsonfn}"

for cityname in "${citynames[@]}"; do
  echo "Downloading air quality data for ${cityname}..." >>"${logfn}"
  # Each response is a JSON array: strip its enclosing [ ] lines and add a
  # comma after every object so that all cities can share one array.
  curl "https://air.cnemc.cn:18007/CityData/GetAQIDataPublishLive?cityName=${cityname}" \
    -H 'Accept: */*' \
    -H 'Accept-Language: en-US,en;q=0.9' \
    -H 'Connection: keep-alive' \
    -H 'Referer: https://air.cnemc.cn:18007/' \
    -H 'Sec-Fetch-Dest: empty' \
    -H 'Sec-Fetch-Mode: cors' \
    -H 'Sec-Fetch-Site: same-origin' \
    -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36' \
    -H 'X-Requested-With: XMLHttpRequest' \
    -H 'sec-ch-ua: "Google Chrome";v="119", "Chromium";v="119", "Not?A_Brand";v="24"' \
    -H 'sec-ch-ua-mobile: ?0' \
    -H 'sec-ch-ua-platform: "macOS"' \
    --compressed --silent | jq '.' | sed -e '1d' -e '$d' -e 's/^\( *}\)$/\1,/' >>"${jsonfn}"
done
# Remove the comma after the last object. The -i.bak form works with both
# GNU sed and BSD/macOS sed (sed -i '' is BSD-only).
sed -i.bak '$s/},/}/' "${jsonfn}" && rm -f "${jsonfn}.bak"
echo "]" >>"${jsonfn}"

echo "Air quality data download finished!" >>"${logfn}"
echo "Uploading to the database..." >>"${logfn}"
/home/ming/bin/getnationalairquality.R "${jsonfn}"
echo "Database upload finished!" >>"${logfn}"
```
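
The least obvious step in the script is how the per-city responses are stitched into one JSON array with `sed`. The sketch below replays that step offline on two made-up responses. It assumes each response is a pretty-printed JSON array of flat objects; the `city` field and its values are hypothetical, and no network access or `jq` is needed.

```bash
# Two hypothetical API responses, already pretty-printed.
resp1='[
  {
    "city": "A"
  }
]'
resp2='[
  {
    "city": "B"
  },
  {
    "city": "C"
  }
]'

merged="$(mktemp)"
echo "[" >"${merged}"
for resp in "${resp1}" "${resp2}"; do
  # 1d and $d drop the enclosing "[" and "]" lines; the substitution adds a
  # trailing comma to every line that consists only of a closing brace.
  printf '%s\n' "${resp}" | sed -e '1d' -e '$d' -e 's/^\( *}\)$/\1,/' >>"${merged}"
done
# Remove the comma after the very last object, then close the array.
sed '$s/},/}/' "${merged}" >"${merged}.tmp" && mv "${merged}.tmp" "${merged}"
echo "]" >>"${merged}"

object_count="$(grep -c '"city"' "${merged}")"
separator_count="$(grep -c '^ *},$' "${merged}")"
echo "objects=${object_count} separators=${separator_count}"
```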

## R code

```{r}
#| eval: false
#| echo: true

#!/usr/bin/Rscript

# Path of the JSON file, passed as the first command-line argument
jsonfn <- commandArgs(TRUE)[1]
# jsonfn <- "~/nationalairquality_2023191219.json"
jsondf <- jsonlite::fromJSON(jsonfn, flatten = TRUE)
# metadf <- tibble::as_tibble(jsondf) |>
#   dplyr::select(site = StationCode, name = PositionName, Area, lon = Longitude, lat = Latitude) |>
#   dplyr::mutate(lon = as.numeric(lon), lat = as.numeric(lat))
# DBI::dbWriteTable(conn, "metadf", metadf, overwrite = TRUE, row.names = FALSE)

# Rename the pollutant columns so that each name carries its unit
airqualitydf <- tibble::as_tibble(jsondf) |>
  dplyr::select(
    datetime = TimePoint,
    site = StationCode,
    `CO_mg/m3` = CO,
    `CO_24h_mg/m3` = CO_24h,
    `NO2_μg/m3` = NO2,
    `NO2_24h_μg/m3` = NO2_24h,
    `O3_μg/m3` = O3,
    `O3_24h_μg/m3` = O3_24h,
    `O3_8h_μg/m3` = O3_8h,
    `O3_8h_24h_μg/m3` = O3_8h_24h,
    `PM10_μg/m3` = PM10,
    `PM10_24h_μg/m3` = PM10_24h,
    `PM2.5_μg/m3` = PM2_5,
    `PM2.5_24h_μg/m3` = PM2_5_24h,
    `SO2_μg/m3` = SO2,
    `SO2_24h_μg/m3` = SO2_24h,
    `NO_μg/m3` = NO,
    `NO_24h_μg/m3` = NO_24h,
    `NOx_μg/m3` = NOx,
    `NOx_24h_μg/m3` = NOx_24h,
    AQI,
    COLevel,
    NO2Level,
    O3Level,
    O3_8hLevel,
    PM10Level,
    PM2_5Level,
    SO2Level,
    PrimaryPollutant,
    Quality,
    Unheathful
  ) |>
  # Convert the measurement columns from text to numbers
  dplyr::mutate(dplyr::across(
    `CO_mg/m3`:AQI,
    ~ round(readr::parse_number(.x), 4)
  )) |>
  dplyr::mutate(datetime = lubridate::as_datetime(datetime))

# Append the new records to the database table
conn <- cctdb::get_dbconn("nationalairquality", writepermission = TRUE)
DBI::dbWriteTable(
  conn,
  "airqualitydf",
  airqualitydf,
  append = TRUE,
  row.names = FALSE
)
DBI::dbDisconnect(conn)
```

## Discussion welcome! {.center}

`r rmdify::slideend(wechat = FALSE, type = "public", tel = FALSE, thislink = "https://drc.drwater.net/course/public/RWEP/PUB/SD/")`