update
1
SD/5.3_大数据分析工具/_extensions
Symbolic link
@@ -0,0 +1 @@
../../_extensions
523
SD/5.3_大数据分析工具/index.qmd
Normal file
@@ -0,0 +1,523 @@
---
title: "Big Data Analysis Tools"
subtitle: "Data Analysis Practice of Regional Water Environment Pollution"
author: "Su Ming (苏命), Wang Weidong (王为东)<br>College of Resources and Environment, University of Chinese Academy of Sciences<br>Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences"
date: today
lang: zh
format:
  revealjs:
    theme: dark
    slide-number: true
    chalkboard:
      buttons: true
    preview-links: auto
    lang: zh
    toc: true
    toc-depth: 1
    toc-title: Outline
    logo: ./_extensions/inst/img/ucaslogo.png
    css: ./_extensions/inst/css/revealjs.css
    pointer:
      key: "p"
      color: "#32cd32"
      pointerSize: 18
revealjs-plugins:
  - pointer
filters:
  - d2
---

```{r}
#| echo: false
knitr::opts_chunk$set(echo = TRUE)
source("../../coding/_common.R")
library(nycflights13)
library(tidyverse)
```

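# Regular expressions

## Matching digits and word characters

### Matching digits:

- `\d`: matches any digit character.
- `\d+`: matches one or more digit characters.
- `[0-9]`: matches any single digit from 0 to 9.

### Matching word characters:

- `\w`: matches any letter, digit, or underscore.
- `\w+`: matches one or more letters, digits, or underscores.

## Matching whitespace and specific characters

### Matching whitespace:

- `\s`: matches any whitespace character, including spaces, tabs, and newlines.
- `\s+`: matches one or more whitespace characters.

### Matching specific characters:

- `[abc]`: matches any one of the characters a, b, or c.
- `[a-z]`: matches any lowercase letter.
- `[A-Z]`: matches any uppercase letter.
- `[0-9]`: matches any digit.

## Matching repetition and boundaries

### Matching repetition counts:

- `{n}`: matches the preceding element exactly n times.
- `{n,}`: matches the preceding element at least n times.
- `{n,m}`: matches the preceding element at least n times, but no more than m times.

### Matching boundaries:

- `^`: matches the start of the string.
- `$`: matches the end of the string.

## Matching special characters and quantifiers

### Matching special characters:

- `\`: escapes a special character so that it is matched literally.
- `.`: matches any single character (by default, any character except a newline).
- `|`: alternation ("or"); matches any one of two or more expressions.

### Quantifiers:

- `*`: matches the preceding element zero or more times.
- `+`: matches the preceding element one or more times.
- `?`: matches the preceding element zero or one time.

## Grouping and predefined character classes

### Grouping and capturing:

- `()`: groups a sequence of patterns into a single unit and captures the matched text; a group can be combined with quantifiers.

### Predefined character classes:

- `\d`: any digit; for ASCII text, equivalent to `[0-9]`.
- `\w`: any letter, digit, or underscore; for ASCII text, equivalent to `[a-zA-Z0-9_]`.
- `\s`: any whitespace character; equivalent to `[ \t\n\r\f\v]`.

In an R string literal the backslash itself must be escaped, so `\d` is written as `"\\d"`.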

## Examples

```{r}
library(babynames)
(x <- c("apple", "apppple", "abc123def"))
x[str_detect(x, "[0-9]")]      # contains a digit
x[str_detect(x, "abc[0-9]+")]  # "abc" followed by one or more digits
x[str_detect(x, "pp")]         # contains "pp"
x[str_detect(x, "p{4}")]       # contains four consecutive "p"
x[str_detect(x, "ap*")]        # "a" followed by zero or more "p"
x[str_detect(x, "app*")]       # "ap" followed by zero or more "p"
x[str_detect(x, "a..le")]      # "a", any two characters, then "le"
```

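
## Trying the patterns in a shell

The same patterns can be tried outside R. The following is a minimal sketch, assuming a POSIX shell and a `grep` that supports `-E` and `-o` (GNU and BSD grep both do); the variable names are illustrative only. POSIX extended regular expressions do not define `\d`, so the sketch uses `[0-9]` instead.

```bash
# Same sample strings as the R example, one per line
words="apple
apppple
abc123def"

# Lines that contain at least one digit
with_digit="$(printf '%s\n' "$words" | grep -E '[0-9]')"

# -o prints only the matched text: here, the run of digits
digits_only="$(printf '%s\n' "$words" | grep -Eo '[0-9]+')"

# Lines that contain four consecutive "p"
four_p="$(printf '%s\n' "$words" | grep -E 'p{4}')"

# Lines that start with "a" and end with "e" (anchors ^ and $)
a_to_e="$(printf '%s\n' "$words" | grep -E '^a.*e$')"

printf '%s\n' "$with_digit" "$digits_only" "$four_p" "$a_to_e"
```

Each variable holds the matching lines (or, with `-o`, the matched text), so the effect of a quantifier or an anchor can be checked one pattern at a time.
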

## Exercise

Find the rows in `babynames` whose name contains "ar".

```{r}
#| echo: false
babynames |>
  filter(str_detect(name, "ar"))
```

## Exercise

Find the rows in `babynames` whose name contains "ar" or ends with "ry".

```{r}
#| echo: false
babynames |>
  filter(str_detect(name, "ar|ry$"))
```


# Linux basics and development tools

## SSH - secure remote connection

```bash
# Connect to a remote server
ssh username@remote.server.com

# Connect on a specific port
ssh -p 2222 username@remote.server.com

# Run a command on the remote host
ssh username@server "ls -l /tmp"
```

**Use cases**:

- Managing cloud servers remotely
- Running automation scripts


## SSH on Windows - PuTTY



**Main features**:

1. Saves session settings (IP, port, authentication details)
2. Supports SSH, Telnet, and serial connections
3. Public-key authentication (together with Pageant)
4. Port forwarding (SSH tunnels)

**Use cases**:

- Connecting to Linux servers for administration
- Setting up SSH tunnels to reach intranet resources


## SCP - secure file transfer

```bash
# Copy a local file to the remote host
scp file.txt username@remote:/path/to/dest

# Copy a file from the remote host into the current local directory
scp username@remote:/path/file.txt .

# Copy a directory recursively
scp -r dir/ username@remote:/path/
```

**Use cases**:

- Deploying website files to production
- Backing up remote log files


## SCP on Windows - WinSCP



**Main features**:

- Graphical SFTP/SCP client
- Integrates with PuTTY
- Supports drag and drop
- Saves frequently used connections
- Batch scripting

**Use cases**:

- Managing server files visually
- Synchronizing code between a local machine and a server


## Bash alternatives on Windows

**1. Git Bash**

Bundles common Linux commands such as `ls`, `grep`, `ssh`, `scp`, and `awk`.

**2. WSL (Windows Subsystem for Linux)**

```bash
# A full Linux environment (apt assumes a Debian/Ubuntu distribution)
sudo apt install python3
```

**3. Cygwin**

A POSIX-compatible environment; the installer is available at `cygwin.com/setup-x86_64.exe`.


## Recommended terminal tools for Windows

| Tool | Features | Best for |
|------|------|----------|
| **Windows Terminal** | Multiple tabs, color support | Everyday development |
| **MobaXterm** | Built-in X11 server, plugins | Remote development |
| **Tabby** | Cross-platform, many themes | Multi-platform users |
| **ConEmu** | Highly customizable | Advanced users |


## Cross-platform development setup

**Best practices**:

1. Use VS Code as the code editor on every platform
    - Pair it with the Remote-SSH extension
2. Use DBeaver or DataGrip as the database tool
3. Use a Git GUI (GitKraken or Fork) for version control

**Suggestions for keeping environments consistent**:

- Use WSL2 as the development environment on Windows
- Use the same `.ssh/config` file on every machine
- Share the same IDE configuration

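
## A shared `.ssh/config`: sketch

A shared `.ssh/config` could look like the sketch below. It reuses the placeholder host, user, and port from the SSH slide; the alias `labserver` and the key path are hypothetical, and the file is written to a temporary directory so that nothing in `~/.ssh` is touched.

```bash
# Write a sample client configuration into a scratch directory
sshdir="$(mktemp -d)"
cat > "$sshdir/config" <<'EOF'
# Hypothetical alias; replace HostName, User, Port, and IdentityFile with your own values
Host labserver
    HostName remote.server.com
    User username
    Port 2222
    IdentityFile ~/.ssh/id_ed25519
EOF

# Keep the file private; OpenSSH rejects a ~/.ssh/config that other users can write to
chmod 600 "$sshdir/config"

# With this file in place, the long command shrinks to an alias:
#   ssh -F "$sshdir/config" labserver
```

Copying the same file to `~/.ssh/config` on each machine (Linux, macOS, WSL, or Git Bash) gives every environment the same host aliases.
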

## grep - text search

```bash
# Basic search
grep "error" logfile.txt

# Search a directory recursively
grep -r "function" src/

# Show line numbers
grep -n "TODO" *.py

# Invert the match (print the lines that do NOT match)
grep -v "success" results.log
```

**Use cases**:

- Finding error messages in logs
- Looking for specific patterns in a codebase

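
## grep: worked example

The options above can be combined on a small log file. This is a minimal sketch; the log lines are made-up sample data and the variable names are illustrative.

```bash
# Create a scratch log file with sample entries
logfile="$(mktemp)"
cat > "$logfile" <<'EOF'
2024-01-01 10:00:01 INFO service started
2024-01-01 10:00:05 ERROR disk quota exceeded
2024-01-01 10:00:09 INFO request success
2024-01-01 10:00:12 ERROR connection timeout
EOF

# -c counts matching lines instead of printing them
error_count="$(grep -c 'ERROR' "$logfile")"

# -n prefixes each match with its line number; keep only the first match
first_error="$(grep -n 'ERROR' "$logfile" | head -n 1)"

# -v inverts the match; with -c it counts the lines without "success"
not_success="$(grep -vc 'success' "$logfile")"

# -E enables extended regular expressions, for example alternation
problems="$(grep -Ec 'quota|timeout' "$logfile")"

echo "errors=$error_count first=$first_error not_success=$not_success problems=$problems"
```

Capturing counts in variables like this makes it easy to use them later in a script, for example to send an alert when `error_count` is above a threshold.
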

## sed - stream editor

```bash
# Replace text (the result goes to stdout; file.txt itself is unchanged)
sed 's/old/new/g' file.txt

# Delete empty lines
sed '/^$/d' file.txt

# Edit the file in place (GNU sed syntax; BSD/macOS sed needs: sed -i '' ...)
sed -i 's/python/python3/g' script.sh
```

**Use cases**:

- Batch-replacing strings inside files
- Cleaning up inconsistent formatting in data files

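
## sed: worked example

The commands above can be chained. The sketch below uses a made-up data file and illustrative variable names; it also shows one way to edit in place that, as far as I know, behaves the same with GNU and BSD sed, namely attaching a backup suffix directly to `-i`.

```bash
# Scratch file with a header, two records, and stray empty lines
datafile="$(mktemp)"
printf 'site,lang\n\nA,python\n\nB,python\n' > "$datafile"

# Chain two edits with -e: drop empty lines, then replace text.
# The result goes to stdout, so the file itself is unchanged.
cleaned="$(sed -e '/^$/d' -e 's/python/python3/g' "$datafile")"
printf '%s\n' "$cleaned"

# In-place edit with a backup copy: -i.bak (no space) works with GNU and BSD sed
sed -i.bak -e '/^$/d' "$datafile"

# grep -c '' counts every line, so this is the number of lines left in the file
remaining_lines="$(grep -c '' "$datafile")"
echo "$remaining_lines lines remain; original kept as $datafile.bak"
```
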

## awk - text processing

```bash
# Print specific columns (fields are split on whitespace by default)
awk '{print $1,$3}' data.txt

# Filter rows by a condition
awk '$3 > 100 {print $0}' sales.txt

# Use a custom field separator
awk -F',' '{print $2}' users.csv
```

**Use cases**:

- Counting status codes in server logs
- Processing CSV data

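
## awk: worked example

Both use cases can be sketched in a few lines. The log format, the CSV columns, and all values below are made-up sample data; the status code is assumed to be the fourth whitespace-separated field.

```bash
# Sample access log: client, method, path, status code
accesslog="$(mktemp)"
cat > "$accesslog" <<'EOF'
10.0.0.1 GET /index.html 200
10.0.0.2 GET /missing 404
10.0.0.1 POST /api 200
10.0.0.3 GET /index.html 200
10.0.0.2 GET /admin 500
EOF

# count[] is an associative array keyed by the status code (field 4);
# the END block runs once after the last line and prints one row per code
status_counts="$(awk '{count[$4]++} END {for (code in count) print code, count[code]}' "$accesslog" | sort)"
printf '%s\n' "$status_counts"

# Sample CSV with a header row
csvfile="$(mktemp)"
printf 'site,pollutant,value\nA,PM10,80\nB,PM10,120\nC,PM10,150\n' > "$csvfile"

# NR > 1 skips the header; print the site whenever the third field exceeds 100
high_sites="$(awk -F',' 'NR > 1 && $3 > 100 {print $1}' "$csvfile")"
printf '%s\n' "$high_sites"
```

The `for (code in count)` loop visits keys in an unspecified order, which is why the output is piped through `sort`.
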

## find & xargs - finding and processing files

```bash
# Find and delete
find . -name "*.tmp" -delete

# Find and process
find /var/log -name "*.log" | xargs ls -lh

# A more complex combination: list the .js files that contain "deprecated"
find src/ -type f -name "*.js" | xargs grep -l "deprecated"
```

**Use cases**:

- Cleaning up old temporary files
- Batch-processing project files

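
## find & xargs: worked example

Plain `find | xargs` splits file names on whitespace, which breaks on paths that contain spaces. The sketch below builds a scratch project (made-up files and names) and uses `-print0` with `xargs -0`, which GNU and BSD versions of both tools support, to pass names safely.

```bash
# Scratch project with made-up files; one directory name contains a space
workdir="$(mktemp -d)"
mkdir -p "$workdir/src/old code"
printf 'render(); // deprecated\n' > "$workdir/src/a.js"
printf 'draw();\n' > "$workdir/src/b.js"
printf 'paint(); // deprecated\n' > "$workdir/src/old code/c.js"
: > "$workdir/src/scratch.tmp"

# -print0 and -0 separate file names with NUL bytes, so spaces in paths are safe;
# grep -l prints only the names of the files that contain a match
matches="$(find "$workdir/src" -type f -name '*.js' -print0 | xargs -0 grep -l 'deprecated' | sort)"
printf '%s\n' "$matches"

# Remove the temporary file with find's own -delete action
find "$workdir" -name '*.tmp' -delete
```
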

## Code editors

- **VS Code**: a modern, lightweight editor
    - Rich extension ecosystem
    - Built-in Git support
- **Vim**: an efficient editor for the terminal

    ```bash
    vim file.txt
    ```

- **RStudio**: an integrated development environment for R
- **JupyterLab**: an interactive notebook environment


## Git version control

```bash
# Basic workflow
git clone https://repo.url
git add .
git commit -m "message"
git push

# Branch management
git checkout -b new-feature  # create a branch and switch to it
git merge main               # merge main into the current branch
```

**Use cases**:

- Collaborative development in a team
- Rolling back versions and tracking down problems


## MySQL database

```sql
-- Basic query
SELECT * FROM users WHERE age > 18;

-- Create a table
CREATE TABLE products (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    price DECIMAL(10, 2)
);

-- Modify data
INSERT INTO products (id, name) VALUES (1, 'Laptop');
UPDATE products SET price = 999 WHERE id = 1;
```

**Use cases**:

- Data storage for web applications
- Data analysis and report generation


# Obtaining public data

## Case study: national air quality data

```{bash}
#| eval: false
#!/bin/bash
logfn="${HOME}/service/log/nationalairquality/nationalairquality.log"
workdirfn="${HOME}/service/nationalairquality/"
mkdir -p "$(dirname "${logfn}")"
touch "${logfn}"

echo "$(date '+%Y-%m-%d %H:%M:%S'): downloading air quality data" >>"${logfn}"

declare -a citynames

# City names are passed to the API as-is
citynames=(北京市 石家庄市 秦皇岛市)

# The timestamp in the file name is year-day-month-hour (%Y%d%m%H)
jsonfn="${workdirfn}/nationalairquality_$(date '+%Y%d%m%H').json"
# Override: write to the current directory (remove this line to write under ${workdirfn})
jsonfn="nationalairquality_$(date '+%Y%d%m%H').json"
# [[ -f "${jsonfn}" ]] && rm "${jsonfn}"
echo "[" >"${jsonfn}"

for cityname in "${citynames[@]}"; do
  echo "Downloading air quality data for ${cityname}..." >>"${logfn}"
  # Pretty-print the response, drop the enclosing [ and ], and add a comma after each closing brace
  curl "https://air.cnemc.cn:18007/CityData/GetAQIDataPublishLive?cityName=${cityname}" \
    -H 'Accept: */*' \
    -H 'Accept-Language: en-US,en;q=0.9' \
    -H 'Connection: keep-alive' \
    -H 'Referer: https://air.cnemc.cn:18007/' \
    -H 'Sec-Fetch-Dest: empty' \
    -H 'Sec-Fetch-Mode: cors' \
    -H 'Sec-Fetch-Site: same-origin' \
    -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36' \
    -H 'X-Requested-With: XMLHttpRequest' \
    -H 'sec-ch-ua: "Google Chrome";v="119", "Chromium";v="119", "Not?A_Brand";v="24"' \
    -H 'sec-ch-ua-mobile: ?0' \
    -H 'sec-ch-ua-platform: "macOS"' \
    --compressed --silent | jq '.' | sed -e '1d' -e '$d' -e 's/^\( *}\)$/\1,/' >>"${jsonfn}"
done
# Remove the trailing comma after the last object.
# BSD/macOS sed syntax; GNU sed takes no separate argument: sed -i '$s/},/}/' "${jsonfn}"
sed -i '' '$s/},/}/' "${jsonfn}"
echo "]" >>"${jsonfn}"

echo "Air quality data download finished!" >>"${logfn}"
echo "Uploading to the database..." >>"${logfn}"
/home/ming/bin/getnationalairquality.R "${jsonfn}"
echo "Database upload finished!" >>"${logfn}"
```

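
## How the JSON merge works

The merge step in the script relies only on `sed`. The sketch below reproduces it offline: `printf` stands in for the `curl ... | jq` output, the `city` field and its values are made up, and the final in-place edit is replaced by a temporary file so that it does not depend on GNU or BSD `sed -i` syntax.

```bash
merged="$(mktemp)"
echo "[" > "$merged"

for city in A B; do
  # Stand-in for one pretty-printed API response: an array with a single object
  printf '[\n  {\n    "city": "%s"\n  }\n]\n' "$city" |
    # Drop the first line ("[") and the last line ("]"),
    # then add a comma after every line that is only a closing brace
    sed -e '1d' -e '$d' -e 's/^\( *}\)$/\1,/' >> "$merged"
done

# The last object must not end with a comma: fix only the last line ($)
sed '$s/},/}/' "$merged" > "$merged.tmp" && mv "$merged.tmp" "$merged"
echo "]" >> "$merged"

cat "$merged"
```

The result is a single JSON array that holds the objects from every response. The approach depends on the pretty-printed layout (each closing brace on its own line), so it is a text-level shortcut rather than a general JSON merge.
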

## R code

```{r}
#| eval: false
#| echo: true

#!/usr/bin/Rscript

# Path of the JSON file, passed as the first command-line argument
jsonfn <- commandArgs(TRUE)[1]
# jsonfn <- "~/nationalairquality_2023191219.json"
jsondf <- jsonlite::fromJSON(jsonfn, flatten = TRUE)
# metadf <- tibble::as_tibble(jsondf) |>
#   dplyr::select(site = StationCode, name = PositionName, Area, lon = Longitude, lat = Latitude) |>
#   dplyr::mutate(lon = as.numeric(lon), lat = as.numeric(lat))
# DBI::dbWriteTable(conn, "metadf", metadf, overwrite = TRUE, row.names = FALSE)

# Select the measurement columns and rename them to include their units
airqualitydf <- tibble::as_tibble(jsondf) |>
  dplyr::select(
    datetime = TimePoint,
    site = StationCode,
    `CO_mg/m3` = CO,
    `CO_24h_mg/m3` = CO_24h,
    `NO2_μg/m3` = NO2,
    `NO2_24h_μg/m3` = NO2_24h,
    `O3_μg/m3` = O3,
    `O3_24h_μg/m3` = O3_24h,
    `O3_8h_μg/m3` = O3_8h,
    `O3_8h_24h_μg/m3` = O3_8h_24h,
    `PM10_μg/m3` = PM10,
    `PM10_24h_μg/m3` = PM10_24h,
    `PM2.5_μg/m3` = PM2_5,
    `PM2.5_24h_μg/m3` = PM2_5_24h,
    `SO2_μg/m3` = SO2,
    `SO2_24h_μg/m3` = SO2_24h,
    `NO_μg/m3` = NO,
    `NO_24h_μg/m3` = NO_24h,
    `NOx_μg/m3` = NOx,
    `NOx_24h_μg/m3` = NOx_24h,
    AQI,
    COLevel,
    NO2Level,
    O3Level,
    O3_8hLevel,
    PM10Level,
    PM2_5Level,
    SO2Level,
    PrimaryPollutant,
    Quality,
    Unheathful
  ) |>
  # Parse the concentration and AQI columns as numbers
  dplyr::mutate(dplyr::across(
    `CO_mg/m3`:AQI,
    ~ round(readr::parse_number(.x), 4)
  )) |>
  dplyr::mutate(datetime = lubridate::as_datetime(datetime))

# Append the new rows to the database table
conn <- cctdb::get_dbconn("nationalairquality", writepermission = TRUE)
DBI::dbWriteTable(
  conn,
  "airqualitydf",
  airqualitydf,
  append = TRUE,
  row.names = FALSE
)
DBI::dbDisconnect(conn)
```


## Discussion welcome! {.center}

`r rmdify::slideend(wechat = FALSE, type = "public", tel = FALSE, thislink = "https://drc.drwater.net/course/public/RWEP/PUB/SD/")`

1765
SD/5.3_大数据分析工具/nationalairquality_2025090420.json
Normal file
File diff suppressed because it is too large
1765
SD/5.3_大数据分析工具/nationalairquality_2025090422.json
Normal file
File diff suppressed because it is too large