# Ours RP Prometheus / Grafana Monitor
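This directory provides a local development monitoring stack that scrapes the ours RP soak metrics exposed by `rpki_artifact_metrics`.
## Prerequisites
1. Docker + Docker Compose v2
2. `rpki_artifact_metrics` is already running on the host and listening on an address reachable from the Docker bridge, for example `0.0.0.0:9556`.
3. The Prometheus container reaches the host sidecar through `host.docker.internal:9556`.
On Linux Docker, the compose file already configures: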
```yaml
extra_hosts:
  - "host.docker.internal:host-gateway"
```
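Before starting the stack, it can help to confirm that something is actually listening on the sidecar port. The following is a minimal sketch and not part of this repository: `can_connect` is a hypothetical helper, and it only checks that a TCP connection succeeds, not that the endpoint serves metrics. Run it on the host against the local listener; `host.docker.internal` is intended to be resolved from inside the container.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # 9556 is the example sidecar port used in this README.
    print("sidecar reachable:", can_connect("127.0.0.1", 9556))
```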
## Start
```bash
cd rpki_2/rpki/monitor
docker compose up -d
```
By default the stack uses the official Docker Hub images:
```text
prom/prometheus:v2.55.1
grafana/grafana:11.3.1
```
To switch to another registry mirror:
```bash
PROMETHEUS_IMAGE=<mirror>/prom/prometheus:v2.55.1 \
GRAFANA_IMAGE=<mirror>/grafana/grafana:11.3.1 \
docker compose up -d
```
Default ports:
- Prometheus: <http://localhost:9090>
- Grafana: <http://localhost:3000>
- Grafana default credentials: `admin` / `admin`

If the ports conflict:
```bash
PROMETHEUS_PORT=19090 GRAFANA_PORT=13000 docker compose up -d
```
Prometheus keeps 7 days of data by default; override this with `PROMETHEUS_RETENTION`:
```bash
PROMETHEUS_RETENTION=7d docker compose up -d
```
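## Long-running stability test
The portable soak package ships `run_24h_soak_with_metrics.sh`, which runs ours RP continuously, starts the metrics sidecar, starts this monitoring stack, and generates a report every hour: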
```bash
cd /path/to/portable-soak
SOAK_DURATION_SECS=0 \
HOURLY_REPORT_INTERVAL_SECS=3600 \
SOAK_RETAIN_RUNS=100 \
CLEAN_TMP_AFTER_RUN=1 \
PROMETHEUS_RETENTION=7d \
STOP_MONITOR_STACK_ON_EXIT=0 \
FEISHU_WEBHOOK_SCRIPT=/home/yuyr/.codex/skills/user/feishu-webhook/scripts/send_feishu_text.py \
./run_24h_soak_with_metrics.sh
```
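`SOAK_DURATION_SECS=0` means the soak runs indefinitely and never stops on its own. To stop naturally after 24 hours, set it to `86400`; the script waits for the current run to finish before exiting rather than killing a validation round midway.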
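This stop behaviour can be pictured as a loop that checks the deadline only between runs. The sketch below is illustrative and is not the code in `run_24h_soak_with_metrics.sh`; `soak_loop`, `run_once`, and `clock` are hypothetical names.

```python
import time
from typing import Callable


def soak_loop(run_once: Callable[[], None],
              duration_secs: int,
              clock: Callable[[], float] = time.monotonic) -> int:
    """Run run_once repeatedly and return the number of completed runs.

    duration_secs == 0 means no deadline (loop until interrupted).
    The deadline is checked only between runs, so a run that has
    already started is always allowed to finish.
    """
    start = clock()
    completed = 0
    while True:
        if duration_secs > 0 and clock() - start >= duration_secs:
            return completed
        run_once()
        completed += 1
```

With a deadline of 100 seconds and runs that take 40 seconds each, the third run starts at 80 seconds and is allowed to finish at 120 seconds before the loop exits.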
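Key artifacts:
- `runs/run_xxxx/`: raw artifacts of the most recent 100 runs;
- `hourly_reports/hour_*.md`: hourly reports;
- `hourly_reports/hourly_summary.jsonl`: structured hourly summary;
- `incident_runs/run_xxxx/`: preserved copies of abnormal runs;
- `logs/metrics.*`, `logs/24h-soak.*`, `logs/hourly-reporter.*`: runtime logs.

For short-cycle integration testing, lower `SOAK_DURATION_SECS` and `HOURLY_REPORT_INTERVAL_SECS`, and set `FEISHU_DRY_RUN=1` to avoid sending real Feishu notifications.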
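The `hourly_reports/hourly_summary.jsonl` file can be inspected with a few lines of Python. The sketch below assumes the usual JSON Lines layout (one JSON object per line) and skips blank lines; the field names inside each record are not documented here, so it makes no assumption about them. `load_hourly_summary` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any


def load_hourly_summary(path: str) -> list[dict[str, Any]]:
    """Parse a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```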
## Stop
```bash
cd rpki_2/rpki/monitor
docker compose down
```
This keeps the data volumes. To remove the data as well:
```bash
docker compose down -v
```
## Typical local integration commands
First start the APNIC soak and the metrics sidecar, for example:
```bash
# Key soak .env settings
MAX_RUNS=-1
RIRS=apnic
RETAIN_RUNS=5
INTERVAL_SECS=0
# metrics sidecar
rpki_artifact_metrics \
  --run-root /path/to/portable-soak \
  --listen 0.0.0.0:9556 \
  --poll-secs 5 \
  --instance local-apnic-continuous
```
Then start the monitoring stack:
```bash
cd rpki_2/rpki/monitor
docker compose up -d
```
## Verification
Prometheus targets:
```bash
curl -s 'http://localhost:9090/api/v1/targets' | python3 -m json.tool
```
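The raw targets payload is verbose, so a short script can reduce it to one line per target. This is a sketch under the assumption that Prometheus is reachable on the default port above; `fetch_targets` and `summarize_targets` are hypothetical helper names, while the `data.activeTargets`, `labels.job`, `scrapeUrl`, and `health` fields come from the Prometheus HTTP API.

```python
import json
import urllib.request


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the raw /api/v1/targets payload from Prometheus."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def summarize_targets(payload: dict) -> list[tuple[str, str, str]]:
    """Return (job, scrapeUrl, health) for every active target."""
    rows = []
    for target in payload.get("data", {}).get("activeTargets", []):
        job = target.get("labels", {}).get("job", "")
        rows.append((job, target.get("scrapeUrl", ""), target.get("health", "unknown")))
    return rows
```

Calling `summarize_targets(fetch_targets())` against a running stack lists each scrape job; a target is being scraped successfully when its health is `up`.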
Prometheus queries:
```bash
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="ours-rp-artifact-metrics"}'
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=ours_rp_run_completed_total{status="success"}'
```
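The same instant queries can be issued and decoded from Python when the result needs to be checked in a script. This is a minimal sketch, assuming Prometheus on the default port above; `query_url`, `instant_values`, and `run_query` are hypothetical helper names, and the response layout (`status`, `data.resultType`, `data.result[].metric`, `data.result[].value`) follows the Prometheus HTTP API.

```python
import json
import urllib.parse
import urllib.request


def query_url(expr: str, base_url: str = "http://localhost:9090") -> str:
    """Build the instant-query URL, URL-encoding the PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def instant_values(payload: dict) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...labels}, "value": [timestamp, "string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def run_query(expr: str, base_url: str = "http://localhost:9090") -> list[tuple[dict, float]]:
    """Run an instant query against Prometheus and decode the samples."""
    with urllib.request.urlopen(query_url(expr, base_url), timeout=5) as resp:
        return instant_values(json.load(resp))
```

For example, `run_query('up{job="ours-rp-artifact-metrics"}')` returns one pair per scraped instance, with value `1.0` when the last scrape succeeded.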
Grafana health:
```bash
curl -s http://localhost:3000/api/health | python3 -m json.tool
```
Grafana dashboard:
- Open <http://localhost:3000/d/ours-rp-soak-overview/ours-rp-soak-overview>
## Key metrics
- `ours_rp_metrics_service_up`
- `ours_rp_run_completed_total`
- `ours_rp_run_duration_seconds`
- `ours_rp_run_max_rss_bytes`
- `ours_rp_vrps{kind="total|unique"}`: `total` is the number of VRP entries before deduplication; `unique` is deduplicated by `(ASN, IP Prefix, Max Length)`.
- `ours_rp_vaps`
- `ours_rp_publication_points`
- `ours_rp_repo_sync_phase_count`
- `ours_rp_large_publication_points{object_count_gt="10|50|100|..."}`
- `ours_rp_cir_objects`
- `ours_rp_ccr_state_items`
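The difference between the `total` and `unique` values of `ours_rp_vrps` can be illustrated with a short sketch. It is not the sidecar's implementation: `vrp_key` and `count_vrps` are hypothetical names, and the input shape (string triples that may carry an `AS` prefix) is an assumption about how the rows might look.

```python
import ipaddress
from typing import Iterable


def vrp_key(asn: str, prefix: str, max_length: str) -> tuple[int, str, int]:
    """Normalize one VRP row into a comparable (ASN, prefix, max length) key."""
    asn_num = int(asn.strip().upper().removeprefix("AS"))
    # ip_network canonicalizes the textual form (e.g. lower-case IPv6).
    network = ipaddress.ip_network(prefix.strip())
    return (asn_num, str(network), int(max_length))


def count_vrps(rows: Iterable[tuple[str, str, str]]) -> dict[str, int]:
    """Return the raw row count and the count of distinct VRP keys."""
    total = 0
    unique = set()
    for asn, prefix, max_length in rows:
        total += 1
        unique.add(vrp_key(asn, prefix, max_length))
    return {"total": total, "unique": len(unique)}
```

Two rows that differ only in notation (`AS64496` versus `64496`, or `2001:DB8::/32` versus `2001:db8::/32`) collapse into one unique VRP, while a different max length yields a separate entry.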
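## Inter-RP continuous comparison monitoring
`rpki_inter_rp_metrics` aggregates the latest artifacts from three RPs:
- ours RP: reads `runs/run_xxxx/run-summary.json`, `result.ccr`, and the CSV artifacts from the current portable soak;
- Routinator: reads `routinator/latest/run-meta.json`, `vrps.csv`, and `vaps.csv` synced from remote 200;
- rpki-client 9.8: reads `rpki-client/latest/run-meta.json`, `vrps.csv`, `vaps.csv`, and `result.ccr` synced from remote 200.

Example of starting the sidecar on remote 231: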
```bash
rpki_inter_rp_metrics \
  --ours-run-root /root/rpki_20260608_2_feature062_24h_20260608T075547Z/portable-soak \
  --peer-root /root/inter-rp-aggregator/synced-from-200 \
  --listen 0.0.0.0:9557 \
  --poll-secs 30 \
  --instance remote231-inter-rp
```
Prometheus now includes an `ours-rp-inter-rp-metrics` scrape job, which targets `host.docker.internal:9557` by default.
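To see what the scrape job will receive, the sidecar output can be read directly and parsed as Prometheus text exposition format. This is a rough sketch: `parse_samples` and `fetch_samples` are hypothetical helpers, the `/metrics` path is an assumption (it is Prometheus' default `metrics_path`, not something this README states), and the parser keeps labels as a raw string rather than decoding them.

```python
import re
import urllib.request

# name, optional {labels}, value; an optional trailing timestamp is ignored.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_samples(text: str) -> list[tuple[str, str, float]]:
    """Parse exposition text into (name, raw_labels, value) tuples.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, labels or "", float(value)))
    return samples


def fetch_samples(url: str) -> list[tuple[str, str, float]]:
    """Download and parse a metrics page, e.g. http://localhost:9557/metrics (assumed path)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_samples(resp.read().decode("utf-8"))
```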
The remote 200 runner and the remote 231 sync scripts are located at:
```text
scripts/inter_rp/run_remote200_rp_loops.sh
scripts/inter_rp/run_single_rp_with_rss.sh
scripts/inter_rp/sync_remote200_to_231.sh
scripts/inter_rp/run_inter_rp_metrics_sidecar.sh
scripts/inter_rp/inter-rp.env.example
```
To start or stop Routinator or rpki-client on remote 200 independently from the local machine, use:
```bash
scripts/inter_rp/control_remote200_rp.sh status all
scripts/inter_rp/control_remote200_rp.sh stop routinator
scripts/inter_rp/control_remote200_rp.sh start routinator
scripts/inter_rp/control_remote200_rp.sh restart rpki-client
```
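The default remote is `root@43.110.128.200` and can be overridden with `REMOTE_HOST=...`. The script manages only the loop and the current child process of the specified RP; it does not affect the other RP.
Key metrics: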
- `inter_rp_run_wall_seconds{rp="ours-rp|routinator|rpki-client"}`
- `inter_rp_run_max_rss_bytes{rp="...",kind="aggregate_peak"}`
- `inter_rp_vrps{rp="..."}`: deduplicated by `(ASN, IP Prefix, Max Length)`.
- `inter_rp_vaps{rp="..."}`: deduplicated by `(Customer ASN, Providers)`. Routinator values are converted from its `--enable-aspa` JSON output; rpki-client values are converted from its `-j` JSON output.
- `inter_rp_ccr_digest_match{left="ours-rp",right="rpki-client",state="overall|mfts|vrps|vaps|tas|rks"}`
- `inter_rp_sync_age_seconds`
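Because the JSON outputs of different RPs may list providers in different orders, a comparable `(Customer ASN, Providers)` key needs a canonical provider form. The sketch below shows one way to build it; treating the providers as an unordered set is an assumption, not something this README specifies, and `vap_key` and `unique_vaps` are hypothetical names.

```python
from typing import Iterable


def vap_key(customer_asn: int, providers: Iterable[int]) -> tuple[int, tuple[int, ...]]:
    """Build a (customer ASN, providers) key with providers sorted and de-duplicated."""
    return (customer_asn, tuple(sorted(set(providers))))


def unique_vaps(records: Iterable[tuple[int, Iterable[int]]]) -> set[tuple[int, tuple[int, ...]]]:
    """Collapse VAP records that share the same customer and provider set."""
    return {vap_key(customer, providers) for customer, providers in records}
```

With this key, two records for the same customer that list the same providers in a different order count once, while a record with a different provider set stays separate.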
Grafana dashboard:
- <http://localhost:3000/d/ours-rp-inter-rp/ours-rp-inter-rp>