# Ours RP Prometheus / Grafana Monitor

This directory provides a local development monitoring stack that scrapes the ours RP soak metrics exposed by `rpki_artifact_metrics`.

## Prerequisites

1. Docker and Docker Compose v2.
2. `rpki_artifact_metrics` is running on the host and listens on an address reachable from the Docker bridge, for example `0.0.0.0:9556`.
3. The Prometheus container reaches the host sidecar through `host.docker.internal:9556`.

On Linux Docker, the compose file already configures:

```yaml
extra_hosts:
  - host.docker.internal:host-gateway
```

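Before starting the stack, it can help to confirm that the sidecar is actually listening. The following is a minimal sketch, not part of this repository: `is_port_open` is a hypothetical helper, and a successful check against `127.0.0.1` only shows that something accepts TCP connections on the port, not that the Docker bridge can reach it.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


# Example call, using the sidecar port from the prerequisites:
#   is_port_open("127.0.0.1", 9556)
```

If the check fails on the host itself, fix the sidecar first; if it passes but Prometheus still reports the target as down, the listen address or the `extra_hosts` mapping is the more likely cause.
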
## Start

```bash
cd rpki_2/rpki/monitor
docker compose up -d
```

The default images are the official Docker Hub images:

```text
prom/prometheus:v2.55.1
grafana/grafana:11.3.1
```

To switch to another image mirror:

```bash
PROMETHEUS_IMAGE=<mirror>/prom/prometheus:v2.55.1 \
GRAFANA_IMAGE=<mirror>/grafana/grafana:11.3.1 \
docker compose up -d
```

Default ports:

- Prometheus: <http://localhost:9090>
- Grafana: <http://localhost:3000>
- Grafana default credentials: `admin` / `admin`

If the ports conflict with other services:

```bash
PROMETHEUS_PORT=19090 GRAFANA_PORT=13000 docker compose up -d
```

Prometheus retains 7 days of data by default; override this with `PROMETHEUS_RETENTION`:

```bash
PROMETHEUS_RETENTION=7d docker compose up -d
```

## Long-Running Stability Test

The portable soak package ships `run_24h_soak_with_metrics.sh`, which runs ours RP continuously, starts the metrics sidecar, starts this monitoring stack, and generates a report every hour:

```bash
cd /path/to/portable-soak
SOAK_DURATION_SECS=0 \
HOURLY_REPORT_INTERVAL_SECS=3600 \
SOAK_RETAIN_RUNS=100 \
CLEAN_TMP_AFTER_RUN=1 \
PROMETHEUS_RETENTION=7d \
STOP_MONITOR_STACK_ON_EXIT=0 \
FEISHU_WEBHOOK_SCRIPT=/home/yuyr/.codex/skills/user/feishu-webhook/scripts/send_feishu_text.py \
./run_24h_soak_with_metrics.sh
```

`SOAK_DURATION_SECS=0` means the soak runs indefinitely and never stops on its own. For a natural stop after 24 hours, set it to `86400`; the script waits for the current run to finish before exiting and does not kill a validation round midway.

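The stop behavior described above can be pictured as a loop that only checks the deadline between runs. This is an illustrative sketch under assumptions, not the script's actual implementation: `soak_loop`, `run_once`, and the injectable `now` clock are hypothetical names, and the real script may check the deadline at a different point.

```python
import time
from typing import Callable


def soak_loop(
    run_once: Callable[[], None],
    duration_secs: float,
    now: Callable[[], float] = time.monotonic,
) -> int:
    """Run run_once repeatedly and return the number of completed runs.

    duration_secs == 0 means no deadline (the loop never returns).
    """
    start = now()
    completed = 0
    while True:
        run_once()  # a run always finishes; it is never interrupted midway
        completed += 1
        # The deadline is only evaluated between runs.
        if duration_secs > 0 and now() - start >= duration_secs:
            return completed
```

With a deadline of `86400` and runs that each take a few minutes, the last run may therefore end somewhat after the 24-hour mark.
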
Key artifacts:

- `runs/run_xxxx/`: raw artifacts of the most recent 100 runs;
- `hourly_reports/hour_*.md`: hourly reports;
- `hourly_reports/hourly_summary.jsonl`: structured hourly summaries;
- `incident_runs/run_xxxx/`: preserved copies of abnormal runs;
- `logs/metrics.*`, `logs/24h-soak.*`, `logs/hourly-reporter.*`: run logs.

For short-cycle integration testing, reduce `SOAK_DURATION_SECS` and `HOURLY_REPORT_INTERVAL_SECS`, and set `FEISHU_DRY_RUN=1` to avoid sending real Feishu notifications.

## Stop

```bash
cd rpki_2/rpki/monitor
docker compose down
```

This keeps the data volumes. To delete the data as well:

```bash
docker compose down -v
```

## Typical Local Integration Commands

First start the APNIC soak and the metrics sidecar, for example:

```bash
# Key settings in the soak .env
MAX_RUNS=-1
RIRS=apnic
RETAIN_RUNS=5
INTERVAL_SECS=0

# metrics sidecar
rpki_artifact_metrics \
  --run-root /path/to/portable-soak \
  --listen 0.0.0.0:9556 \
  --poll-secs 5 \
  --instance local-apnic-continuous
```

Then start the monitoring stack:

```bash
cd rpki_2/rpki/monitor
docker compose up -d
```

## Verification

Prometheus targets:

```bash
curl -s 'http://localhost:9090/api/v1/targets' | python3 -m json.tool
```

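The targets response can be long, so filtering it down to failing targets is often more useful than reading the full JSON. A minimal sketch, assuming the standard Prometheus `/api/v1/targets` response shape (`data.activeTargets[]` with `labels`, `scrapeUrl`, `health`, and `lastError`); `unhealthy_targets` is a hypothetical helper name.

```python
from typing import Any


def unhealthy_targets(payload: dict[str, Any]) -> list[tuple[str, str, str]]:
    """Return (job, scrape URL, last error) for each active target that is not up."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", ""),
            target.get("scrapeUrl", ""),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") != "up"
    ]
```

Pass it the decoded JSON from the `curl` command above; an empty list means every active target is currently healthy.
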
Prometheus queries:

```bash
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="ours-rp-artifact-metrics"}'

curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=ours_rp_run_completed_total{status="success"}'
```

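Instant queries return each sample as a label set plus a `[timestamp, "value"]` pair, with the value encoded as a string. The sketch below, assuming the standard Prometheus instant-query response format, converts such a response into `(labels, float)` pairs; `instant_vector` is a hypothetical helper name.

```python
from typing import Any


def instant_vector(payload: dict[str, Any]) -> list[tuple[dict[str, str], float]]:
    """Return (labels, value) pairs from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample looks like {"metric": {...labels...}, "value": [timestamp, "1"]}.
    return [(sample["metric"], float(sample["value"][1])) for sample in data["result"]]
```

For the `up{job="ours-rp-artifact-metrics"}` query, a value of `1.0` means the last scrape succeeded and `0.0` means it failed; an empty list means Prometheus has no such series yet.
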
Grafana health:

```bash
curl -s http://localhost:3000/api/health | python3 -m json.tool
```

Grafana dashboard:

- Open <http://localhost:3000/d/ours-rp-soak-overview/ours-rp-soak-overview>

## Main Metrics

- `ours_rp_metrics_service_up`
- `ours_rp_run_completed_total`
- `ours_rp_run_duration_seconds`
- `ours_rp_run_max_rss_bytes`
- `ours_rp_vrps{kind="total|unique"}`: `total` is the number of VRP entries before deduplication; `unique` is the count after deduplicating by `(ASN, IP Prefix, Max Length)`.
- `ours_rp_vaps`
- `ours_rp_publication_points`
- `ours_rp_repo_sync_phase_count`
- `ours_rp_large_publication_points{object_count_gt="10|50|100|..."}`
- `ours_rp_cir_objects`
- `ours_rp_ccr_state_items`

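The difference between the `total` and `unique` kinds of `ours_rp_vrps` can be illustrated with a small example. This is a sketch of the counting rule only, not the sidecar's code: `Vrp` and `count_vrps` are hypothetical names, and prefixes are assumed to already be in one canonical textual form.

```python
from typing import Iterable, NamedTuple


class Vrp(NamedTuple):
    asn: int
    prefix: str
    max_length: int


def count_vrps(vrps: Iterable[Vrp]) -> dict[str, int]:
    """Count VRP entries before and after deduplication by (ASN, prefix, max length)."""
    rows = list(vrps)
    # NamedTuples hash by value, so a set keeps one entry per distinct triple.
    return {"total": len(rows), "unique": len(set(rows))}


sample = [
    Vrp(64496, "192.0.2.0/24", 24),
    Vrp(64496, "192.0.2.0/24", 24),  # same triple, e.g. emitted by two ROAs
    Vrp(64496, "192.0.2.0/24", 25),  # different max length, so a distinct VRP
]
counts = count_vrps(sample)
```

Here `counts["total"]` is 3 and `counts["unique"]` is 2, because only the first two entries share the full triple.
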
## Inter-RP Continuous Comparison Monitoring

`rpki_inter_rp_metrics` aggregates the latest artifacts from three RPs:

- ours RP: reads `runs/run_xxxx/run-summary.json`, `result.ccr`, and the CSV artifacts of the current portable soak;
- Routinator: reads `routinator/latest/run-meta.json`, `vrps.csv`, and `vaps.csv` synced from remote 200;
- rpki-client 9.8: reads `rpki-client/latest/run-meta.json`, `vrps.csv`, `vaps.csv`, and `result.ccr` synced from remote 200.

Example of starting the sidecar on remote 231:

```bash
rpki_inter_rp_metrics \
  --ours-run-root /root/rpki_20260608_2_feature062_24h_20260608T075547Z/portable-soak \
  --peer-root /root/inter-rp-aggregator/synced-from-200 \
  --listen 0.0.0.0:9557 \
  --poll-secs 30 \
  --instance remote231-inter-rp
```

Prometheus now includes an `ours-rp-inter-rp-metrics` scrape job, which targets `host.docker.internal:9557` by default.

The remote 200 runner scripts and the remote 231 sync scripts are located at:

```text
scripts/inter_rp/run_remote200_rp_loops.sh
scripts/inter_rp/run_single_rp_with_rss.sh
scripts/inter_rp/sync_remote200_to_231.sh
scripts/inter_rp/run_inter_rp_metrics_sidecar.sh
scripts/inter_rp/inter-rp.env.example
```

To start or stop Routinator or rpki-client on remote 200 independently from the local machine, use:

```bash
scripts/inter_rp/control_remote200_rp.sh status all
scripts/inter_rp/control_remote200_rp.sh stop routinator
scripts/inter_rp/control_remote200_rp.sh start routinator
scripts/inter_rp/control_remote200_rp.sh restart rpki-client
```

The default remote is `root@43.110.128.200`; override it with `REMOTE_HOST=...`. The script manages only the loop and the current child process of the specified RP and does not affect the other RP.

Key metrics:

- `inter_rp_run_wall_seconds{rp="ours-rp|routinator|rpki-client"}`
- `inter_rp_run_max_rss_bytes{rp="...",kind="aggregate_peak"}`
- `inter_rp_vrps{rp="..."}`: deduplicated by `(ASN, IP Prefix, Max Length)`.
- `inter_rp_vaps{rp="..."}`: deduplicated by `(Customer ASN, Providers)`; Routinator data is converted from its `--enable-aspa` JSON output, and rpki-client data is converted from its `-j` JSON output.
- `inter_rp_ccr_digest_match{left="ours-rp",right="rpki-client",state="overall|mfts|vrps|vaps|tas|rks"}`
- `inter_rp_sync_age_seconds`

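Because each RP emits ASPA data in its own JSON layout, comparing VAP counts only makes sense after the entries are reduced to a common key. The sketch below shows one way to deduplicate by `(Customer ASN, Providers)`. It is an assumption-laden illustration rather than the sidecar's conversion code: `vap_key` and `count_unique_vaps` are hypothetical names, and treating the provider list as an unordered set is an assumption.

```python
from typing import Iterable, Sequence


def vap_key(customer_asn: int, providers: Sequence[int]) -> tuple[int, frozenset[int]]:
    """Build an order-insensitive key for one VAP entry."""
    return (customer_asn, frozenset(providers))


def count_unique_vaps(entries: Iterable[tuple[int, Sequence[int]]]) -> int:
    """Count VAPs after deduplicating by (customer ASN, provider set)."""
    return len({vap_key(customer, providers) for customer, providers in entries})


entries = [
    (64496, [64500, 64501]),
    (64496, [64501, 64500]),  # same providers in a different order
    (64496, [64500]),         # different provider set, so a distinct VAP
]
unique_vaps = count_unique_vaps(entries)
```

With these three entries `unique_vaps` is 2, since the first two differ only in provider order.
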
Grafana dashboard:

- <http://localhost:3000/d/ours-rp-inter-rp/ours-rp-inter-rp>