[#13] Add README for the system integration tests

yuyr 2025-10-10 09:20:55 +00:00
parent 797e6a38ab
commit 6451c32d6b
7 changed files with 127 additions and 10 deletions

src/sys/tests/README.md (new file, 119 lines added)

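@@ -0,0 +1,119 @@
# ARGUS System-Level End-to-End Tests (Sys E2E)
This directory contains the system-level end-to-end tests that merge the log and agent verification tracks into a single suite. The tests depend on bind/master/es/kibana plus two "log nodes" (each node container runs both Fluent Bit and argus-agent).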
---
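## 1. How to run
- Prerequisites
  - Images already built: `argus-elasticsearch:latest`, `argus-kibana:latest`, `argus-bind9:latest`, `argus-master:latest`
    - They can be built from the repository root with `./build/build_images.sh [--intranet]`
  - The host has Docker and Docker Compose installed.
- One-shot run
  - `cd src/sys/tests`
  - `./scripts/00_e2e_test.sh`
- Step-by-step run (recommended for troubleshooting)
  - `./scripts/01_bootstrap.sh`: creates the directories, copies `update-dns.sh`, builds the agent binary, and writes `.env`
  - `./scripts/02_up.sh`: starts the Compose stack (project name `argus-sys`)
  - `./scripts/03_wait_ready.sh`: waits for ES/Kibana/Master/Fluent Bit/Bind to become ready (Kibana must return 200 with `overall.level=available`)
  - `./scripts/04_verify_dns_routing.sh`: checks bind resolution and domain name resolution inside the nodes
  - `./scripts/05_agent_register.sh`: fetches the `node_id` and initial IP of both nodes and checks the local `node.json`
  - `./scripts/06_write_health_and_assert.sh`: writes health files and asserts that `nodes.json` contains only the 2 online nodes
  - `./scripts/07_logs_send_and_assert.sh`: writes logs on both nodes and asserts that the ES `train-*`/`infer-*` counts grow
  - `./scripts/08_restart_agent_reregister.sh`: switches `node-b` to the fixed IP `172.29.0.200` and verifies that the node ID stays the same while the IP and timestamp are updated
  - `./scripts/09_down.sh`: removes the containers and networks and cleans up `private*/` and `tmp/`
- Resetting the environment
  - If any stage fails, run `./scripts/09_down.sh` and then rerun from `01` onward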
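
The step-by-step flow above can be sketched as a small runner that stops at the first failing script. This is only an illustration: `run_steps` is a hypothetical helper, and it is not claimed to match what `00_e2e_test.sh` actually does.

```shell
# Hypothetical helper (not part of the repository): run the given step
# scripts from a directory in order and stop at the first failure.
run_steps() {
  local dir="$1"; shift
  local step
  for step in "$@"; do
    echo "[RUN] $step"
    if ! "$dir/$step"; then
      echo "[ERR] $step failed; run ./scripts/09_down.sh, then restart from 01" >&2
      return 1
    fi
  done
  echo "[OK] all steps passed"
}

# Usage from src/sys/tests (script names taken from the list above):
#   run_steps ./scripts 01_bootstrap.sh 02_up.sh 03_wait_ready.sh \
#     04_verify_dns_routing.sh 05_agent_register.sh \
#     06_write_health_and_assert.sh 07_logs_send_and_assert.sh \
#     08_restart_agent_reregister.sh
```

Running the steps one at a time like this keeps the stack up after a failure, so the containers can still be inspected before `09_down.sh` is called.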
---
## 2. Test deployment architecture (docker-compose)
- Network
  - Custom bridge network `argus-sys-net`, subnet `172.29.0.0/16`
  - Fixed addresses: bind=`172.29.0.2`, master=`172.29.0.10`
- Services and ports
  - `bind` (`argus-bind9:latest`): listens on 53/tcp+udp and syncs the `*.argus.com` records
  - `master` (`argus-master:latest`): published as `32300→3000`; API at `http://localhost:32300`
  - `es` (`argus-elasticsearch:latest`): `9200→9200`; single node, security disabled
  - `kibana` (`argus-kibana:latest`): `5601→5601`; reaches ES via `ELASTICSEARCH_HOSTS=http://es:9200`
  - `node-a` (`ubuntu:22.04`): runs both Fluent Bit and argus-agent; `hostname=dev-yyrshare-nbnyx10-cp2f-pod-0`; `2020→2020`
  - `node-b` (`ubuntu:22.04`): runs both Fluent Bit and argus-agent; `hostname=dev-yyrshare-uuuu10-ep2f-pod-0`; `2021→2020`
- Volumes and directories
  - The core services (bind/master/es/kibana) share the host directory `./private`, mounted at `/private` in each container
  - The two nodes use their own data volumes, which are not shared with the core services:
    - node-a: `./private-nodea/argus/agent/<HOST> → /private/argus/agent/<HOST>`
    - node-b: `./private-nodeb/argus/agent/<HOST> → /private/argus/agent/<HOST>`
  - The Fluent Bit and agent assets for the node containers are mounted read-only at `/assets` and `/usr/local/bin/argus-agent`
- DNS configuration
  - Node containers point at bind through the compose setting `dns: [172.29.0.2]`; they neither mount `/etc/resolv.conf` nor depend on `update-dns.sh`
  - master/es/kibana still share `./private`; on startup, master writes `/private/argus/etc/master.argus.com` so that bind can sync the A record
- Node entrypoint
  - `scripts/node_entrypoint.sh`:
    - Copies `/assets/fluent-bit/*` into the container's `/private` and starts Fluent Bit in the background (listening on 2020)
    - Starts `argus-agent` in the foreground as the runtime user (mapped UID/GID)
  - Node environment variables: `MASTER_ENDPOINT=http://master.argus.com:3000`, `REPORT_INTERVAL_SECONDS=2`, `ES_HOST=es`, `ES_PORT=9200`, `CLUSTER=local`, `RACK=dev`
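
Because the nodes resolve names through bind at `172.29.0.2` and master has the fixed address `172.29.0.10`, the DNS wiring can be spot-checked from inside a node. The following is a minimal sketch; `first_ip` and `assert_ip` are hypothetical helpers, and the `docker compose` invocation in the usage comment assumes the service is named `node-a` in the compose file under `src/sys/tests`.

```shell
# Print the address column of the first line of `getent hosts` output (stdin).
first_ip() {
  awk 'NR == 1 { print $1 }'
}

# Hypothetical helper: compare a resolved address with the expected one.
assert_ip() {
  local name="$1" expected="$2" actual="$3"
  if [ "$actual" = "$expected" ]; then
    echo "[OK] $name -> $actual"
  else
    echo "[ERR] $name resolved to '$actual', expected $expected" >&2
    return 1
  fi
}

# Usage against a running stack (assumption: run from src/sys/tests):
#   ip="$(docker compose -p argus-sys exec -T node-a \
#         getent hosts master.argus.com | first_ip)"
#   assert_ip master.argus.com 172.29.0.10 "$ip"
```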
---
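## 3. Scripts and what they verify
- `01_bootstrap.sh`
  - Purpose: prepares the directory structure, fixes the ownership of the ES/Kibana data directories, distributes `update-dns.sh` (used by the core services only), builds the agent binary, and writes `.env`
  - Troubleshooting: if ES cannot write data, rerun this step so that the directories are owned by the configured UID/GID
- `02_up.sh`
  - Purpose: starts the full stack under the project name `argus-sys`; old stacks and networks are cleaned up automatically
- `03_wait_ready.sh`
  - Purpose: waits for the key ports and health endpoints to become available
  - Pass criteria:
    - ES `/_cluster/health?wait_for_status=yellow` succeeds
    - Kibana `GET /api/status` returns 200 with `overall.level=available`
    - Master `/readyz` succeeds
    - The Fluent Bit metrics endpoints on `:2020` and `:2021` are reachable
    - bind passes `named-checkconf`
- `04_verify_dns_routing.sh`
  - Purpose: verifies the resolution path from bind to the node containers
  - Pass criteria:
    - `private/argus/etc/master.argus.com` exists and contains the master IP
    - Inside node-a and node-b, `getent hosts master.argus.com` resolves to the master IP
- `05_agent_register.sh`
  - Purpose: confirms that both nodes register with master and persist `node.json`
  - Outputs: `tmp/node_id_a|b`, `tmp/initial_ip_a|b`, `tmp/detail_*.json`
- `06_write_health_and_assert.sh`
  - Purpose: simulates node health reporting and confirms it is visible on the master side; `nodes.json` keeps only online nodes
  - Action: writes `log-fluentbit.json` and `metric-node-exporter.json` into the health directory of both nodes
- `07_logs_send_and_assert.sh`
  - Purpose: ingests two kinds of logs into ES through Fluent Bit; the counts must grow over the baseline and reach the threshold (≥4)
  - Also checks that ES health is `green|yellow`
- `08_restart_agent_reregister.sh`
  - Purpose: verifies that after a node restart with an IP change the node keeps the same `id` and updates `meta_data.ip` and `last_updated`
  - Action: recreates node-b with the fixed IP `172.29.0.200`, then polls until the checks pass
- `09_down.sh`
  - Purpose: tears down the stack and cleans up the environment; when necessary, a temporary container fixes ownership before the `private*` directories are deleted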
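
Several of the scripts above (readiness checks, re-registration polling) come down to retrying a command until it succeeds or a limit is reached. A minimal sketch of that pattern follows; `wait_until` is a hypothetical helper and is not taken from the actual scripts.

```shell
# Hypothetical helper: retry a command up to <attempts> times, sleeping
# <delay> seconds between tries. Succeeds as soon as the command succeeds.
wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "[OK] ready after $i attempt(s): $*"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "[ERR] not ready after $attempts attempt(s): $*" >&2
  return 1
}

# Usage against a running stack (ports as published in section 2):
#   wait_until 60 5 curl -fsS \
#     'http://localhost:9200/_cluster/health?wait_for_status=yellow'
#   wait_until 180 5 curl -fsS 'http://localhost:5601/api/status'
```

Note that the Kibana call here only checks for a successful HTTP response; the `overall.level=available` condition would need an additional check on the response body.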
---
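### Common issues and troubleshooting
- Kibana returns 503 for a long time: initialization takes longer on slow machines, and the script waits up to ~15 minutes; first confirm that ES is ready.
- Fluent Bit metrics are not ready: check the node container logs and whether the environment variables `CLUSTER`/`RACK` are set; confirm that the entrypoint script has copied the assets to `/private`.
- ES fails to start: this is usually a host directory permission problem; rerun `01_bootstrap.sh`, or manually run `chown -R <UID:GID> src/sys/tests/private/argus/log/*`.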
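
Before running `chown` by hand, it can help to see who currently owns the data directories. The sketch below uses hypothetical helpers (`owner_of`, `check_owner`); the expected `<UID:GID>` is whatever the bootstrap step was configured with and is left as a placeholder here.

```shell
# Print "uid:gid" for a path, using numeric IDs from `ls -ldn`.
owner_of() {
  ls -ldn "$1" | awk '{ print $3 ":" $4 }'
}

# Hypothetical helper: compare a directory's owner with the expected UID:GID.
check_owner() {
  local dir="$1" expected="$2" actual
  actual="$(owner_of "$dir")"
  if [ "$actual" = "$expected" ]; then
    echo "[OK] $dir is owned by $actual"
  else
    echo "[WARN] $dir is owned by $actual, expected $expected" >&2
    return 1
  fi
}

# Usage (replace <UID:GID> with the configured value):
#   check_owner src/sys/tests/private/argus/log "<UID:GID>"
```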
---
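For stricter assertions (for example, that Kibana has loaded a specific plugin, or validation of ES document fields), add further queries and checks to `07_*.sh`.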
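
One possible starting point for such extra checks is the Elasticsearch `_count` API, which also accepts a `q=` query string so that only documents matching a field value are counted. The helpers below (`count_from_json`, `assert_count`) are hypothetical, and reading the threshold as "count grew and is at least the minimum" is an assumption about what the ≥4 rule means.

```shell
# Extract the integer "count" field from a _count response read on stdin.
count_from_json() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical assertion: succeeds when the count grew past the baseline
# and reached the minimum.
assert_count() {
  local label="$1" base="$2" current="$3" min="$4"
  if [ "$current" -gt "$base" ] && [ "$current" -ge "$min" ]; then
    echo "[OK] $label: $base -> $current"
  else
    echo "[ERR] $label: $base -> $current (need growth and >= $min)" >&2
    return 1
  fi
}

# Usage against a running stack (ES is published on localhost:9200;
# <field> and <value> are placeholders for a real document field):
#   n="$(curl -fsS 'http://localhost:9200/train-*/_count?q=<field>:<value>' \
#        | count_from_json)"
#   assert_count 'train-* with <field>' 0 "$n" 4
```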


@@ -124,7 +124,7 @@ services:
       - CLUSTER=local
       - RACK=dev
     volumes:
-      - ./private-node2/argus/agent/dev-yyrshare-uuuu10-ep2f-pod-0:/private/argus/agent/dev-yyrshare-uuuu10-ep2f-pod-0
+      - ./private-nodeb/argus/agent/dev-yyrshare-uuuu10-ep2f-pod-0:/private/argus/agent/dev-yyrshare-uuuu10-ep2f-pod-0
       - ../../agent/dist/argus-agent:/usr/local/bin/argus-agent:ro
       - ./scripts/node_entrypoint.sh:/usr/local/bin/node-entrypoint.sh:ro
       - ../../log/fluent-bit/build/start-fluent-bit.sh:/assets/start-fluent-bit.sh:ro


@@ -7,7 +7,7 @@ REPO_ROOT="$(cd "$TEST_ROOT/../../.." && pwd)"
 PRIVATE_CORE="$TEST_ROOT/private"
 PRIVATE_NODEA="$TEST_ROOT/private-nodea"
-PRIVATE_NODEB="$TEST_ROOT/private-node2"
+PRIVATE_NODEB="$TEST_ROOT/private-nodeb"
 TMP_DIR="$TEST_ROOT/tmp"
 source "$REPO_ROOT/scripts/common/build_user.sh"


@@ -79,7 +79,7 @@ pathlib.Path(sys.argv[2]).write_text(ip)
 PY
 NODE_JSON_A="$TEST_ROOT/private-nodea/argus/agent/$HOST_A/node.json"
-NODE_JSON_B="$TEST_ROOT/private-node2/argus/agent/$HOST_B/node.json"
+NODE_JSON_B="$TEST_ROOT/private-nodeb/argus/agent/$HOST_B/node.json"
 [[ -f "$NODE_JSON_A" ]] || { echo "[ERR] node.json missing for $HOST_A" >&2; exit 1; }
 [[ -f "$NODE_JSON_B" ]] || { echo "[ERR] node.json missing for $HOST_B" >&2; exit 1; }


@@ -11,7 +11,7 @@ HOST_A="dev-yyrshare-nbnyx10-cp2f-pod-0"
 HOST_B="dev-yyrshare-uuuu10-ep2f-pod-0"
 HEALTH_A="$TEST_ROOT/private-nodea/argus/agent/$HOST_A/health"
-HEALTH_B="$TEST_ROOT/private-node2/argus/agent/$HOST_B/health"
+HEALTH_B="$TEST_ROOT/private-nodeb/argus/agent/$HOST_B/health"
 write_health() {
   local dir="$1"; mkdir -p "$dir"


@@ -61,7 +61,7 @@ docker run -d \
   -e ES_HOST=es \
   -e ES_PORT=9200 \
   -p 2021:2020 \
-  -v "$TEST_ROOT/private-node2/argus/agent/dev-yyrshare-uuuu10-ep2f-pod-0:/private/argus/agent/dev-yyrshare-uuuu10-ep2f-pod-0" \
+  -v "$TEST_ROOT/private-nodeb/argus/agent/dev-yyrshare-uuuu10-ep2f-pod-0:/private/argus/agent/dev-yyrshare-uuuu10-ep2f-pod-0" \
   -v "$AGENT_BIN_PATH:/usr/local/bin/argus-agent:ro" \
   -v "$SCRIPT_DIR/node_entrypoint.sh:/usr/local/bin/node-entrypoint.sh:ro" \
   -v "$REPO_ROOT/src/log/fluent-bit/build/start-fluent-bit.sh:/assets/start-fluent-bit.sh:ro" \
@@ -92,4 +92,3 @@ done
 echo "[ERR] node-b did not update to IP 172.29.0.200 in time" >&2
 exit 1


@@ -27,12 +27,11 @@ if [[ -d "$TEST_ROOT/private-nodea" ]]; then
   docker run --rm -v "$TEST_ROOT/private-nodea:/target" ubuntu:24.04 chown -R "$(id -u):$(id -g)" /target >/dev/null 2>&1 || true
   rm -rf "$TEST_ROOT/private-nodea"
 fi
-if [[ -d "$TEST_ROOT/private-node2" ]]; then
-  docker run --rm -v "$TEST_ROOT/private-node2:/target" ubuntu:24.04 chown -R "$(id -u):$(id -g)" /target >/dev/null 2>&1 || true
-  rm -rf "$TEST_ROOT/private-node2"
+if [[ -d "$TEST_ROOT/private-nodeb" ]]; then
+  docker run --rm -v "$TEST_ROOT/private-nodeb:/target" ubuntu:24.04 chown -R "$(id -u):$(id -g)" /target >/dev/null 2>&1 || true
+  rm -rf "$TEST_ROOT/private-nodeb"
 fi
 rm -rf "$TEST_ROOT/tmp" "$TEST_ROOT/.env" || true
 echo "[OK] Cleaned up system E2E"