Compare commits

26 Commits

Author SHA1 Message Date
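9b9fade833 [#33] Use the newly built agent binary in the release package 2025-10-28 17:35:02 +08:00
824eddde67 [#31] Resolve Grafana dashboard UID conflict 2025-10-28 17:13:42 +08:00
d036da2d5e [#34] Add node image build for sys/tests 2025-10-28 17:12:42 +08:00
8e01264e3f [#29] Support system integration tests on GPU nodes 2025-10-28 14:12:49 +08:00
26c39604d5 [#29] Add client-plugin all-in-one-full to the repository and store plugin binaries in LFS 2025-10-27 17:00:03 +08:00
bd082866d8 [#29] docker compose: home page hyperlinks now support dynamically mapped backend service ports 2025-10-27 16:10:02 +08:00
cc0f9e5fed [#29] Non-GPU node test passes on lm2; add free host port detection; known issue: web page hyperlinks still use static ports 2025-10-27 15:46:59 +08:00
23f0f4fca4 [#29] Non-GPU node end-to-end test passes 2025-10-24 17:20:45 +08:00
2c799f2c1e [#29] Complete proxy to web/grafana/prom/alert; known issue: the web list page cannot fetch information 2025-10-24 12:21:33 +08:00
1d4208ed3c [#29] Skip pulling images that already exist 2025-10-23 14:43:45 +08:00
a1cdd05950 [#29] Integrated metric module end-to-end test passes; integrate web/alert module image builds 2025-10-22 12:08:34 +08:00
0b1ccbd87f [#29] Temporarily restore sys test 2025-10-21 11:48:55 +08:00
1d9a8ec695 [#14] build: merge metric image builds, remove the test node image, and use the base image directly 2025-10-21 11:20:45 +08:00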
sundapeng.sdp
b0d451cbe7 refactor: merge the metric e2e test flow into the sys/tests steps (test-gpu-node/check-service-installed);
refs #29
2025-10-21 09:31:29 +08:00
sundapeng.sdp
835e81282f refactor: merge the metric e2e test flow into the sys/tests steps (bootstrap/up/publish/installer);
refs #29
2025-10-20 17:59:17 +08:00
sundapeng.sdp
c4582c99bc feat: metric e2e service startup checks whether the current server has a usable GPU; if not, it skips creating the test-gpu-node container and testing the installer package;
refs #29
2025-10-20 15:30:07 +08:00
sundapeng.sdp
299765ed40 refactor: remove build arguments from the metric e2e docker-compose; remove build logic from the metric e2e service startup script and move it to the build module;
refs #29
2025-10-20 14:58:12 +08:00
sundapeng.sdp
a8bbf2d6e9 refactor: merge the image builds required by the metric module into the build module;
refs #29
2025-10-20 14:18:15 +08:00
d1b89c0cf6 dev_1.0.0_xuxt_2 Update the reverse proxy, image packaging, and README documentation (#28)
Co-authored-by: xiuting.xu <xiutingxt.xu@gmail.com>
Reviewed-on: #28
Reviewed-by: yuyr <yuyr@zgclab.edu.cn>
Reviewed-by: huhy <husteryezi@163.com>
Reviewed-by: sundapeng <sundp@mail.zgclab.edu.cn>
2025-10-20 09:45:32 +08:00
1a768bc837 dev_1.0.0_sundp_2 Improve the e2e deployment test flow of the Argus-metric module (#27)
Co-authored-by: sundapeng.sdp <sundapeng@hashdata.cn>
Reviewed-on: #27
Reviewed-by: yuyr <yuyr@zgclab.edu.cn>
Reviewed-by: xuxt <xuxt@zgclab.edu.cn>
2025-10-17 17:15:55 +08:00
31ccb0b1b8 Add sys/debug deployment test; improve agent dev/user/instance metadata extraction; improve sys/tests (#26)
Reviewed-on: #26
Reviewed-by: xuxt <xuxt@zgclab.edu.cn>
Reviewed-by: huhy <husteryezi@163.com>
Reviewed-by: sundapeng <sundp@mail.zgclab.edu.cn>
2025-10-16 17:16:07 +08:00
8fbe107ac9 dev_1.0.0_xuxt Complete web and alert module development and the module e2e tests (#21)
Co-authored-by: xiuting.xu <xiutingxt.xu@gmail.com>
Reviewed-on: #21
Reviewed-by: huhy <husteryezi@163.com>
Reviewed-by: sundapeng <sundp@mail.zgclab.edu.cn>
Reviewed-by: yuyr <yuyr@zgclab.edu.cn>
2025-10-14 10:20:45 +08:00
c098f1d3ce dev_1.0.0_sundp Complete the Metric module and the module e2e tests (#18)
Co-authored-by: sundapeng.sdp <sundapeng@hashdata.cn>
Reviewed-on: #18
Reviewed-by: xuxt <xuxt@zgclab.edu.cn>
Reviewed-by: yuyr <yuyr@zgclab.edu.cn>
Reviewed-by: huhy <husteryezi@163.com>
2025-10-11 17:15:06 +08:00
1e5e91b193 dev_1.0.0_yuyr_2: Resubmit the PR; add master/agent and system integration tests (#17)
Reviewed-on: #17
Reviewed-by: sundapeng <sundp@mail.zgclab.edu.cn>
Reviewed-by: xuxt <xuxt@zgclab.edu.cn>
2025-10-11 15:04:46 +08:00
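8a38d3d0b2 dev_1.0.0_yuyr Complete development, deployment, and testing of the log and bind modules (#8)
- [x] Complete the log module image build and the local end-to-end flow: write logs, collect, query;
- [x] Complete the bind module build;
- [x] Add a built-in domain/IP auto-update script that syncs using files under /private/argus/etc; containers write their IP on startup, and a scheduled task refreshes the DNS server IP and DNS rules;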

Co-authored-by: root <root@curious.host.com>
Reviewed-on: #8
Reviewed-by: sundapeng <sundp@mail.zgclab.edu.cn>
2025-09-22 16:39:38 +08:00
26e1c964ed init project 2025-09-15 11:00:03 +08:00
134 changed files with 761 additions and 7082 deletions

@ -1,150 +0,0 @@
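# ARGUS unified build script guide (build/build_images.sh)
This directory provides a single entry-point script, `build/build_images.sh`, which covers three common scenarios:
- System integration tests (src/sys/tests)
- Swarm system integration tests (src/sys/swarm_tests)
- Building offline installation packages (deployment_new: Server/Client-GPU)
This document also describes the UID/GID resolution rules, the image tag policy, common parameters, and the retry mechanism.
## Prerequisites
- Docker Engine ≥ 20.10 (≥ 23.x/24.x recommended)
- Docker Compose v2 (the `docker compose` subcommand)
- Optional: intranet build mirrors (`--intranet`)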
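## UID/GID rules (for the in-container user and volume ownership)
- Non-pkg builds (core/master/metric/web/alert/sys/gpu_bundle/cpu_bundle):
  - Read `configs/build_user.local.conf` → `configs/build_user.conf`;
  - Can be overridden by the environment variables `ARGUS_BUILD_UID` and `ARGUS_BUILD_GID`.
- pkg builds (`--only server_pkg`, `--only client_pkg`):
  - Read `configs/build_user.pkg.conf` (preferred) → `build_user.local.conf` → `build_user.conf`;
  - Can be overridden by environment variables.
- The CPU bundle explicitly follows the "non-pkg" chain (it does not read `build_user.pkg.conf`).
- Note: only the Docker layers that depend on UID/GID are rebuilt when those arguments change, so switching between build profiles does not produce a package with the wrong ownership.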
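The lookup order above can be sketched as a small shell function. This is an illustration only, not the repository's `scripts/common/build_user.sh`: the function name `resolve_build_user` and its arguments are hypothetical, and it assumes that the first config file found wins and that files use the `UID=…`/`GID=…` KEY=VALUE format with `#` comments.

```shell
# Hypothetical helper (not the repository's implementation).
# Usage: resolve_build_user <conf_dir> [default|pkg]   -> prints "UID:GID"
resolve_build_user() {
  local conf_dir="$1" profile="${2:-default}"
  local uid="" gid="" file key value

  # Candidate files in priority order; pkg builds consult the pkg file first.
  set -- "$conf_dir/build_user.local.conf" "$conf_dir/build_user.conf"
  if [ "$profile" = "pkg" ]; then
    set -- "$conf_dir/build_user.pkg.conf" "$@"
  fi

  for file in "$@"; do
    [ -f "$file" ] || continue
    # Assumption: the first existing file wins; later files are not merged.
    while IFS='=' read -r key value || [ -n "$key" ]; do
      key="$(printf '%s' "$key" | tr -d '[:space:]')"
      value="$(printf '%s' "$value" | tr -d '[:space:]')"
      case "$key" in
        UID) uid="$value" ;;
        GID) gid="$value" ;;
        *) ;;  # comments, blank lines, and unknown keys are ignored
      esac
    done < "$file"
    break
  done

  # Environment variables take precedence over file values.
  printf '%s:%s\n' "${ARGUS_BUILD_UID:-$uid}" "${ARGUS_BUILD_GID:-$gid}"
}
```

For example, `resolve_build_user configs pkg` would print the pair taken from `configs/build_user.pkg.conf` when that file exists, unless `ARGUS_BUILD_UID` or `ARGUS_BUILD_GID` is set in the environment.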
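## Image tag policy
- Non-pkg builds: output `:latest` by default.
- `--only server_pkg`: all images are output directly as `:<VERSION>` (`:latest` is not overwritten).
- `--only client_pkg`: the GPU bundle is output only as `:<VERSION>` (`:latest` is not overwritten).
- `--only cpu_bundle`: outputs only `:<VERSION>` by default; add `--tag-latest` to also tag `:latest` for compatibility with the default swarm_tests compose file.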
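As a rough summary of the policy above, the following sketch maps a build target to the tag or tags it would receive. The function name `tags_for_target` is hypothetical and the script itself does not contain such a helper. The `gpu_bundle` branch follows the script's non-pkg behavior of also tagging `:latest`, which this list does not spell out.

```shell
# Hypothetical helper: print the tag(s) an image receives for a given --only target.
# Usage: tags_for_target <target> [version] [tag_latest:true|false]
tags_for_target() {
  local target="$1" version="${2:-}" tag_latest="${3:-false}"
  case "$target" in
    server_pkg|client_pkg)
      [ -n "$version" ] || { echo "--version is required for $target" >&2; return 1; }
      echo "$version" ;;                 # version only; :latest is left alone
    cpu_bundle)
      [ -n "$version" ] || { echo "--version is required for $target" >&2; return 1; }
      echo "$version"
      if [ "$tag_latest" = "true" ]; then echo "latest"; fi ;;
    gpu_bundle)
      [ -n "$version" ] || { echo "--version is required for $target" >&2; return 1; }
      echo "$version"
      echo "latest" ;;                   # standalone (non-pkg) GPU bundle builds also tag :latest
    *)
      echo "latest" ;;                   # core, master, metric, web, alert, sys
  esac
}
```

Calling `tags_for_target cpu_bundle 20251114 true` prints both `20251114` and `latest`, one per line.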
## Default build targets without --only
When `--only` is not specified, the script builds the "base image set" (no bundles or installation packages):
- core: `argus-elasticsearch:latest`, `argus-kibana:latest`, `argus-bind9:latest`
- master: `argus-master:latest` (non-offline)
- metric: `argus-metric-ftp:latest`, `argus-metric-prometheus:latest`, `argus-metric-grafana:latest`
- web: `argus-web-frontend:latest`, `argus-web-proxy:latest`
- alert: `argus-alertmanager:latest`
- sys: `argus-sys-node:latest`, `argus-sys-metric-test-node:latest`, `argus-sys-metric-test-gpu-node:latest`
Note: the default tag is `:latest`, and UID/GID follow the "non-pkg" chain (`build_user.local.conf → build_user.conf`, overridable by environment variables).
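## General parameters
- `--intranet`: use intranet build arguments (enabled as needed in each Dockerfile).
- `--no-cache`: disable the Docker layer cache.
- `--only <list>`: comma-separated targets, for example `--only core,master,metric,web,alert`.
- `--version YYMMDD`: date tag for bundles and packages (required for cpu_bundle/gpu_bundle/server_pkg/client_pkg). The examples in this document use the eight-digit form, such as 20251114.
- `--client-semver X.Y.Z`: semantic version of the all-in-one-full client (optional).
- `--cuda VER`: CUDA base image version for the GPU bundle (default 12.2.2).
- `--tag-latest`: also tag `:latest` when building the CPU bundle.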
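The handling of a comma-separated `--only` list can be sketched as follows. This is a simplified illustration, not the script's actual parser: `parse_only` is a hypothetical name, and it prints the flags it would enable instead of setting script variables. The target names and the expansion of `server_pkg` and `all` follow the script's `case` statement.

```shell
# Hypothetical sketch: expand an --only list into "build_<target>=true" lines.
# Usage: parse_only core,master,web
parse_only() {
  local rest="$1," target t
  while [ -n "$rest" ]; do
    target="${rest%%,*}"   # text before the first comma
    rest="${rest#*,}"      # text after the first comma
    case "$target" in
      core|master|metric|web|alert|sys|gpu_bundle|cpu_bundle|client_pkg)
        echo "build_${target}=true" ;;
      server_pkg)
        # The server package needs the server images, so it enables them too.
        echo "build_server_pkg=true"
        for t in core master metric web alert; do echo "build_${t}=true"; done ;;
      all)
        for t in core master metric web alert sys; do echo "build_${t}=true"; done ;;
      *)
        echo "Unknown --only target: $target" >&2
        return 1 ;;
    esac
  done
}
```

Splitting with parameter expansion rather than an unquoted `$list` avoids accidental glob expansion if the argument contains `*`.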
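## Automatic retry
- A failed single-image build is retried automatically (3 attempts by default, 5 s apart).
- The final attempt automatically uses `DOCKER_BUILDKIT=0`, which mitigates "failed to receive status: context canceled".
- Tunable through the `ARGUS_BUILD_RETRIES` and `ARGUS_BUILD_RETRY_DELAY` environment variables.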
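The retry policy described above can be sketched as a generic wrapper. The name `build_with_retry` is hypothetical; the script implements the same idea inline in `build_image`. Here the command to run is passed as arguments, so the fallback variable can be set without `eval`.

```shell
# Hypothetical sketch of the retry policy.
# Usage: build_with_retry docker build -t some-image:latest .
build_with_retry() {
  local tries="${ARGUS_BUILD_RETRIES:-3}"
  local delay="${ARGUS_BUILD_RETRY_DELAY:-5}"
  local attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if [ "$attempt" -eq "$tries" ]; then
      # Final attempt: fall back to the legacy builder.
      echo "  Attempt ${attempt}/${tries} (fallback: DOCKER_BUILDKIT=0)" >&2
      if DOCKER_BUILDKIT=0 "$@"; then return 0; fi
    else
      echo "  Attempt ${attempt}/${tries}" >&2
      if "$@"; then return 0; fi
    fi
    if [ "$attempt" -lt "$tries" ]; then
      echo "  Retrying in ${delay}s..." >&2
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  echo "Failed after ${tries} attempts" >&2
  return 1
}
```

With `ARGUS_BUILD_RETRIES=1` the single attempt is also the final one, so it runs with `DOCKER_BUILDKIT=0`.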
---
## Scenario 1: System integration tests (src/sys/tests)
Builds the images used for system-level end-to-end tests (default `:latest`).
Example:
```
# Build the core images and the surrounding modules
./build/build_images.sh --only core,master,metric,web,alert,sys
```
Output:
- Local images: `argus-elasticsearch:latest`, `argus-kibana:latest`, `argus-master:latest`, `argus-metric-ftp:latest`, `argus-metric-prometheus:latest`, `argus-metric-grafana:latest`, `argus-alertmanager:latest`, `argus-web-frontend:latest`, `argus-web-proxy:latest`, `argus-sys-node:latest`, and so on.
Notes:
- UID/GID are read from `build_user.local.conf → build_user.conf` (or overridden by environment variables).
- For running sys/tests, see `src/sys/tests/README.md`.
---
## Scenario 2: Swarm system integration tests (src/sys/swarm_tests)
Requires the server images plus the CPU node bundle image.
Steps:
1) Build the server images (default `:latest`)
```
./build/build_images.sh --only core,master,metric,web,alert
```
2) Build the CPU bundle (directly FROM ubuntu:22.04)
```
# Output the version tag only
./build/build_images.sh --only cpu_bundle --version 20251114
# For compatibility with the swarm_tests default of :latest
./build/build_images.sh --only cpu_bundle --version 20251114 --tag-latest
```
3) Run the Swarm tests
```
cd src/sys/swarm_tests
# If :latest was not tagged, set the image explicitly first
export NODE_BUNDLE_IMAGE_TAG=argus-sys-metric-test-node-bundle:20251114
./scripts/01_server_up.sh
./scripts/02_wait_ready.sh
./scripts/03_nodes_up.sh
./scripts/04_metric_verify.sh # verify Prometheus/Grafana/nodes.json and the log pipeline
./scripts/99_down.sh # tear down
```
Output:
- Local images: `argus-*:latest`, `argus-sys-metric-test-node-bundle:20251114` (or latest)
- `swarm_tests/private-*`: persistent runtime files.
Notes:
- The CPU bundle build user follows the "non-pkg" chain (local.conf → conf).
- `04_metric_verify.sh` has built-in Fluent Bit startup and configuration fix-up logic; if the service is occasionally not ready, rerunning the script once usually passes.
---
## Scenario 3: Build offline installation packages (deployment_new)
Both the Server and Client-GPU packages use "direct version output": only the `:<VERSION>` tag is produced and `:latest` is left untouched.
1) Server package
```
./build/build_images.sh --only server_pkg --version 20251114
```
Output:
- Local images: `argus-<module>:20251114` (`:latest` is not touched)
- Package: `deployment_new/artifact/server/20251114/` and `server_20251114.tar.gz`
- Package contents: per-image tar.gz files, compose/.env.example, scripts (config/install/selfcheck/diagnose, etc.), docs, manifest/checksums.
2) Client-GPU package
```
# Also builds the GPU bundle (only :<VERSION>, :latest untouched) and generates the client package
./build/build_images.sh --only client_pkg --version 20251114 \
  --client-semver 1.44.0 --cuda 12.2.2
```
Output:
- Local image: `argus-sys-metric-test-node-bundle-gpu:20251114`
- Package: `deployment_new/artifact/client_gpu/20251114/` and `client_gpu_20251114.tar.gz`
- Package contents: GPU bundle image tar.gz, busybox.tar, compose/.env.example, scripts (config/install/uninstall), docs, manifest/checksums.
Notes:
- pkg builds use the UID/GID from `configs/build_user.pkg.conf` (overridable by environment variables).
- `PKG_VERSION=<VERSION>` in the packaged `.env.example` matches the image tag exactly.
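One way to confirm the last point after unpacking a package is a small consistency check like the one below. It is a sketch, not part of the packaging scripts: `check_pkg_version` is a hypothetical name, and it assumes the env file contains a plain `PKG_VERSION=<VERSION>` line.

```shell
# Hypothetical check: confirm that PKG_VERSION in an env file matches the expected image tag.
# Usage: check_pkg_version <env_file> <expected_version>
check_pkg_version() {
  local env_file="$1" expected="$2" actual
  # Take the last uncommented PKG_VERSION=... assignment in the file.
  actual="$(sed -n 's/^PKG_VERSION=//p' "$env_file" | tail -n 1)"
  if [ "$actual" = "$expected" ]; then
    echo "PKG_VERSION matches: $actual"
  else
    echo "PKG_VERSION mismatch: env has '${actual}', expected '${expected}'" >&2
    return 1
  fi
}
```

For example, `check_pkg_version compose/.env.example 20251114` would succeed for a package built with `--version 20251114`; the path shown is illustrative.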
---
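## FAQ
- The build reports `failed to receive status: context canceled`?
  - Each image build is already retried several times, with BuildKit disabled on the final attempt. Try again with `--intranet` or `--no-cache`, or run `docker builder prune -f` first and then retry.
- If I run a non-pkg (latest) build first and then a pkg (version) build, can the wrong package be produced?
  - No. Layers that involve UID/GID are rebuilt because their arguments change, while other layers are reused from cache. The ownership and runtime account in the final pkg artifacts follow `build_user.pkg.conf`.
- swarm_tests uses `:latest` by default, but I only built the `:<VERSION>` CPU bundle. What should I do?
  - Run `export NODE_BUNDLE_IMAGE_TAG=argus-sys-metric-test-node-bundle:<VERSION>` before starting, or add `--tag-latest` when building.
---
For further automation (for example, generating a BUILD_SUMMARY.txt that summarizes image digests and build parameters), a step can be appended to the pkg output stage.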

@ -12,11 +12,7 @@ Options:
--master-offline Build master offline image (requires src/master/offline_wheels.tar.gz)
--metric Build metric module images (ftp, prometheus, grafana, test nodes)
--no-cache Build all images without using Docker layer cache
--only LIST Comma-separated targets to build: core,master,metric,web,alert,sys,gpu_bundle,cpu_bundle,server_pkg,client_pkg,all
--version DATE Date tag used by gpu_bundle/server_pkg/client_pkg (e.g. 20251112)
--client-semver X.Y.Z Override client semver used in all-in-one-full artifact (optional)
--cuda VER CUDA runtime version for NVIDIA base (default: 12.2.2)
--tag-latest Also tag bundle image as :latest (for cpu_bundle only; default off)
--only LIST Comma-separated targets to build: core,master,metric,web,alert,sys,all
-h, --help Show this help message
Examples:
@ -36,20 +32,8 @@ build_metric=true
build_web=true
build_alert=true
build_sys=true
build_gpu_bundle=false
build_cpu_bundle=false
build_server_pkg=false
build_client_pkg=false
need_bind_image=true
need_metric_ftp=true
no_cache=false
bundle_date=""
client_semver=""
cuda_ver="12.2.2"
DEFAULT_IMAGE_TAG="latest"
tag_latest=false
while [[ $# -gt 0 ]]; do
case $1 in
--intranet)
@ -79,7 +63,7 @@ while [[ $# -gt 0 ]]; do
fi
sel="$2"; shift 2
# reset all, then enable selected
build_core=false; build_master=false; build_metric=false; build_web=false; build_alert=false; build_sys=false; build_gpu_bundle=false; build_cpu_bundle=false; build_server_pkg=false; build_client_pkg=false
build_core=false; build_master=false; build_metric=false; build_web=false; build_alert=false; build_sys=false
IFS=',' read -ra parts <<< "$sel"
for p in "${parts[@]}"; do
case "$p" in
@ -89,31 +73,11 @@ while [[ $# -gt 0 ]]; do
web) build_web=true ;;
alert) build_alert=true ;;
sys) build_sys=true ;;
gpu_bundle) build_gpu_bundle=true ;;
cpu_bundle) build_cpu_bundle=true ;;
server_pkg) build_server_pkg=true; build_core=true; build_master=true; build_metric=true; build_web=true; build_alert=true ;;
client_pkg) build_client_pkg=true ;;
all) build_core=true; build_master=true; build_metric=true; build_web=true; build_alert=true; build_sys=true ;;
*) echo "Unknown --only target: $p" >&2; exit 1 ;;
esac
done
;;
--version)
if [[ -z ${2:-} ]]; then echo "--version requires a value like 20251112" >&2; exit 1; fi
bundle_date="$2"; shift 2
;;
--client-semver)
if [[ -z ${2:-} ]]; then echo "--client-semver requires a value like 1.43.0" >&2; exit 1; fi
client_semver="$2"; shift 2
;;
--cuda)
if [[ -z ${2:-} ]]; then echo "--cuda requires a value like 12.2.2" >&2; exit 1; fi
cuda_ver="$2"; shift 2
;;
--tag-latest)
tag_latest=true
shift
;;
-h|--help)
show_help
exit 0
@ -126,11 +90,6 @@ while [[ $# -gt 0 ]]; do
esac
done
if [[ "$build_server_pkg" == true ]]; then
need_bind_image=false
need_metric_ftp=false
fi
root="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
. "$root/scripts/common/build_user.sh"
@ -142,16 +101,6 @@ fi
cd "$root"
# Set default image tag policy before building
if [[ "$build_server_pkg" == true ]]; then
DEFAULT_IMAGE_TAG="${bundle_date:-latest}"
fi
# Select build user profile for pkg vs default
if [[ "$build_server_pkg" == true || "$build_client_pkg" == true ]]; then
export ARGUS_BUILD_PROFILE=pkg
fi
load_build_user
build_args+=("--build-arg" "ARGUS_BUILD_UID=${ARGUS_BUILD_UID}" "--build-arg" "ARGUS_BUILD_GID=${ARGUS_BUILD_GID}")
@ -214,31 +163,13 @@ build_image() {
echo " Tag: $tag"
echo " Context: $context"
local tries=${ARGUS_BUILD_RETRIES:-3}
local delay=${ARGUS_BUILD_RETRY_DELAY:-5}
local attempt=1
while (( attempt <= tries )); do
local prefix=""
if (( attempt == tries )); then
# final attempt: disable BuildKit to avoid docker/dockerfile front-end pulls
prefix="DOCKER_BUILDKIT=0"
echo " Attempt ${attempt}/${tries} (fallback: DOCKER_BUILDKIT=0)"
else
echo " Attempt ${attempt}/${tries}"
fi
if eval $prefix docker build "${build_args[@]}" "${extra_args[@]}" -f "$dockerfile_path" -t "$tag" "$context"; then
echo "$image_name image built successfully"
return 0
fi
echo "⚠️ Build failed for $image_name (attempt ${attempt}/${tries})."
if (( attempt < tries )); then
echo " Retrying in ${delay}s..."
sleep "$delay"
fi
attempt=$((attempt+1))
done
echo "❌ Failed to build $image_name image after ${tries} attempts"
return 1
if docker build "${build_args[@]}" "${extra_args[@]}" -f "$dockerfile_path" -t "$tag" "$context"; then
echo "$image_name image built successfully"
return 0
else
echo "❌ Failed to build $image_name image"
return 1
fi
}
pull_base_image() {
@ -272,385 +203,27 @@ pull_base_image() {
images_built=()
build_failed=false
build_gpu_bundle_image() {
local date_tag="$1" # e.g. 20251112
local cuda_ver_local="$2" # e.g. 12.2.2
local client_ver="$3" # semver like 1.43.0
if [[ -z "$date_tag" ]]; then
echo "❌ gpu_bundle requires --version YYMMDD (e.g. 20251112)" >&2
return 1
fi
# sanitize cuda version (trim trailing dots like '12.2.')
while [[ "$cuda_ver_local" == *"." ]]; do cuda_ver_local="${cuda_ver_local%.}"; done
# Resolve effective CUDA base tag
local resolve_cuda_base_tag
resolve_cuda_base_tag() {
local want="$1" # can be 12, 12.2 or 12.2.2
local major minor patch
if [[ "$want" =~ ^([0-9]+)\.([0-9]+)\.([0-9]+)$ ]]; then
major="${BASH_REMATCH[1]}"; minor="${BASH_REMATCH[2]}"; patch="${BASH_REMATCH[3]}"
echo "nvidia/cuda:${major}.${minor}.${patch}-runtime-ubuntu22.04"; return 0
elif [[ "$want" =~ ^([0-9]+)\.([0-9]+)$ ]]; then
major="${BASH_REMATCH[1]}"; minor="${BASH_REMATCH[2]}"
# try to find best local patch for major.minor
local best
best=$(docker images --format '{{.Repository}}:{{.Tag}}' nvidia/cuda 2>/dev/null | \
grep -E "^nvidia/cuda:${major}\.${minor}\\.[0-9]+-runtime-ubuntu22\.04$" | \
sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.)([0-9]+)-runtime-ubuntu22\.04$#\1\2#g' | \
sort -V | tail -n1 || true)
if [[ -n "$best" ]]; then
echo "nvidia/cuda:${best}-runtime-ubuntu22.04"; return 0
fi
# fallback patch if none local
echo "nvidia/cuda:${major}.${minor}.2-runtime-ubuntu22.04"; return 0
elif [[ "$want" =~ ^([0-9]+)$ ]]; then
major="${BASH_REMATCH[1]}"
# try to find best local for this major
local best
best=$(docker images --format '{{.Repository}}:{{.Tag}}' nvidia/cuda 2>/dev/null | \
grep -E "^nvidia/cuda:${major}\\.[0-9]+\\.[0-9]+-runtime-ubuntu22\.04$" | \
sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.[0-9]+)-runtime-ubuntu22\.04$#\1#g' | \
sort -V | tail -n1 || true)
if [[ -n "$best" ]]; then
echo "nvidia/cuda:${best}-runtime-ubuntu22.04"; return 0
fi
echo "nvidia/cuda:${major}.2.2-runtime-ubuntu22.04"; return 0
else
# invalid format, fallback to default
echo "nvidia/cuda:12.2.2-runtime-ubuntu22.04"; return 0
fi
}
local base_image
base_image=$(resolve_cuda_base_tag "$cuda_ver_local")
echo
echo "🔧 Preparing one-click GPU bundle build"
echo " CUDA runtime base: ${base_image}"
echo " Bundle tag : ${date_tag}"
# 1) Ensure NVIDIA base image (skip pull if local)
if ! pull_base_image "$base_image"; then
# try once more with default if resolution failed
if ! pull_base_image "nvidia/cuda:12.2.2-runtime-ubuntu22.04"; then
return 1
else
base_image="nvidia/cuda:12.2.2-runtime-ubuntu22.04"
fi
fi
# 2) Build latest argus-agent from source
echo "\n🛠 Building argus-agent from src/agent"
pushd "$root/src/agent" >/dev/null
if ! bash scripts/build_binary.sh; then
echo "❌ argus-agent build failed" >&2
popd >/dev/null
return 1
fi
if [[ ! -f "dist/argus-agent" ]]; then
echo "❌ argus-agent binary missing after build" >&2
popd >/dev/null
return 1
fi
popd >/dev/null
# 3) Inject agent into all-in-one-full plugin and package artifact
local aio_root="$root/src/metric/client-plugins/all-in-one-full"
local agent_bin_src="$root/src/agent/dist/argus-agent"
local agent_bin_dst="$aio_root/plugins/argus-agent/bin/argus-agent"
echo "\n📦 Updating all-in-one-full agent binary → $agent_bin_dst"
cp -f "$agent_bin_src" "$agent_bin_dst"
chmod +x "$agent_bin_dst" || true
pushd "$aio_root" >/dev/null
local prev_version
prev_version="$(cat config/VERSION 2>/dev/null || echo "1.0.0")"
local use_version="$prev_version"
if [[ -n "$client_semver" ]]; then
echo "${client_semver}" > config/VERSION
use_version="$client_semver"
fi
echo " Packaging all-in-one-full artifact version: $use_version"
if ! bash scripts/package_artifact.sh --force; then
echo "❌ package_artifact.sh failed" >&2
# restore VERSION if changed
if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
popd >/dev/null
return 1
fi
local artifact_dir="$aio_root/artifact/$use_version"
local artifact_tar
artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
if [[ -z "$artifact_tar" ]]; then
echo " No argus-metric_*.tar.gz found; invoking publish_artifact.sh to assemble..."
local owner="$(id -u):$(id -g)"
if ! bash scripts/publish_artifact.sh "$use_version" --output-dir "$artifact_dir" --owner "$owner"; then
echo "❌ publish_artifact.sh failed" >&2
if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
popd >/dev/null
return 1
fi
artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
fi
if [[ -z "$artifact_tar" ]]; then
echo "❌ artifact tar not found under $artifact_dir" >&2
if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
popd >/dev/null
return 1
fi
# restore VERSION if changed (keep filesystem clean)
if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
popd >/dev/null
# 4) Stage docker build context
local bundle_ctx="$root/src/bundle/gpu-node-bundle/.build-$date_tag"
echo "\n🧰 Staging docker build context: $bundle_ctx"
rm -rf "$bundle_ctx"
mkdir -p "$bundle_ctx/bundle" "$bundle_ctx/private"
cp "$root/src/bundle/gpu-node-bundle/Dockerfile" "$bundle_ctx/"
cp "$root/src/bundle/gpu-node-bundle/node-bootstrap.sh" "$bundle_ctx/"
cp "$root/src/bundle/gpu-node-bundle/health-watcher.sh" "$bundle_ctx/"
# bundle tar
cp "$artifact_tar" "$bundle_ctx/bundle/"
# offline fluent-bit assets (optional but useful)
if [[ -d "$root/src/log/fluent-bit/build/etc" ]]; then
cp -r "$root/src/log/fluent-bit/build/etc" "$bundle_ctx/private/"
fi
if [[ -d "$root/src/log/fluent-bit/build/packages" ]]; then
cp -r "$root/src/log/fluent-bit/build/packages" "$bundle_ctx/private/"
fi
if [[ -f "$root/src/log/fluent-bit/build/start-fluent-bit.sh" ]]; then
cp "$root/src/log/fluent-bit/build/start-fluent-bit.sh" "$bundle_ctx/private/"
fi
# 5) Build the final bundle image (directly from NVIDIA base)
local image_tag="argus-sys-metric-test-node-bundle-gpu:${date_tag}"
echo "\n🔄 Building GPU Bundle image"
if build_image "GPU Bundle" "$bundle_ctx/Dockerfile" "$image_tag" "$bundle_ctx" \
--build-arg CUDA_VER="$(echo "$base_image" | sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.[0-9]+)-runtime-ubuntu22\.04$#\1#')" \
--build-arg CLIENT_VER="$use_version" \
--build-arg BUNDLE_DATE="$date_tag"; then
images_built+=("$image_tag")
# In non-pkg mode, also tag latest for convenience
if [[ "${ARGUS_PKG_BUILD:-0}" != "1" ]]; then
docker tag "$image_tag" argus-sys-metric-test-node-bundle-gpu:latest >/dev/null 2>&1 || true
fi
return 0
else
return 1
fi
}
# Tag helper: ensure :<date_tag> exists for a list of repos
ensure_version_tags() {
local date_tag="$1"; shift
local repos=("$@")
for repo in "${repos[@]}"; do
if docker image inspect "$repo:$date_tag" >/dev/null 2>&1; then
:
elif docker image inspect "$repo:latest" >/dev/null 2>&1; then
docker tag "$repo:latest" "$repo:$date_tag" || true
else
echo "❌ missing image for tagging: $repo (need :latest or :$date_tag)" >&2
return 1
fi
done
return 0
}
# Build server package after images are built
build_server_pkg_bundle() {
local date_tag="$1"
if [[ -z "$date_tag" ]]; then
echo "❌ server_pkg requires --version YYMMDD" >&2
return 1
fi
local repos=(
argus-master argus-elasticsearch argus-kibana \
argus-metric-prometheus argus-metric-grafana \
argus-alertmanager argus-web-frontend argus-web-proxy
)
echo "\n🔖 Verifying server images with :$date_tag and collecting digests (Bind/FTP excluded; relying on Docker DNS aliases)"
for repo in "${repos[@]}"; do
if ! docker image inspect "$repo:$date_tag" >/dev/null 2>&1; then
echo "❌ required image missing: $repo:$date_tag (build phase should have produced it)" >&2
return 1
fi
done
# Optional: show digests
for repo in "${repos[@]}"; do
local digest
digest=$(docker images --digests --format '{{.Repository}}:{{.Tag}} {{.Digest}}' | awk -v r="$repo:$date_tag" '$1==r{print $2}' | head -n1)
printf ' • %s@%s\n' "$repo:$date_tag" "${digest:-<none>}"
done
echo "\n📦 Building server package via deployment_new/build/make_server_package.sh --version $date_tag"
if ! "$root/deployment_new/build/make_server_package.sh" --version "$date_tag"; then
echo "❌ make_server_package.sh failed" >&2
return 1
fi
return 0
}
# Build client package: ensure gpu bundle image exists, then package client_gpu
build_client_pkg_bundle() {
local date_tag="$1"
local semver="$2"
local cuda="$3"
if [[ -z "$date_tag" ]]; then
echo "❌ client_pkg requires --version YYMMDD" >&2
return 1
fi
local bundle_tag="argus-sys-metric-test-node-bundle-gpu:${date_tag}"
if ! docker image inspect "$bundle_tag" >/dev/null 2>&1; then
echo "\n🧩 GPU bundle image $bundle_tag missing; building it first..."
ARGUS_PKG_BUILD=1
export ARGUS_PKG_BUILD
if ! build_gpu_bundle_image "$date_tag" "$cuda" "$semver"; then
return 1
fi
else
echo "\n✅ Using existing GPU bundle image: $bundle_tag"
fi
echo "\n📦 Building client GPU package via deployment_new/build/make_client_gpu_package.sh --version $date_tag --image $bundle_tag"
if ! "$root/deployment_new/build/make_client_gpu_package.sh" --version "$date_tag" --image "$bundle_tag"; then
echo "❌ make_client_gpu_package.sh failed" >&2
return 1
fi
return 0
}
# Build CPU bundle image directly FROM ubuntu:22.04 (no intermediate base)
build_cpu_bundle_image() {
local date_tag="$1" # e.g. 20251113
local client_ver_in="$2" # semver like 1.43.0 (optional)
local want_tag_latest="$3" # true/false
if [[ -z "$date_tag" ]]; then
echo "❌ cpu_bundle requires --version YYMMDD" >&2
return 1
fi
echo "\n🔧 Preparing one-click CPU bundle build"
echo " Base: ubuntu:22.04"
echo " Bundle tag: ${date_tag}"
# 1) Build latest argus-agent from source
echo "\n🛠 Building argus-agent from src/agent"
pushd "$root/src/agent" >/dev/null
if ! bash scripts/build_binary.sh; then
echo "❌ argus-agent build failed" >&2
popd >/dev/null
return 1
fi
if [[ ! -f "dist/argus-agent" ]]; then
echo "❌ argus-agent binary missing after build" >&2
popd >/dev/null
return 1
fi
popd >/dev/null
# 2) Inject agent into all-in-one-full plugin and package artifact
local aio_root="$root/src/metric/client-plugins/all-in-one-full"
local agent_bin_src="$root/src/agent/dist/argus-agent"
local agent_bin_dst="$aio_root/plugins/argus-agent/bin/argus-agent"
echo "\n📦 Updating all-in-one-full agent binary → $agent_bin_dst"
cp -f "$agent_bin_src" "$agent_bin_dst"
chmod +x "$agent_bin_dst" || true
pushd "$aio_root" >/dev/null
local prev_version use_version
prev_version="$(cat config/VERSION 2>/dev/null || echo "1.0.0")"
use_version="$prev_version"
if [[ -n "$client_ver_in" ]]; then
echo "$client_ver_in" > config/VERSION
use_version="$client_ver_in"
fi
echo " Packaging all-in-one-full artifact: version=$use_version"
if ! bash scripts/package_artifact.sh --force; then
echo "❌ package_artifact.sh failed" >&2
[[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
popd >/dev/null
return 1
fi
local artifact_dir="$aio_root/artifact/$use_version"
local artifact_tar
artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
if [[ -z "$artifact_tar" ]]; then
echo " No argus-metric_*.tar.gz found; invoking publish_artifact.sh ..."
local owner="$(id -u):$(id -g)"
if ! bash scripts/publish_artifact.sh "$use_version" --output-dir "$artifact_dir" --owner "$owner"; then
echo "❌ publish_artifact.sh failed" >&2
[[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
popd >/dev/null
return 1
fi
artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
fi
[[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
popd >/dev/null
# 3) Stage docker build context
local bundle_ctx="$root/src/bundle/cpu-node-bundle/.build-$date_tag"
echo "\n🧰 Staging docker build context: $bundle_ctx"
rm -rf "$bundle_ctx"
mkdir -p "$bundle_ctx/bundle" "$bundle_ctx/private"
cp "$root/src/bundle/cpu-node-bundle/Dockerfile" "$bundle_ctx/"
cp "$root/src/bundle/cpu-node-bundle/node-bootstrap.sh" "$bundle_ctx/"
cp "$root/src/bundle/cpu-node-bundle/health-watcher.sh" "$bundle_ctx/"
# bundle tar
cp "$artifact_tar" "$bundle_ctx/bundle/"
# offline fluent-bit assets
if [[ -d "$root/src/log/fluent-bit/build/etc" ]]; then
cp -r "$root/src/log/fluent-bit/build/etc" "$bundle_ctx/private/"
fi
if [[ -d "$root/src/log/fluent-bit/build/packages" ]]; then
cp -r "$root/src/log/fluent-bit/build/packages" "$bundle_ctx/private/"
fi
if [[ -f "$root/src/log/fluent-bit/build/start-fluent-bit.sh" ]]; then
cp "$root/src/log/fluent-bit/build/start-fluent-bit.sh" "$bundle_ctx/private/"
fi
# 4) Build final bundle image
local image_tag="argus-sys-metric-test-node-bundle:${date_tag}"
echo "\n🔄 Building CPU Bundle image"
if build_image "CPU Bundle" "$bundle_ctx/Dockerfile" "$image_tag" "$bundle_ctx"; then
images_built+=("$image_tag")
if [[ "$want_tag_latest" == "true" ]]; then
docker tag "$image_tag" argus-sys-metric-test-node-bundle:latest >/dev/null 2>&1 || true
fi
return 0
else
return 1
fi
}
if [[ "$build_core" == true ]]; then
if build_image "Elasticsearch" "src/log/elasticsearch/build/Dockerfile" "argus-elasticsearch:${DEFAULT_IMAGE_TAG}"; then
images_built+=("argus-elasticsearch:${DEFAULT_IMAGE_TAG}")
if build_image "Elasticsearch" "src/log/elasticsearch/build/Dockerfile" "argus-elasticsearch:latest"; then
images_built+=("argus-elasticsearch:latest")
else
build_failed=true
fi
echo ""
if build_image "Kibana" "src/log/kibana/build/Dockerfile" "argus-kibana:${DEFAULT_IMAGE_TAG}"; then
images_built+=("argus-kibana:${DEFAULT_IMAGE_TAG}")
if build_image "Kibana" "src/log/kibana/build/Dockerfile" "argus-kibana:latest"; then
images_built+=("argus-kibana:latest")
else
build_failed=true
fi
echo ""
if [[ "$need_bind_image" == true ]]; then
if build_image "BIND9" "src/bind/build/Dockerfile" "argus-bind9:${DEFAULT_IMAGE_TAG}"; then
images_built+=("argus-bind9:${DEFAULT_IMAGE_TAG}")
else
build_failed=true
fi
if build_image "BIND9" "src/bind/build/Dockerfile" "argus-bind9:latest"; then
images_built+=("argus-bind9:latest")
else
build_failed=true
fi
fi
@ -660,7 +233,7 @@ if [[ "$build_master" == true ]]; then
echo ""
echo "🔄 Building Master image..."
pushd "$master_root" >/dev/null
master_args=("--tag" "argus-master:${DEFAULT_IMAGE_TAG}")
master_args=("--tag" "argus-master:latest")
if [[ "$use_intranet" == true ]]; then
master_args+=("--intranet")
fi
@ -674,7 +247,7 @@ if [[ "$build_master" == true ]]; then
if [[ "$build_master_offline" == true ]]; then
images_built+=("argus-master:offline")
else
images_built+=("argus-master:${DEFAULT_IMAGE_TAG}")
images_built+=("argus-master:latest")
fi
else
build_failed=true
@ -687,27 +260,21 @@ if [[ "$build_metric" == true ]]; then
echo "Building Metric module images..."
metric_base_images=(
"ubuntu:22.04"
"ubuntu/prometheus:3-24.04_stable"
"grafana/grafana:11.1.0"
)
if [[ "$need_metric_ftp" == true ]]; then
metric_base_images+=("ubuntu:22.04")
fi
for base_image in "${metric_base_images[@]}"; do
if ! pull_base_image "$base_image"; then
build_failed=true
fi
done
metric_builds=()
if [[ "$need_metric_ftp" == true ]]; then
metric_builds+=("Metric FTP|src/metric/ftp/build/Dockerfile|argus-metric-ftp:${DEFAULT_IMAGE_TAG}|src/metric/ftp/build")
fi
metric_builds+=(
"Metric Prometheus|src/metric/prometheus/build/Dockerfile|argus-metric-prometheus:${DEFAULT_IMAGE_TAG}|src/metric/prometheus/build"
"Metric Grafana|src/metric/grafana/build/Dockerfile|argus-metric-grafana:${DEFAULT_IMAGE_TAG}|src/metric/grafana/build"
metric_builds=(
"Metric FTP|src/metric/ftp/build/Dockerfile|argus-metric-ftp:latest|src/metric/ftp/build"
"Metric Prometheus|src/metric/prometheus/build/Dockerfile|argus-metric-prometheus:latest|src/metric/prometheus/build"
"Metric Grafana|src/metric/grafana/build/Dockerfile|argus-metric-grafana:latest|src/metric/grafana/build"
)
for build_spec in "${metric_builds[@]}"; do
@ -779,8 +346,8 @@ if [[ "$build_web" == true || "$build_alert" == true ]]; then
if [[ "$build_web" == true ]]; then
web_builds=(
"Web Frontend|src/web/build_tools/frontend/Dockerfile|argus-web-frontend:${DEFAULT_IMAGE_TAG}|."
"Web Proxy|src/web/build_tools/proxy/Dockerfile|argus-web-proxy:${DEFAULT_IMAGE_TAG}|."
"Web Frontend|src/web/build_tools/frontend/Dockerfile|argus-web-frontend:latest|."
"Web Proxy|src/web/build_tools/proxy/Dockerfile|argus-web-proxy:latest|."
)
for build_spec in "${web_builds[@]}"; do
IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
@ -795,7 +362,7 @@ if [[ "$build_web" == true || "$build_alert" == true ]]; then
if [[ "$build_alert" == true ]]; then
alert_builds=(
"Alertmanager|src/alert/alertmanager/build/Dockerfile|argus-alertmanager:${DEFAULT_IMAGE_TAG}|."
"Alertmanager|src/alert/alertmanager/build/Dockerfile|argus-alertmanager:latest|."
)
for build_spec in "${alert_builds[@]}"; do
IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
@ -809,49 +376,6 @@ if [[ "$build_web" == true || "$build_alert" == true ]]; then
fi
fi
# =======================================
# One-click GPU bundle (direct NVIDIA base)
# =======================================
if [[ "$build_gpu_bundle" == true ]]; then
echo ""
echo "Building one-click GPU bundle image..."
if ! build_gpu_bundle_image "$bundle_date" "$cuda_ver" "$client_semver"; then
build_failed=true
fi
fi
# =======================================
# One-click CPU bundle (from ubuntu:22.04)
# =======================================
if [[ "$build_cpu_bundle" == true ]]; then
echo ""
echo "Building one-click CPU bundle image..."
if ! build_cpu_bundle_image "${bundle_date}" "${client_semver}" "${tag_latest}"; then
build_failed=true
fi
fi
# =======================================
# One-click Server/Client packaging
# =======================================
if [[ "$build_server_pkg" == true ]]; then
echo ""
echo "🧳 Building one-click Server package..."
if ! build_server_pkg_bundle "${bundle_date}"; then
build_failed=true
fi
fi
if [[ "$build_client_pkg" == true ]]; then
echo ""
echo "🧳 Building one-click Client-GPU package..."
if ! build_client_pkg_bundle "${bundle_date}" "${client_semver}" "${cuda_ver}"; then
build_failed=true
fi
fi
echo "======================================="
echo "📦 Build Summary"
echo "======================================="

@ -1,6 +0,0 @@
# Default build-time UID/GID for Argus images
# Override by creating configs/build_user.local.conf with the same format.
# Syntax: KEY=VALUE, supports UID/GID only. Whitespace and lines starting with # are ignored.
UID=2133
GID=2015

@ -1 +0,0 @@
artifact/

@ -1,14 +0,0 @@
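# deployment_new
This directory holds the new deployment packaging and delivery implementation (it does not affect the existing `deployment/`).
Milestone M1 (current implementation)
- `build/make_server_package.sh`: builds the Server package (per-service image tar.gz files, compose, .env.example, docs, private skeleton, manifest/checksums, final tar.gz).
- `build/make_client_gpu_package.sh`: builds the Client-GPU package (GPU bundle image tar.gz, busybox.tar, compose, .env.example, docs, private skeleton, manifest/checksums, final tar.gz).
Templates
- `templates/server/compose/docker-compose.yml`: deployment-specific; images default to the `:${PKG_VERSION}` version tag and can be overridden via `.env`.
- `templates/client_gpu/compose/docker-compose.yml`: GPU-node-specific; uses the `:${PKG_VERSION}` version tag.
Note: M1 only produces the installation packages and does not ship installation scripts; install/operations scripts will land in M2 and be included in the packages.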
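Both packages end with manifest/checksums. As a minimal sketch (not part of the original scripts), an unpacked package could be checked against its `checksums.txt` as follows. The function name `verify_package` is hypothetical, GNU `sha256sum` is assumed, and an entry for `checksums.txt` itself is skipped in case the list contains one, since that file is written inside the directory it describes.

```shell
# Hypothetical helper: verify every file listed in checksums.txt under PKG_DIR.
# An entry for checksums.txt itself (if present) is ignored, because that file
# is written while the hashes are being computed.
verify_package() {
  local pkg_dir="$1"
  (
    cd "$pkg_dir" &&
      grep -v '  \./checksums\.txt$' checksums.txt | sha256sum -c --quiet
  )
}
```

Calling `verify_package ./YYYYMMDD` (path is an assumption) returns a non-zero status if any listed file is missing or has changed.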

@ -1,33 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
log() { echo -e "\033[0;34m[INFO]\033[0m $*"; }
warn() { echo -e "\033[1;33m[WARN]\033[0m $*"; }
err() { echo -e "\033[0;31m[ERR ]\033[0m $*" >&2; }
require_cmd() {
local miss=0
for c in "$@"; do
if ! command -v "$c" >/dev/null 2>&1; then err "missing command: $c"; miss=1; fi
done
[[ $miss -eq 0 ]]
}
today_version() { date +%Y%m%d; }
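checksum_dir() {
local dir="$1"; local out="$2"; : > "$out";
# Skip the checksum file itself: it is still being written while the hashes are computed
(cd "$dir" && find . -type f ! -name "$(basename "$out")" -print0 | sort -z | xargs -0 sha256sum) >> "$out"
}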
make_dir() { mkdir -p "$1"; }
copy_tree() {
local src="$1" dst="$2"; rsync -a --delete "$src/" "$dst/" 2>/dev/null || cp -r "$src/." "$dst/";
}
gen_manifest() {
local root="$1"; local out="$2"; : > "$out";
(cd "$root" && find . -maxdepth 4 -type f -printf "%p\n" | sort) >> "$out"
}

@ -1,131 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# Make client GPU package (versioned gpu bundle image, compose, env, docs, busybox)
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
TEMPL_DIR="$ROOT_DIR/deployment_new/templates/client_gpu"
ART_ROOT="$ROOT_DIR/deployment_new/artifact/client_gpu"
# Use deployment_new local common helpers
COMMON_SH="$ROOT_DIR/deployment_new/build/common.sh"
. "$COMMON_SH"
usage(){ cat <<EOF
Build Client-GPU Package (deployment_new)
Usage: $(basename "$0") --version YYYYMMDD [--image IMAGE[:TAG]]
Defaults:
image = argus-sys-metric-test-node-bundle-gpu:latest
Outputs: deployment_new/artifact/client_gpu/<YYYYMMDD>/ and client_gpu_YYYYMMDD.tar.gz
EOF
}
VERSION=""
IMAGE="argus-sys-metric-test-node-bundle-gpu:latest"
while [[ $# -gt 0 ]]; do
case "$1" in
--version) VERSION="$2"; shift 2;;
--image) IMAGE="$2"; shift 2;;
-h|--help) usage; exit 0;;
*) err "unknown arg: $1"; usage; exit 1;;
esac
done
if [[ -z "$VERSION" ]]; then VERSION="$(today_version)"; fi
require_cmd docker tar gzip
STAGE="$(mktemp -d)"; trap 'rm -rf "$STAGE"' EXIT
PKG_DIR="$ART_ROOT/$VERSION"
mkdir -p "$PKG_DIR" "$STAGE/images" "$STAGE/compose" "$STAGE/docs" "$STAGE/scripts" "$STAGE/private/argus"
# 1) Save GPU bundle image with version tag
if ! docker image inspect "$IMAGE" >/dev/null 2>&1; then
err "missing image: $IMAGE"; exit 1; fi
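# Strip the tag only when the last path component carries one, so a registry port (host:5000/img) is preserved
REPO="$IMAGE"; if [[ "${IMAGE##*/}" == *:* ]]; then REPO="${IMAGE%:*}"; fi; TAG_VER="$REPO:$VERSION"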
docker tag "$IMAGE" "$TAG_VER"
out_tar="$STAGE/images/${REPO//\//-}-$VERSION.tar"
docker save -o "$out_tar" "$TAG_VER"
gzip -f "$out_tar"
# 2) Busybox tar for connectivity/overlay warmup (prefer local template; fallback to docker save)
BB_SRC="$TEMPL_DIR/images/busybox.tar"
if [[ -f "$BB_SRC" ]]; then
cp "$BB_SRC" "$STAGE/images/busybox.tar"
else
if docker image inspect busybox:latest >/dev/null 2>&1 || docker pull busybox:latest >/dev/null 2>&1; then
docker save -o "$STAGE/images/busybox.tar" busybox:latest
log "Included busybox from local docker daemon"
else
warn "busybox image not found and cannot pull; skipping busybox.tar"
fi
fi
# 3) Compose + env template and docs/scripts from templates
cp "$TEMPL_DIR/compose/docker-compose.yml" "$STAGE/compose/docker-compose.yml"
ENV_EX="$STAGE/compose/.env.example"
cat >"$ENV_EX" <<EOF
# Generated by make_client_gpu_package.sh
PKG_VERSION=$VERSION
NODE_GPU_BUNDLE_IMAGE_TAG=${REPO}:${VERSION}
# Compose project name (isolation from server stack)
COMPOSE_PROJECT_NAME=argus-client
# Required (no defaults). Must be filled before install.
AGENT_ENV=
AGENT_USER=
AGENT_INSTANCE=
GPU_NODE_HOSTNAME=
# Overlay network (should match the overlay used by the server package)
ARGUS_OVERLAY_NET=argus-sys-net
# From cluster-info.env (server package output)
SWARM_MANAGER_ADDR=
SWARM_JOIN_TOKEN_WORKER=
SWARM_JOIN_TOKEN_MANAGER=
EOF
# 4) Docs from deployment_new templates
CLIENT_DOC_SRC="$TEMPL_DIR/docs"
if [[ -d "$CLIENT_DOC_SRC" ]]; then
rsync -a "$CLIENT_DOC_SRC/" "$STAGE/docs/" >/dev/null 2>&1 || cp -r "$CLIENT_DOC_SRC/." "$STAGE/docs/"
fi
# Placeholder scripts (will be implemented in M2)
cat >"$STAGE/scripts/README.md" <<'EOF'
# Client-GPU Scripts (Placeholder)
This directory will gain the following in M2:
- config.sh / install.sh
For now it is a placeholder so the package layout can be reviewed.
EOF
# 5) Scripts (from deployment_new templates) and Private skeleton
SCRIPTS_SRC="$TEMPL_DIR/scripts"
if [[ -d "$SCRIPTS_SRC" ]]; then
rsync -a "$SCRIPTS_SRC/" "$STAGE/scripts/" >/dev/null 2>&1 || cp -r "$SCRIPTS_SRC/." "$STAGE/scripts/"
find "$STAGE/scripts" -type f -name '*.sh' -exec chmod +x {} + 2>/dev/null || true
fi
mkdir -p "$STAGE/private/argus/agent"
# 6) Manifest & checksums
gen_manifest "$STAGE" "$STAGE/manifest.txt"
checksum_dir "$STAGE" "$STAGE/checksums.txt"
# 7) Move to artifact dir and pack
mkdir -p "$PKG_DIR"
rsync -a "$STAGE/" "$PKG_DIR/" >/dev/null 2>&1 || cp -r "$STAGE/." "$PKG_DIR/"
OUT_TAR_DIR="$(dirname "$PKG_DIR")"
OUT_TAR="$OUT_TAR_DIR/client_gpu_${VERSION}.tar.gz"
log "Creating tarball: $OUT_TAR"
(cd "$PKG_DIR/.." && tar -czf "$OUT_TAR" "$(basename "$PKG_DIR")")
log "Client-GPU package ready: $PKG_DIR"
echo "$OUT_TAR"

@ -1,160 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# Make server deployment package (versioned, per-image tars, full compose, docs, skeleton)
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
TEMPL_DIR="$ROOT_DIR/deployment_new/templates/server"
ART_ROOT="$ROOT_DIR/deployment_new/artifact/server"
# Use deployment_new local common helpers
COMMON_SH="$ROOT_DIR/deployment_new/build/common.sh"
. "$COMMON_SH"
usage(){ cat <<EOF
Build Server Deployment Package (deployment_new)
Usage: $(basename "$0") --version YYYYMMDD
Outputs: deployment_new/artifact/server/<YYYYMMDD>/ and server_YYYYMMDD.tar.gz
EOF
}
VERSION=""
while [[ $# -gt 0 ]]; do
case "$1" in
--version) VERSION="$2"; shift 2;;
-h|--help) usage; exit 0;;
*) err "unknown arg: $1"; usage; exit 1;;
esac
done
if [[ -z "$VERSION" ]]; then VERSION="$(today_version)"; fi
require_cmd docker tar gzip awk sed
IMAGES=(
argus-master
argus-elasticsearch
argus-kibana
argus-metric-prometheus
argus-metric-grafana
argus-alertmanager
argus-web-frontend
argus-web-proxy
)
STAGE="$(mktemp -d)"; trap 'rm -rf "$STAGE"' EXIT
PKG_DIR="$ART_ROOT/$VERSION"
mkdir -p "$PKG_DIR" "$STAGE/images" "$STAGE/compose" "$STAGE/docs" "$STAGE/scripts" "$STAGE/private/argus"
# 1) Save per-image tars with version tag
log "Tagging and saving images (version=$VERSION)"
for repo in "${IMAGES[@]}"; do
if ! docker image inspect "$repo:latest" >/dev/null 2>&1 && ! docker image inspect "$repo:$VERSION" >/dev/null 2>&1; then
err "missing image: $repo (need :latest or :$VERSION)"; exit 1; fi
if docker image inspect "$repo:$VERSION" >/dev/null 2>&1; then
tag="$repo:$VERSION"
else
docker tag "$repo:latest" "$repo:$VERSION"
tag="$repo:$VERSION"
fi
out_tar="$STAGE/images/${repo//\//-}-$VERSION.tar"
docker save -o "$out_tar" "$tag"
gzip -f "$out_tar"
done
# 2) Compose + env template
cp "$TEMPL_DIR/compose/docker-compose.yml" "$STAGE/compose/docker-compose.yml"
ENV_EX="$STAGE/compose/.env.example"
cat >"$ENV_EX" <<EOF
# Generated by make_server_package.sh
PKG_VERSION=$VERSION
# Image tags (can be overridden). Default to versioned tags
MASTER_IMAGE_TAG=argus-master:
ES_IMAGE_TAG=argus-elasticsearch:
KIBANA_IMAGE_TAG=argus-kibana:
PROM_IMAGE_TAG=argus-metric-prometheus:
GRAFANA_IMAGE_TAG=argus-metric-grafana:
ALERT_IMAGE_TAG=argus-alertmanager:
FRONT_IMAGE_TAG=argus-web-frontend:
WEB_PROXY_IMAGE_TAG=argus-web-proxy:
EOF
sed -i "s#:\$#:${VERSION}#g" "$ENV_EX"
# Ports and defaults (based on swarm_tests .env.example)
cat >>"$ENV_EX" <<'EOF'
# Host ports for server compose
MASTER_PORT=32300
ES_HTTP_PORT=9200
KIBANA_PORT=5601
PROMETHEUS_PORT=9090
GRAFANA_PORT=3000
ALERTMANAGER_PORT=9093
WEB_PROXY_PORT_8080=8080
WEB_PROXY_PORT_8081=8081
WEB_PROXY_PORT_8082=8082
WEB_PROXY_PORT_8083=8083
WEB_PROXY_PORT_8084=8084
WEB_PROXY_PORT_8085=8085
# Overlay network name
ARGUS_OVERLAY_NET=argus-sys-net
# UID/GID for volume ownership
ARGUS_BUILD_UID=2133
ARGUS_BUILD_GID=2015
# Compose project name (isolation from other stacks on same host)
COMPOSE_PROJECT_NAME=argus-server
EOF
# 3) Docs (from deployment_new templates)
DOCS_SRC="$TEMPL_DIR/docs"
if [[ -d "$DOCS_SRC" ]]; then
rsync -a "$DOCS_SRC/" "$STAGE/docs/" >/dev/null 2>&1 || cp -r "$DOCS_SRC/." "$STAGE/docs/"
fi
# 4) Scripts (from deployment_new templates)
SCRIPTS_SRC="$TEMPL_DIR/scripts"
if [[ -d "$SCRIPTS_SRC" ]]; then
rsync -a "$SCRIPTS_SRC/" "$STAGE/scripts/" >/dev/null 2>&1 || cp -r "$SCRIPTS_SRC/." "$STAGE/scripts/"
find "$STAGE/scripts" -type f -name '*.sh' -exec chmod +x {} + 2>/dev/null || true
fi
# 5) Private skeleton (minimum)
mkdir -p \
"$STAGE/private/argus/etc" \
"$STAGE/private/argus/master" \
"$STAGE/private/argus/metric/prometheus" \
"$STAGE/private/argus/metric/prometheus/data" \
"$STAGE/private/argus/metric/prometheus/rules" \
"$STAGE/private/argus/metric/prometheus/targets" \
"$STAGE/private/argus/metric/grafana" \
"$STAGE/private/argus/metric/grafana/data" \
"$STAGE/private/argus/metric/grafana/logs" \
"$STAGE/private/argus/metric/grafana/plugins" \
"$STAGE/private/argus/metric/grafana/provisioning/datasources" \
"$STAGE/private/argus/metric/grafana/provisioning/dashboards" \
"$STAGE/private/argus/metric/grafana/data/sessions" \
"$STAGE/private/argus/metric/grafana/data/dashboards" \
"$STAGE/private/argus/metric/grafana/config" \
"$STAGE/private/argus/alert/alertmanager" \
"$STAGE/private/argus/log/elasticsearch" \
"$STAGE/private/argus/log/kibana"
# 6) Manifest & checksums
gen_manifest "$STAGE" "$STAGE/manifest.txt"
checksum_dir "$STAGE" "$STAGE/checksums.txt"
# 7) Move to artifact dir and pack
mkdir -p "$PKG_DIR"
rsync -a "$STAGE/" "$PKG_DIR/" >/dev/null 2>&1 || cp -r "$STAGE/." "$PKG_DIR/"
OUT_TAR_DIR="$(dirname "$PKG_DIR")"
OUT_TAR="$OUT_TAR_DIR/server_${VERSION}.tar.gz"
log "Creating tarball: $OUT_TAR"
(cd "$PKG_DIR/.." && tar -czf "$OUT_TAR" "$(basename "$PKG_DIR")")
log "Server package ready: $PKG_DIR"
echo "$OUT_TAR"

@ -1,38 +0,0 @@
version: "3.8"
networks:
argus-sys-net:
external: true
services:
metric-gpu-node:
image: ${NODE_GPU_BUNDLE_IMAGE_TAG:-argus-sys-metric-test-node-bundle-gpu:${PKG_VERSION}}
container_name: argus-metric-gpu-node-swarm
hostname: ${GPU_NODE_HOSTNAME}
restart: unless-stopped
privileged: true
runtime: nvidia
environment:
- TZ=Asia/Shanghai
- DEBIAN_FRONTEND=noninteractive
- MASTER_ENDPOINT=${MASTER_ENDPOINT:-http://master.argus.com:3000}
# Fluent Bit / log shipping target (fixed domain name)
- ES_HOST=es.log.argus.com
- ES_PORT=9200
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
- AGENT_ENV=${AGENT_ENV}
- AGENT_USER=${AGENT_USER}
- AGENT_INSTANCE=${AGENT_INSTANCE}
- NVIDIA_VISIBLE_DEVICES=all
- NVIDIA_DRIVER_CAPABILITIES=compute,utility
- GPU_MODE=gpu
networks:
argus-sys-net:
aliases:
- ${AGENT_INSTANCE}.node.argus.com
volumes:
- ../private/argus/agent:/private/argus/agent
- ../logs/infer:/logs/infer
- ../logs/train:/logs/train
command: ["sleep", "infinity"]

@ -1,73 +0,0 @@
# Argus Client-GPU Installation Guide (deployment_new)
## 1. Prerequisites (confirm before you start)
- The GPU node has the NVIDIA driver installed and `nvidia-smi` works;
- Docker and Docker Compose v2 are installed;
- Run the installation as the shared account `argus` (UID=2133, GID=2015) and add it to the `docker` group (skip if the account already exists):
```bash
sudo groupadd --gid 2015 argus || true
sudo useradd --uid 2133 --gid 2015 --create-home --shell /bin/bash argus || true
sudo passwd argus
sudo usermod -aG docker argus
su - argus -c 'id; docker ps >/dev/null && echo OK || echo NO_DOCKER_PERMISSION'
```
All later unpacking and script runs (config/install/uninstall) are done as the `argus` account.
- Obtain `cluster-info.env` from whoever installed the Server (it contains `SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_*`; BINDIP/FTPIP are no longer used under the compose architecture).
## 2. Unpack
- `tar -xzf client_gpu_YYYYMMDD.tar.gz`
- Enter the directory: `cd client_gpu_YYYYMMDD/`
- You should see: `images/` (GPU bundle, busybox), `compose/`, `scripts/`, `docs/`
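## 3. Configure: config (warm up the overlay + generate .env)
Command:
```
cp /path/to/cluster-info.env ./ # or export CLUSTER_INFO=/abs/path/cluster-info.env
./scripts/config.sh
```
What the script does:
- Reads `cluster-info.env` and runs `docker swarm join` (idempotent);
- Automatically warms up the external overlay `argus-sys-net` with busybox, waiting up to 60s until it is visible on this host;
- Generates/updates `compose/.env`: fills in `SWARM_*` and keeps the `AGENT_*` and `GPU_NODE_HOSTNAME` values you already entered (they are not overwritten).
What counts as success:
- The terminal reports that `compose/.env` was generated and that `scripts/install.sh` can now be run;
- `compose/.env` contains at least:
  - `AGENT_ENV/AGENT_USER/AGENT_INSTANCE/GPU_NODE_HOSTNAME` (you must fill these in yourself);
  - `SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_*`
  - `NODE_GPU_BUNDLE_IMAGE_TAG=...:YYYYMMDD`
### Log mapping (important)
- `/logs/infer` and `/logs/train` inside the container are mapped to `./logs/infer` and `./logs/train` under the package root;
- You can view inference/training logs directly on the host: `tail -f logs/infer/*.log`, `tail -f logs/train/*.log`;
- The install script creates both directories automatically.
If you are told that required values are missing:
- Open `compose/.env`, fill in `AGENT_*` and `GPU_NODE_HOSTNAME` as prompted, then run `./scripts/config.sh` again (the script does not overwrite values you have already set).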
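Before re-running the script, the required entries can be checked by hand. The following is a minimal sketch, assuming the `KEY=VALUE` layout of `compose/.env`; `missing_keys` is a hypothetical helper name and is not shipped in the package.

```shell
# Hypothetical helper: print each key from the argument list whose value in the
# env file is empty or absent.
missing_keys() {
  local file="$1"; shift
  local key val
  for key in "$@"; do
    # "|| true" keeps an absent key from aborting callers that use "set -e -o pipefail"
    val="$(grep -E "^${key}=" "$file" | head -n 1 | cut -d= -f2- || true)"
    [ -n "$val" ] || printf '%s\n' "$key"
  done
}
```

For example, `missing_keys compose/.env AGENT_ENV AGENT_USER AGENT_INSTANCE GPU_NODE_HOSTNAME` prints nothing once all four values are filled in.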
## 4. Install: install (load images + start the container + follow logs)
Command:
```
./scripts/install.sh
```
What the script does:
- Warms up the overlay first if necessary;
- Loads `argus-sys-metric-test-node-bundle-gpu-*.tar.gz` from `images/` into the local Docker daemon;
- Starts the GPU node container with `docker compose up -d`, then automatically runs `docker logs -f argus-metric-gpu-node-swarm` to follow the installation.
What counts as success:
- The log shows: `[BOOT] local bundle install OK: version=...` / `dcgm-exporter ... listening` / `node state present: /private/argus/agent/<hostname>/node.json`;
- `docker exec argus-metric-gpu-node-swarm nvidia-smi -L` lists the GPUs;
- On the Server side, Prometheus `/api/v1/targets` shows at least one of the GPU node's targets, 9100 (node-exporter) or 9400 (dcgm-exporter), as up.
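## 5. Uninstall: uninstall
Command:
```
./scripts/uninstall.sh
```
Behavior: runs Compose down (if `.env` exists) and removes the warmup container and the node container.
## 6. FAQ
- Overlay not visible on this host: config/install already warm it up automatically; if it still fails, check network connectivity to the manager and whether `argus-sys-net` has been created on the manager.
- busybox missing: make sure `images/busybox.tar` exists under the package root, or that the host already has `busybox:latest`.
- Joining the Swarm fails: confirm that `SWARM_MANAGER_ADDR` and `SWARM_JOIN_TOKEN_WORKER` in `cluster-info.env` are correct, or re-run `docker swarm join-token -q worker` on the manager and update the file.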

@ -1,90 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_EX="$PKG_ROOT/compose/.env.example"
ENV_OUT="$PKG_ROOT/compose/.env"
info(){ echo -e "\033[34m[CONFIG-GPU]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "missing dependency: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
require_compose(){
if docker compose version >/dev/null 2>&1; then return 0; fi
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
}
require docker curl jq awk sed tar gzip
require_compose
# Disk space check (MB)
check_disk(){ local p="$1"; local need=10240; local free
free=$(df -Pm "$p" | awk 'NR==2{print $4+0}')
if [[ -z "$free" || "$free" -lt "$need" ]]; then err "insufficient disk space: $p has ${free:-0}MB free (<${need}MB)"; return 1; fi
}
check_disk "$PKG_ROOT"; check_disk "/var/lib/docker" || true
# Load cluster-info.env (defaults to the package root; CLUSTER_INFO can point elsewhere)
CI_IN="${CLUSTER_INFO:-$PKG_ROOT/cluster-info.env}"
info "reading cluster-info.env: $CI_IN"
[[ -f "$CI_IN" ]] || { err "cluster-info.env not found (expected in the package root, or set CLUSTER_INFO to an absolute path)"; exit 1; }
set -a; source "$CI_IN"; set +a
[[ -n "${SWARM_MANAGER_ADDR:-}" && -n "${SWARM_JOIN_TOKEN_WORKER:-}" ]] || { err "cluster-info.env is missing Swarm information (SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_WORKER)"; exit 1; }
# Join the Swarm (idempotent)
info "joining Swarm (idempotent): $SWARM_MANAGER_ADDR"
docker swarm join --token "$SWARM_JOIN_TOKEN_WORKER" "$SWARM_MANAGER_ADDR":2377 >/dev/null 2>&1 || true
# Load busybox, then warm up the overlay and test connectivity (always runs)
NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
# Prepare busybox
if ! docker image inspect busybox:latest >/dev/null 2>&1; then
if [[ -f "$PKG_ROOT/images/busybox.tar" ]]; then
info "loading busybox.tar to warm up the overlay"
docker load -i "$PKG_ROOT/images/busybox.tar" >/dev/null
else
err "busybox image missing (images/busybox.tar in the package or a local busybox:latest); cannot warm up overlay $NET_NAME"; exit 1
fi
fi
# Warmup container (joins the overlay on the worker side so the network becomes visible locally)
docker rm -f argus-net-warmup >/dev/null 2>&1 || true
info "starting warmup container to join overlay: $NET_NAME"
docker run -d --rm --name argus-net-warmup --network "$NET_NAME" busybox:latest sleep 600 >/dev/null 2>&1 || true
for i in {1..60}; do docker network inspect "$NET_NAME" >/dev/null 2>&1 && { info "overlay visible (t=${i}s)"; break; }; sleep 1; done
docker network inspect "$NET_NAME" >/dev/null 2>&1 || { err "overlay still not visible after warmup: $NET_NAME; confirm the manager has created it and the network is reachable"; exit 1; }
# Test the real data path through the warmup container (alias -> master)
if ! docker exec argus-net-warmup sh -lc "ping -c 1 -W 2 master.argus.com >/dev/null 2>&1"; then
err "master.argus.com is not reachable by alias from the warmup container; confirm the server compose stack is up and attached to overlay $NET_NAME"
exit 1
fi
info "master.argus.com is reachable from the warmup container (Docker DNS + alias OK)"
# Generate/update .env (manually entered values are kept; only the SWARM_* keys are rewritten)
if [[ ! -f "$ENV_OUT" ]]; then
cp "$ENV_EX" "$ENV_OUT"
fi
set_kv(){ local k="$1" v="$2"; if grep -q "^${k}=" "$ENV_OUT"; then sed -i -E "s#^${k}=.*#${k}=${v}#" "$ENV_OUT"; else echo "${k}=${v}" >> "$ENV_OUT"; fi }
set_kv SWARM_MANAGER_ADDR "${SWARM_MANAGER_ADDR:-}"
set_kv SWARM_JOIN_TOKEN_WORKER "${SWARM_JOIN_TOKEN_WORKER:-}"
set_kv SWARM_JOIN_TOKEN_MANAGER "${SWARM_JOIN_TOKEN_MANAGER:-}"
REQ_VARS=(AGENT_ENV AGENT_USER AGENT_INSTANCE GPU_NODE_HOSTNAME)
missing=()
for v in "${REQ_VARS[@]}"; do
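# "|| true": an absent key must count as missing instead of aborting under set -e/pipefail
val=$(grep -E "^$v=" "$ENV_OUT" | head -1 | cut -d= -f2- || true)
if [[ -z "$val" ]]; then missing+=("$v"); fi
done
if [[ ${#missing[@]} -gt 0 ]]; then
err "the following variables must be set in compose/.env: ${missing[*]} (your existing values are kept and will not be overwritten)"; exit 1; fi
info "compose/.env generated; you can now run scripts/install.sh"
# Prepare the host log directories and set their permissions (idempotent; allows manual inspection or pre-creation before install)
mkdir -p "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer"
chmod 1777 "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer" || true
info "log directory permissions (expected 1777, sticky bit included):"
stat -c '%a %U:%G %n' "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer" 2>/dev/null || true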

@ -1,72 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
info(){ echo -e "\033[34m[INSTALL-GPU]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "missing dependency: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
require_compose(){
if docker compose version >/dev/null 2>&1; then return 0; fi
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
}
require docker nvidia-smi
require_compose
[[ -f "$ENV_FILE" ]] || { err "compose/.env is missing; run scripts/config.sh first"; exit 1; }
info "using env file: $ENV_FILE"
# Warm up the overlay (if config ran long ago or containers were cleaned up, the warmup container may be gone)
set -a; source "$ENV_FILE"; set +a
NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
info "checking overlay network visibility: $NET_NAME"
if ! docker network inspect "$NET_NAME" >/dev/null 2>&1; then
# If the overlay is not visible, try a busybox warmup (only to make sure this worker node has joined the overlay)
if ! docker image inspect busybox:latest >/dev/null 2>&1; then
if [[ -f "$PKG_ROOT/images/busybox.tar" ]]; then docker load -i "$PKG_ROOT/images/busybox.tar"; else err "busybox image missing (images/busybox.tar or a local busybox:latest)"; exit 1; fi
fi
docker rm -f argus-net-warmup >/dev/null 2>&1 || true
docker run -d --rm --name argus-net-warmup --network "$NET_NAME" busybox:latest sleep 600 >/dev/null 2>&1 || true
for i in {1..60}; do docker network inspect "$NET_NAME" >/dev/null 2>&1 && break; sleep 1; done
docker network inspect "$NET_NAME" >/dev/null 2>&1 || { err "overlay still not visible after warmup: $NET_NAME; confirm the manager has created it and the network is reachable"; exit 1; }
info "overlay is visible (warmup=argus-net-warmup)"
fi
# If a warmup container is running (for example, one just recreated above), test the alias data path once more
if docker ps --format '{{.Names}}' | grep -q '^argus-net-warmup$'; then
if ! docker exec argus-net-warmup sh -lc "ping -c 1 -W 2 master.argus.com >/dev/null 2>&1"; then
err "GPU install stage: master.argus.com is not reachable by alias from the warmup container; check overlay $NET_NAME and the server status"
exit 1
fi
info "GPU install stage: master.argus.com is reachable from the warmup container"
fi
# Load the GPU bundle image
IMG_TGZ=$(ls -1 "$PKG_ROOT"/images/argus-sys-metric-test-node-bundle-gpu-*.tar.gz 2>/dev/null | head -1 || true)
[[ -n "$IMG_TGZ" ]] || { err "GPU bundle image tar.gz not found"; exit 1; }
info "loading GPU bundle image: $(basename "$IMG_TGZ")"
tmp=$(mktemp); gunzip -c "$IMG_TGZ" > "$tmp"; docker load -i "$tmp" >/dev/null; rm -f "$tmp"
# Make sure the host-side log directories exist (mapped to /logs/infer and /logs/train) and set mode 1777 (sticky bit)
mkdir -p "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train"
chmod 1777 "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" || true
info "log directories prepared with mode 1777: logs/infer logs/train"
stat -c '%a %U:%G %n' "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" 2>/dev/null || true
# Start compose and follow the logs
PROJECT="${COMPOSE_PROJECT_NAME:-argus-client}"
info "starting GPU node (docker compose -p $PROJECT up -d)"
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" up -d
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps
# Re-apply the host log directory permissions in case scripts inside the container reverted them on the bind mount
chmod 1777 "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" || true
stat -c '%a %U:%G %n' "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" 2>/dev/null || true
info "following the node container logs (press Ctrl+C to exit)"
docker logs -f argus-metric-gpu-node-swarm || true

@ -1,36 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
# load COMPOSE_PROJECT_NAME if provided in compose/.env
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
PROJECT="${COMPOSE_PROJECT_NAME:-argus-client}"
info(){ echo -e "\033[34m[UNINSTALL-GPU]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
require_compose(){
if docker compose version >/dev/null 2>&1; then return 0; fi
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
}
require_compose
if [[ -f "$ENV_FILE" ]]; then
info "stopping compose project (project=$PROJECT)"
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" down --remove-orphans || true
else
info "compose/.env not found; attempting to remove container by name"
fi
# remove warmup container if still running
docker rm -f argus-net-warmup >/dev/null 2>&1 || true
# remove node container if present
docker rm -f argus-metric-gpu-node-swarm >/dev/null 2>&1 || true
info "uninstall completed"

@ -1,169 +0,0 @@
version: "3.8"
networks:
argus-sys-net:
external: true
services:
master:
image: ${MASTER_IMAGE_TAG:-argus-master:${PKG_VERSION}}
container_name: argus-master-sys
environment:
- OFFLINE_THRESHOLD_SECONDS=6
- ONLINE_THRESHOLD_SECONDS=2
- SCHEDULER_INTERVAL_SECONDS=1
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
ports:
- "${MASTER_PORT:-32300}:3000"
volumes:
- ../private/argus/master:/private/argus/master
- ../private/argus/metric/prometheus:/private/argus/metric/prometheus
- ../private/argus/etc:/private/argus/etc
networks:
argus-sys-net:
aliases:
- master.argus.com
restart: unless-stopped
es:
image: ${ES_IMAGE_TAG:-argus-elasticsearch:${PKG_VERSION}}
container_name: argus-es-sys
environment:
- discovery.type=single-node
- xpack.security.enabled=false
- ES_JAVA_OPTS=-Xms512m -Xmx512m
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
volumes:
- ../private/argus/log/elasticsearch:/private/argus/log/elasticsearch
- ../private/argus/etc:/private/argus/etc
ports:
- "${ES_HTTP_PORT:-9200}:9200"
restart: unless-stopped
networks:
argus-sys-net:
aliases:
- es.log.argus.com
kibana:
image: ${KIBANA_IMAGE_TAG:-argus-kibana:${PKG_VERSION}}
container_name: argus-kibana-sys
environment:
- ELASTICSEARCH_HOSTS=http://es.log.argus.com:9200
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
volumes:
- ../private/argus/log/kibana:/private/argus/log/kibana
- ../private/argus/etc:/private/argus/etc
depends_on: [es]
ports:
- "${KIBANA_PORT:-5601}:5601"
restart: unless-stopped
networks:
argus-sys-net:
aliases:
- kibana.log.argus.com
prometheus:
image: ${PROM_IMAGE_TAG:-argus-metric-prometheus:${PKG_VERSION}}
container_name: argus-prometheus
restart: unless-stopped
environment:
- TZ=Asia/Shanghai
- PROMETHEUS_BASE_PATH=/private/argus/metric/prometheus
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
ports:
- "${PROMETHEUS_PORT:-9090}:9090"
volumes:
- ../private/argus/metric/prometheus:/private/argus/metric/prometheus
- ../private/argus/etc:/private/argus/etc
networks:
argus-sys-net:
aliases:
- prom.metric.argus.com
grafana:
image: ${GRAFANA_IMAGE_TAG:-argus-metric-grafana:${PKG_VERSION}}
container_name: argus-grafana
restart: unless-stopped
environment:
- TZ=Asia/Shanghai
- GRAFANA_BASE_PATH=/private/argus/metric/grafana
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
- GF_SERVER_HTTP_PORT=3000
- GF_LOG_LEVEL=warn
- GF_LOG_MODE=console
- GF_PATHS_PROVISIONING=/private/argus/metric/grafana/provisioning
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
ports:
- "${GRAFANA_PORT:-3000}:3000"
volumes:
- ../private/argus/metric/grafana:/private/argus/metric/grafana
- ../private/argus/etc:/private/argus/etc
depends_on: [prometheus]
networks:
argus-sys-net:
aliases:
- grafana.metric.argus.com
alertmanager:
image: ${ALERT_IMAGE_TAG:-argus-alertmanager:${PKG_VERSION}}
container_name: argus-alertmanager
environment:
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
volumes:
- ../private/argus/etc:/private/argus/etc
- ../private/argus/alert/alertmanager:/private/argus/alert/alertmanager
networks:
argus-sys-net:
aliases:
- alertmanager.alert.argus.com
ports:
- "${ALERTMANAGER_PORT:-9093}:9093"
restart: unless-stopped
web-frontend:
image: ${FRONT_IMAGE_TAG:-argus-web-frontend:${PKG_VERSION}}
container_name: argus-web-frontend
environment:
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
- EXTERNAL_MASTER_PORT=${WEB_PROXY_PORT_8085:-8085}
- EXTERNAL_ALERTMANAGER_PORT=${WEB_PROXY_PORT_8084:-8084}
- EXTERNAL_GRAFANA_PORT=${WEB_PROXY_PORT_8081:-8081}
- EXTERNAL_PROMETHEUS_PORT=${WEB_PROXY_PORT_8082:-8082}
- EXTERNAL_KIBANA_PORT=${WEB_PROXY_PORT_8083:-8083}
volumes:
- ../private/argus/etc:/private/argus/etc
networks:
argus-sys-net:
aliases:
- web.argus.com
restart: unless-stopped
web-proxy:
image: ${WEB_PROXY_IMAGE_TAG:-argus-web-proxy:${PKG_VERSION}}
container_name: argus-web-proxy
depends_on: [master, grafana, prometheus, kibana, alertmanager]
environment:
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
volumes:
- ../private/argus/etc:/private/argus/etc
networks:
argus-sys-net:
aliases:
- proxy.argus.com
ports:
- "${WEB_PROXY_PORT_8080:-8080}:8080"
- "${WEB_PROXY_PORT_8081:-8081}:8081"
- "${WEB_PROXY_PORT_8082:-8082}:8082"
- "${WEB_PROXY_PORT_8083:-8083}:8083"
- "${WEB_PROXY_PORT_8084:-8084}:8084"
- "${WEB_PROXY_PORT_8085:-8085}:8085"
restart: unless-stopped

@ -1,102 +0,0 @@
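# Argus Server Installation Guide (deployment_new)
Applies to: deploying the Argus server-side components as one stack on Docker Swarm with an external overlay network, using the Server installation package.
This guide focuses on what to do, what to look for, and continuing only once each check passes.
## 1. Prerequisites (confirm before you start)
- Docker and Docker Compose v2 are installed; `docker info` works; `docker compose version` runs.
- You have root/sudo privileges; free disk space is at least 10 GB (package root and `/var/lib/docker`).
- You know this host's management address (SWARM_MANAGER_ADDR); the IP belongs to one of this host's network interfaces and is reachable from other nodes.
- Important: run all subsequent installation and operations as the shared account `argus` (UID=2133, GID=2015) and add it to the `docker` group. Example commands follow (if you need a different UID/GID, substitute your own standard values):
```bash
# 1) Create the primary group (GID=2015, group name argus; skip if it already exists)
sudo groupadd --gid 2015 argus || true
# 2) Create user argus (UID=2133, primary group GID=2015) with a home directory and bash as the default shell (if it already exists, adjust it with usermod)
sudo useradd --uid 2133 --gid 2015 --create-home --shell /bin/bash argus || true
sudo passwd argus
# 3) Add argus to the docker group so it can talk to the Docker daemon (takes effect after a new login)
sudo usermod -aG docker argus
# 4) Verify (log in again or run newgrp docker for the group to take effect)
su - argus -c 'id; docker ps >/dev/null && echo OK || echo NO_DOCKER_PERMISSION'
```
All later unpacking and script runs (config/install/selfcheck, etc.) are done as this `argus` account.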
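## 2. Unpack and directory layout
- Unpack: `tar -xzf server_YYYYMMDD.tar.gz`
- Enter: `cd server_YYYYMMDD/`
- You should see:
  - `images/` (per-service image tar.gz files, e.g. `argus-master-YYYYMMDD.tar.gz`)
  - `compose/` (`docker-compose.yml` and `.env.example`)
  - `scripts/` (install/operations scripts)
  - `private/argus/` (data and configuration skeleton)
  - `docs/` (Chinese documentation)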
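## 3. Configure: config (generate .env and SWARM_MANAGER_ADDR)
Command:
```
export SWARM_MANAGER_ADDR=<management IP of this host>
./scripts/config.sh
```
What the script does:
- Checks dependencies and disk space;
- Automatically assigns every service port starting from port 20000, making sure each port is not in use on the system and that the ports do not conflict with one another;
- Writes `compose/.env` (ports, image tags, overlay name, UID/GID, and so on);
- Writes the UID/GID of the account running the script into `ARGUS_BUILD_UID/GID` (if the primary group is named docker, it uses the GID of the group with the same name as the user instead, to avoid picking up the docker group's GID 999);
- Updates or appends `SWARM_MANAGER_ADDR` in `cluster-info.env` (other keys are not overwritten).
What counts as success:
- The terminal reports that `compose/.env` was generated and that `SWARM_MANAGER_ADDR` in `cluster-info.env` was updated.
- Opening `compose/.env` should show:
  - all ports are ≥ 20000 with no duplicates;
  - `ARGUS_BUILD_UID/GID` match `id -u/-g`;
  - `SWARM_MANAGER_ADDR=<your IP>`.
If something goes wrong:
- A port is unexpectedly occupied: delete `.env` and run `config.sh` again, or edit the ports by hand and then run `install.sh`.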
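The port assignment rule (start at 20000, skip ports already in use, never hand out the same port twice) can be sketched as below. This is a simplified illustration rather than the script itself: `allocate_ports` is a hypothetical helper, and the busy ports are passed in as a plain string, whereas a real host would be queried for its listening sockets.

```shell
# Hypothetical helper: allocate COUNT ports starting at START, skipping every
# port in BUSY (a space-separated list standing in for a live socket lookup).
allocate_ports() {
  local start="$1" count="$2" busy=" $3 "
  local p="$start" chosen=""
  while [ "$count" -gt 0 ]; do
    case "$busy" in
      *" $p "*) ;;  # taken by the system or by an earlier choice: try the next number
      *) chosen="$chosen $p"; busy="$busy$p "; count=$((count - 1)) ;;
    esac
    p=$((p + 1))
  done
  echo "${chosen# }"
}

# allocate_ports 20000 4 "20001 20003"   # prints: 20000 20002 20004 20005
```

Recording each chosen port in the busy list is what keeps two services from receiving the same number.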
## 4. Install: install (everything in one step)
Command:
```
./scripts/install.sh
```
What the script does:
- If Swarm is not active: runs `docker swarm init --advertise-addr $SWARM_MANAGER_ADDR`;
- Makes sure the external overlay `argus-sys-net` exists;
- Loads `images/*.tar.gz` into the local Docker daemon;
- Starts the services with `docker compose up -d`;
- Waits for the six readiness checks:
  - Master `/readyz`=200, ES `/_cluster/health`=200, Prometheus TCP reachable, Grafana `/api/health`=200, Alertmanager `/api/v2/status`=200, Kibana `/api/status` level=available;
- Verifies Docker DNS + overlay aliases: inside `argus-web-proxy`, uses `getent hosts` and `curl` to check connectivity to domain names such as `master.argus.com` and `grafana.metric.argus.com`;
- Writes `cluster-info.env` (containing `SWARM_JOIN_TOKEN_{WORKER,MANAGER}/SWARM_MANAGER_ADDR`; BINDIP/FTPIP are no longer required under the compose architecture);
- Generates `安装报告_YYYYMMDD-HHMMSS.md` (an installation report with ports, a health-check summary, and hints).
What counts as success:
- `docker compose ps` shows every service as Up;
- every HTTP check in `安装报告_…md` is 200/available;
- `cluster-info.env` contains the three key entries:
  - `SWARM_MANAGER_ADDR=...`
  - `SWARM_JOIN_TOKEN_WORKER=SWMTKN-...`
  - `SWARM_JOIN_TOKEN_MANAGER=SWMTKN-...`
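A readiness wait of the kind listed in this section can be sketched as a small retry helper. This is an illustrative assumption about how such a wait might look, not the code of `install.sh`; `wait_until` is a hypothetical name.

```shell
# Hypothetical helper: run a command up to TRIES times, sleeping DELAY seconds
# between attempts; succeed as soon as the command exits with status 0.
wait_until() {
  local tries="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$tries" ]; do
    if "$@" >/dev/null 2>&1; then return 0; fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_until 60 2 curl -fsS "http://127.0.0.1:${MASTER_PORT}/readyz"` would poll the Master readiness endpoint for roughly two minutes before giving up.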
## 5. Health self-check and common operations
- Health self-check: `./scripts/selfcheck.sh`
  - Expected output: `selfcheck OK -> logs/selfcheck.json`
  - In `logs/selfcheck.json`, `overlay_net/es/kibana/master_readyz/prometheus/grafana/alertmanager/web_proxy_cors` are all true.
- Status: `./scripts/status.sh` (equivalent to `docker compose ps`).
- Diagnostics: `./scripts/diagnose.sh` (collects container/HTTP/CORS/ES details and writes them to `logs/diagnose_*.log`).
- Uninstall: `./scripts/uninstall.sh` (Compose down).
- Temporarily relax/restore the ES disk watermarks: `./scripts/es-watermark-relax.sh` / `./scripts/es-watermark-restore.sh`.
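## 6. Next step: hand cluster-info.env to the Client installer
- Copy `cluster-info.env` to the colleague who installs the Client;
- They place the file in the package root on the Client machine (or set `CLUSTER_INFO=/absolute/path`).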
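## 7. Troubleshooting at a glance
- Proxy 502 or connection reset on 8080: usually the overlay alias has not taken effect or web-proxy has not yet resolved the other services; re-run `install.sh` (it restarts the stack and verifies DNS inside the container), or check `logs/diagnose_error.log`.
- Kibana is not available: wait 1-2 minutes and check the `argus-kibana-sys` logs;
- SWARM_MANAGER_ADDR in cluster-info.env is empty: run `export SWARM_MANAGER_ADDR=<IP>; ./scripts/config.sh` again, or run `./scripts/install.sh` (it reads `.env` back and fills the value in).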

@ -1,7 +0,0 @@
# Docker Swarm deployment notes
- Initialize Swarm: `docker swarm init --advertise-addr <SWARM_MANAGER_ADDR>`
- Create the overlay: `docker network create --driver overlay --attachable argus-sys-net`
- The Server package's `install.sh` performs the steps above automatically; if you run them by hand, make sure `argus-sys-net` exists and is attachable.
- Join a worker node: `docker swarm join --token <worker_token> <SWARM_MANAGER_ADDR>:2377`
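When running the manager-side steps by hand, it helps to make them safe to repeat. The sketch below shows one way to do that under stated assumptions: `ensure_swarm` and `ensure_overlay` are hypothetical names and are not part of the package scripts.

```shell
# Hypothetical helper: initialize Swarm only when this host is not already an
# active Swarm node.
ensure_swarm() {
  local addr="$1" state
  state="$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null || true)"
  if [ "$state" = "active" ]; then
    echo "swarm already active"
  else
    docker swarm init --advertise-addr "$addr"
  fi
}

# Hypothetical helper: create the attachable overlay network only when it is
# not already present.
ensure_overlay() {
  local net="${1:-argus-sys-net}"
  if docker network inspect "$net" >/dev/null 2>&1; then
    echo "overlay $net already exists"
  else
    docker network create --driver overlay --attachable "$net"
  fi
}
```

On the manager this would be called as `ensure_swarm "$SWARM_MANAGER_ADDR"` followed by `ensure_overlay argus-sys-net`.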

@ -1,11 +0,0 @@
# Troubleshooting (Server)
- Port in use: see the port table in `安装报告_*.md` (the installation report); to change a port, edit `compose/.env` and run `docker compose ... up -d`
- A component is not ready:
  - Master: `curl http://127.0.0.1:${MASTER_PORT}/readyz -I`
  - ES: `curl http://127.0.0.1:${ES_HTTP_PORT}/_cluster/health`
  - Grafana: `curl http://127.0.0.1:${GRAFANA_PORT}/api/health`
  - Prometheus TCP: `exec 3<>/dev/tcp/127.0.0.1/${PROMETHEUS_PORT}`
- Name resolution: enter the `argus-web-proxy` or `argus-master-sys` container and run `getent hosts master.argus.com`
- Swarm/Overlay: check `docker network ls | grep argus-sys-net` or `docker node ls`

@ -1,108 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_EX="$PKG_ROOT/compose/.env.example"
ENV_OUT="$PKG_ROOT/compose/.env"
info(){ echo -e "\033[34m[CONFIG]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "Missing dependency: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
require_compose(){
if docker compose version >/dev/null 2>&1; then return 0; fi
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
}
require docker curl jq awk sed tar gzip
require_compose
# Disk space check (MB)
check_disk(){ local p="$1"; local need=10240; local free
free=$(df -Pm "$p" | awk 'NR==2{print $4+0}')
if [[ -z "$free" || "$free" -lt "$need" ]]; then err "Insufficient disk space: $p has ${free:-0}MB free (<${need}MB)"; return 1; fi
}
check_disk "$PKG_ROOT"; check_disk "/var/lib/docker" || true
# Read SWARM_MANAGER_ADDR from the environment, or prompt for it
SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}
if [[ -z "${SWARM_MANAGER_ADDR}" ]]; then
read -rp "Enter this host's management address (SWARM_MANAGER_ADDR): " SWARM_MANAGER_ADDR
fi
info "SWARM_MANAGER_ADDR=$SWARM_MANAGER_ADDR"
# Verify that the IP belongs to one of this host's interfaces
if ! ip -o addr | awk '{print $4}' | cut -d'/' -f1 | grep -qx "$SWARM_MANAGER_ADDR"; then
err "SWARM_MANAGER_ADDR is not an address of this host: $SWARM_MANAGER_ADDR"; exit 1; fi
info "Allocating service ports (start=20000, avoiding ports already in use and mutual conflicts)"
is_port_used(){ local p="$1"; ss -tulnH 2>/dev/null | awk '{print $5}' | sed 's/.*://g' | grep -qx "$p"; }
declare -A PRESENT=() CHOSEN=() USED=()
START_PORT="${START_PORT:-20000}"; cur=$START_PORT
ORDER=(MASTER_PORT ES_HTTP_PORT KIBANA_PORT PROMETHEUS_PORT GRAFANA_PORT ALERTMANAGER_PORT \
WEB_PROXY_PORT_8080 WEB_PROXY_PORT_8081 WEB_PROXY_PORT_8082 WEB_PROXY_PORT_8083 WEB_PROXY_PORT_8084 WEB_PROXY_PORT_8085 \
FTP_PORT FTP_DATA_PORT)
# Mark the keys that actually exist in .env.example
for key in "${ORDER[@]}"; do
if grep -q "^${key}=" "$ENV_EX"; then PRESENT[$key]=1; fi
done
next_free(){ local p="$1"; while :; do if [[ -n "${USED[$p]:-}" ]] || is_port_used "$p"; then p=$((p+1)); else echo "$p"; return; fi; done; }
for key in "${ORDER[@]}"; do
[[ -z "${PRESENT[$key]:-}" ]] && continue
p=$(next_free "$cur"); CHOSEN[$key]="$p"; USED[$p]=1; cur=$((p+1))
done
info "Port allocation: MASTER=${CHOSEN[MASTER_PORT]:-} ES=${CHOSEN[ES_HTTP_PORT]:-} KIBANA=${CHOSEN[KIBANA_PORT]:-} PROM=${CHOSEN[PROMETHEUS_PORT]:-} GRAFANA=${CHOSEN[GRAFANA_PORT]:-} ALERT=${CHOSEN[ALERTMANAGER_PORT]:-} WEB_PROXY(8080..8085)=${CHOSEN[WEB_PROXY_PORT_8080]:-}/${CHOSEN[WEB_PROXY_PORT_8081]:-}/${CHOSEN[WEB_PROXY_PORT_8082]:-}/${CHOSEN[WEB_PROXY_PORT_8083]:-}/${CHOSEN[WEB_PROXY_PORT_8084]:-}/${CHOSEN[WEB_PROXY_PORT_8085]:-}"
cp "$ENV_EX" "$ENV_OUT"
# Overwrite the ports (write back the de-duplicated result)
for key in "${ORDER[@]}"; do
val="${CHOSEN[$key]:-}"
[[ -z "$val" ]] && continue
sed -i -E "s#^$key=.*#$key=${val}#" "$ENV_OUT"
done
info "Port settings written to compose/.env"
# Add the overlay network name if it is missing
grep -q '^ARGUS_OVERLAY_NET=' "$ENV_OUT" || echo 'ARGUS_OVERLAY_NET=argus-sys-net' >> "$ENV_OUT"
# Write the UID/GID of the invoking account (avoid mistakenly picking the docker group)
RUID=$(id -u)
PRIMARY_GID=$(id -g)
PRIMARY_GRP=$(id -gn)
USER_NAME=$(id -un)
# If the primary group resolves to docker, try the GID of the group named after the user; otherwise fall back to the primary GID
if [[ "$PRIMARY_GRP" == "docker" ]]; then
RGID=$(getent group "$USER_NAME" | awk -F: '{print $3}' 2>/dev/null || true)
[[ -z "$RGID" ]] && RGID="$PRIMARY_GID"
else
RGID="$PRIMARY_GID"
fi
info "Using build account UID:GID=${RUID}:${RGID} (user=$USER_NAME primary_group=$PRIMARY_GRP)"
if grep -q '^ARGUS_BUILD_UID=' "$ENV_OUT"; then
sed -i -E "s#^ARGUS_BUILD_UID=.*#ARGUS_BUILD_UID=${RUID}#" "$ENV_OUT"
else
echo "ARGUS_BUILD_UID=${RUID}" >> "$ENV_OUT"
fi
if grep -q '^ARGUS_BUILD_GID=' "$ENV_OUT"; then
sed -i -E "s#^ARGUS_BUILD_GID=.*#ARGUS_BUILD_GID=${RGID}#" "$ENV_OUT"
else
echo "ARGUS_BUILD_GID=${RGID}" >> "$ENV_OUT"
fi
CI="$PKG_ROOT/cluster-info.env"
if [[ -f "$CI" ]]; then
if grep -q '^SWARM_MANAGER_ADDR=' "$CI"; then
sed -i -E "s#^SWARM_MANAGER_ADDR=.*#SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}#" "$CI"
else
echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}" >> "$CI"
fi
else
echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}" > "$CI"
fi
info "Generated compose/.env and updated SWARM_MANAGER_ADDR in cluster-info.env."
info "Next step: scripts/install.sh"


@ -1,109 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
ENV_FILE="$ROOT/compose/.env"; [[ -f "$ENV_FILE" ]] && set -a && source "$ENV_FILE" && set +a
ts="$(date -u +%Y%m%d-%H%M%SZ)"
LOG_DIR="$ROOT/logs"; mkdir -p "$LOG_DIR" || true
if ! ( : > "$LOG_DIR/.w" 2>/dev/null ); then LOG_DIR="/tmp/argus-logs"; mkdir -p "$LOG_DIR" || true; fi
# load compose project for accurate ps output
ENV_FILE="$ROOT/compose/.env"
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
DETAILS="$LOG_DIR/diagnose_details_${ts}.log"; ERRORS="$LOG_DIR/diagnose_error_${ts}.log"; : > "$DETAILS"; : > "$ERRORS"
logd() { echo "$(date '+%F %T') $*" >> "$DETAILS"; }
append_err() { echo "$*" >> "$ERRORS"; }
http_code() { curl -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
http_body_head() { curl -s --max-time 3 "$1" 2>/dev/null | sed -n '1,5p' || true; }
header_val() { curl -s -D - -o /dev/null "$@" | awk -F': ' 'BEGIN{IGNORECASE=1}$1=="Access-Control-Allow-Origin"{gsub("\r","",$2);print $2}'; }
section() { local name="$1"; logd "===== [$name] ====="; }
svc() {
local svc_name="$1"; local cname="$2"; shift 2
section "$svc_name ($cname)"
logd "docker ps:"; docker ps -a --format '{{.Names}}\t{{.Status}}\t{{.Image}}' | awk -v n="$cname" '$1==n' >> "$DETAILS" || true
logd "docker inspect:"; docker inspect -f '{{.State.Status}} rc={{.RestartCount}} started={{.State.StartedAt}}' "$cname" >> "$DETAILS" 2>&1 || true
logd "last 200 container logs:"; docker logs --tail 200 "$cname" >> "$DETAILS" 2>&1 || true
docker logs --tail 200 "$cname" 2>&1 | grep -Ei '\b(error|failed|fail|exception|panic|fatal|critical|unhealthy|permission denied|forbidden|refused|traceback|错误|失败)\b' | sed "s/^/[${svc_name}][container] /" >> "$ERRORS" || true
if docker exec "$cname" sh -lc 'command -v supervisorctl >/dev/null 2>&1' >/dev/null 2>&1; then
logd "supervisorctl status:"; docker exec "$cname" sh -lc 'supervisorctl status' >> "$DETAILS" 2>&1 || true
local files; files=$(docker exec "$cname" sh -lc 'ls /var/log/supervisor/*.log 2>/dev/null' || true)
for f in $files; do
logd "tail -n 80 $f:"; docker exec "$cname" sh -lc "tail -n 80 $f 2>/dev/null || true" >> "$DETAILS" 2>&1 || true
docker exec "$cname" sh -lc "tail -n 200 $f 2>/dev/null" 2>/dev/null | grep -Ei '\b(error|failed|fail|exception|panic|fatal|critical|unhealthy|permission denied|forbidden|refused|traceback|错误|失败)\b' | sed "s/^/[${svc_name}][supervisor:$(basename $f)] /" >> "$ERRORS" || true
done
fi
}
svc master argus-master-sys
svc es argus-es-sys
svc kibana argus-kibana-sys
svc prometheus argus-prometheus
svc grafana argus-grafana
svc alertmanager argus-alertmanager
svc web-frontend argus-web-frontend
svc web-proxy argus-web-proxy
section HTTP
logd "ES: $(http_code "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health")"; http_body_head "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" >> "$DETAILS" 2>&1 || true
logd "Kibana: $(http_code "http://localhost:${KIBANA_PORT:-5601}/api/status")"; http_body_head "http://localhost:${KIBANA_PORT:-5601}/api/status" >> "$DETAILS" 2>&1 || true
logd "Master readyz: $(http_code "http://localhost:${MASTER_PORT:-32300}/readyz")"
logd "Prometheus: $(http_code "http://localhost:${PROMETHEUS_PORT:-9090}/-/ready")"
logd "Grafana: $(http_code "http://localhost:${GRAFANA_PORT:-3000}/api/health")"; http_body_head "http://localhost:${GRAFANA_PORT:-3000}/api/health" >> "$DETAILS" 2>&1 || true
logd "Alertmanager: $(http_code "http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status")"
cors8084=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8084:-8084}/api/v2/status" || true)
cors8085=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8085:-8085}/api/v1/master/nodes" || true)
logd "Web-Proxy 8080: $(http_code "http://localhost:${WEB_PROXY_PORT_8080:-8080}/")"
logd "Web-Proxy 8083: $(http_code "http://localhost:${WEB_PROXY_PORT_8083:-8083}/")"
logd "Web-Proxy 8084 CORS: ${cors8084}"
logd "Web-Proxy 8085 CORS: ${cors8085}"
section ES-CHECKS
ch=$(curl -s --max-time 3 "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" || true)
status=$(printf '%s' "$ch" | awk -F'\"' '/"status"/{print $4; exit}')
if [[ -n "$status" ]]; then logd "cluster.status=$status"; fi
if [[ "$status" != "green" ]]; then append_err "[es][cluster] status=$status"; fi
if docker ps --format '{{.Names}}' | grep -q '^argus-es-sys$'; then
duse=$(docker exec argus-es-sys sh -lc 'df -P /usr/share/elasticsearch/data | awk "NR==2{print \$5}"' 2>/dev/null || true)
logd "es.data.df_use=$duse"; usep=${duse%%%}
if [[ -n "$usep" ]] && (( usep >= 90 )); then append_err "[es][disk] data path usage=${usep}%"; fi
fi
section DNS-IN-PROXY
for d in master.argus.com es.log.argus.com kibana.log.argus.com grafana.metric.argus.com prom.metric.argus.com alertmanager.alert.argus.com; do
docker exec argus-web-proxy sh -lc "getent hosts $d || nslookup $d 2>/dev/null | tail -n+1" >> "$DETAILS" 2>&1 || true
done
logd "HTTP (web-proxy): master.readyz=$(docker exec argus-web-proxy sh -lc "curl -s -o /dev/null -w '%{http_code}' http://master.argus.com:3000/readyz" 2>/dev/null || echo 000)"
logd "HTTP (web-proxy): es.health=$(docker exec argus-web-proxy sh -lc "curl -s -o /dev/null -w '%{http_code}' http://es.log.argus.com:9200/_cluster/health" 2>/dev/null || echo 000)"
logd "HTTP (web-proxy): kibana.status=$(docker exec argus-web-proxy sh -lc "curl -s -o /dev/null -w '%{http_code}' http://kibana.log.argus.com:5601/api/status" 2>/dev/null || echo 000)"
section SYSTEM
logd "uname -a:"; uname -a >> "$DETAILS"
logd "docker version:"; docker version --format '{{.Server.Version}}' >> "$DETAILS" 2>&1 || true
logd "compose ps (project=$PROJECT):"; (cd "$ROOT/compose" && docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f docker-compose.yml ps) >> "$DETAILS" 2>&1 || true
section SUMMARY
[[ $(http_code "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health") != 200 ]] && echo "[es][http] health not 200" >> "$ERRORS"
kbcode=$(http_code "http://localhost:${KIBANA_PORT:-5601}/api/status"); [[ "$kbcode" != 200 ]] && echo "[kibana][http] /api/status=$kbcode" >> "$ERRORS"
[[ $(http_code "http://localhost:${MASTER_PORT:-32300}/readyz") != 200 ]] && echo "[master][http] /readyz not 200" >> "$ERRORS"
[[ $(http_code "http://localhost:${PROMETHEUS_PORT:-9090}/-/ready") != 200 ]] && echo "[prometheus][http] /-/ready not 200" >> "$ERRORS"
gfcode=$(http_code "http://localhost:${GRAFANA_PORT:-3000}/api/health"); [[ "$gfcode" != 200 ]] && echo "[grafana][http] /api/health=$gfcode" >> "$ERRORS"
[[ $(http_code "http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status") != 200 ]] && echo "[alertmanager][http] /api/v2/status not 200" >> "$ERRORS"
[[ -z "$cors8084" ]] && echo "[web-proxy][cors] 8084 missing Access-Control-Allow-Origin" >> "$ERRORS"
[[ -z "$cors8085" ]] && echo "[web-proxy][cors] 8085 missing Access-Control-Allow-Origin" >> "$ERRORS"
sort -u -o "$ERRORS" "$ERRORS"
echo "Diagnostic details -> $DETAILS"
echo "Detected errors -> $ERRORS"
if [[ "$LOG_DIR" == "$ROOT/logs" ]]; then
ln -sfn "$(basename "$DETAILS")" "$ROOT/logs/diagnose_details.log" 2>/dev/null || cp "$DETAILS" "$ROOT/logs/diagnose_details.log" 2>/dev/null || true
ln -sfn "$(basename "$ERRORS")" "$ROOT/logs/diagnose_error.log" 2>/dev/null || cp "$ERRORS" "$ROOT/logs/diagnose_error.log" 2>/dev/null || true
fi
exit 0


@ -1,11 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
HOST="${1:-http://127.0.0.1:9200}"
echo "Setting ES watermarks to 95%/96%/97%: $HOST"
curl -fsS -XPUT "$HOST/_cluster/settings" -H 'Content-Type: application/json' -d '{
"transient": {
"cluster.routing.allocation.disk.watermark.low": "95%",
"cluster.routing.allocation.disk.watermark.high": "96%",
"cluster.routing.allocation.disk.watermark.flood_stage": "97%"
}
}' && printf '\nOK\n'


@ -1,11 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
HOST="${1:-http://127.0.0.1:9200}"
echo "Restoring ES watermarks to their defaults: $HOST"
curl -fsS -XPUT "$HOST/_cluster/settings" -H 'Content-Type: application/json' -d '{
"transient": {
"cluster.routing.allocation.disk.watermark.low": null,
"cluster.routing.allocation.disk.watermark.high": null,
"cluster.routing.allocation.disk.watermark.flood_stage": null
}
}' && printf '\nOK\n'


@ -1,137 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
info(){ echo -e "\033[34m[INSTALL]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "Missing dependency: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
require_compose(){
if docker compose version >/dev/null 2>&1; then return 0; fi
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
}
require docker curl jq awk sed tar gzip
require_compose
[[ -f "$ENV_FILE" ]] || { err "compose/.env is missing; run scripts/config.sh first"; exit 1; }
info "Using env file: $ENV_FILE"
set -a; source "$ENV_FILE"; set +a
# Compatibility: if .env lacks SWARM_MANAGER_ADDR, read it from an existing cluster-info.env to avoid writing an empty value
SMADDR="${SWARM_MANAGER_ADDR:-}"
CI_FILE="$PKG_ROOT/cluster-info.env"
if [[ -z "$SMADDR" && -f "$CI_FILE" ]]; then
SMADDR=$(sed -n 's/^SWARM_MANAGER_ADDR=\(.*\)$/\1/p' "$CI_FILE" | head -n1)
fi
SWARM_MANAGER_ADDR="$SMADDR"
# Swarm init & overlay
if ! docker info 2>/dev/null | grep -q "Swarm: active"; then
[[ -n "${SWARM_MANAGER_ADDR:-}" ]] || { err "SWARM_MANAGER_ADDR is not set; configure it via scripts/config.sh"; exit 1; }
info "Initializing Swarm (--advertise-addr $SWARM_MANAGER_ADDR)"
docker swarm init --advertise-addr "$SWARM_MANAGER_ADDR" >/dev/null 2>&1 || true
else
info "Swarm is already active"
fi
NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
if ! docker network inspect "$NET_NAME" >/dev/null 2>&1; then
info "Creating overlay network: $NET_NAME"
docker network create -d overlay --attachable "$NET_NAME" >/dev/null
else
info "Overlay network already exists: $NET_NAME"
fi
# Load images
IMAGES_DIR="$PKG_ROOT/images"
shopt -s nullglob
tars=("$IMAGES_DIR"/*.tar.gz)
if [[ ${#tars[@]} -eq 0 ]]; then err "The images directory is empty; image tar.gz files are missing"; exit 1; fi
total=${#tars[@]}; idx=0
for tgz in "${tars[@]}"; do
idx=$((idx+1))
info "Loading image ($idx/$total): $(basename "$tgz")"
tmp=$(mktemp); gunzip -c "$tgz" > "$tmp"; docker load -i "$tmp" >/dev/null; rm -f "$tmp"
done
shopt -u nullglob
# Compose up
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
info "Starting the service stack (docker compose -p $PROJECT up -d)"
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" up -d
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps
# Wait readiness (best-effort)
code(){ curl -4 -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
prom_ok(){ (exec 3<>/dev/tcp/127.0.0.1/${PROMETHEUS_PORT:-9090}) >/dev/null 2>&1 && return 0 || return 1; }
kb_ok(){ local body; body=$(curl -s "http://127.0.0.1:${KIBANA_PORT:-5601}/api/status" || true); echo "$body" | grep -q '"level"\s*:\s*"available"'; }
RETRIES=${RETRIES:-60}; SLEEP=${SLEEP:-5}; ok=0
info "Waiting for the base services to become ready (<= $((RETRIES*SLEEP))s)"
for i in $(seq 1 "$RETRIES"); do
e1=$(code "http://127.0.0.1:${MASTER_PORT:-32300}/readyz")
e2=$(code "http://127.0.0.1:${ES_HTTP_PORT:-9200}/_cluster/health")
e3=000; prom_ok && e3=200
e4=$(code "http://127.0.0.1:${GRAFANA_PORT:-3000}/api/health")
e5=$(code "http://127.0.0.1:${ALERTMANAGER_PORT:-9093}/api/v2/status")
e6=$(kb_ok && echo 200 || echo 000)
info "[ready] t=$((i*SLEEP))s master=$e1 es=$e2 prom=$e3 graf=$e4 alert=$e5 kibana=$e6"
[[ "$e1" == 200 ]] && ok=$((ok+1))
[[ "$e2" == 200 ]] && ok=$((ok+1))
[[ "$e3" == 200 ]] && ok=$((ok+1))
[[ "$e4" == 200 ]] && ok=$((ok+1))
[[ "$e5" == 200 ]] && ok=$((ok+1))
[[ "$e6" == 200 ]] && ok=$((ok+1))
if [[ $ok -ge 6 ]]; then break; fi; ok=0; sleep "$SLEEP"
done
[[ $ok -ge 6 ]] || err "Some services are not ready (retry selfcheck later)"
# Swarm join tokens
TOKEN_WORKER=$(docker swarm join-token -q worker 2>/dev/null || echo "")
TOKEN_MANAGER=$(docker swarm join-token -q manager 2>/dev/null || echo "")
# cluster-info.env (the compose setup no longer depends on BINDIP/FTPIP)
CI="$PKG_ROOT/cluster-info.env"
info "Writing cluster-info.env (manager/token)"
{
echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}"
echo "SWARM_JOIN_TOKEN_WORKER=${TOKEN_WORKER:-}"
echo "SWARM_JOIN_TOKEN_MANAGER=${TOKEN_MANAGER:-}"
} > "$CI"
info "Wrote $CI"
# Installation report
ts=$(date +%Y%m%d-%H%M%S)
RPT="$PKG_ROOT/安装报告_${ts}.md"
{
echo "# Argus Server installation report (${ts})"
echo
echo "## Port mapping"
echo "- MASTER_PORT=${MASTER_PORT}"
echo "- ES_HTTP_PORT=${ES_HTTP_PORT}"
echo "- KIBANA_PORT=${KIBANA_PORT}"
echo "- PROMETHEUS_PORT=${PROMETHEUS_PORT}"
echo "- GRAFANA_PORT=${GRAFANA_PORT}"
echo "- ALERTMANAGER_PORT=${ALERTMANAGER_PORT}"
echo "- WEB_PROXY_PORT_8080=${WEB_PROXY_PORT_8080} ... 8085=${WEB_PROXY_PORT_8085}"
echo
echo "## Swarm/Overlay"
echo "- SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}"
echo "- NET=${NET_NAME}"
echo "- JOIN_TOKEN_WORKER=${TOKEN_WORKER:-}"
echo "- JOIN_TOKEN_MANAGER=${TOKEN_MANAGER:-}"
echo
echo "## Health checks (summary)"
echo "- master/readyz=$(code http://127.0.0.1:${MASTER_PORT:-32300}/readyz)"
echo "- es/_cluster/health=$(code http://127.0.0.1:${ES_HTTP_PORT:-9200}/_cluster/health)"
echo "- grafana/api/health=$(code http://127.0.0.1:${GRAFANA_PORT:-3000}/api/health)"
echo "- prometheus/tcp=$([[ $(prom_ok; echo $?) == 0 ]] && echo 200 || echo 000)"
echo "- alertmanager/api/v2/status=$(code http://127.0.0.1:${ALERTMANAGER_PORT:-9093}/api/v2/status)"
echo "- kibana/api/status=$([[ $(kb_ok; echo $?) == 0 ]] && echo available || echo not-ready)"
} > "$RPT"
info "Report generated: $RPT"
info "Installation complete. You can hand cluster-info.env to whoever installs Client-GPU."
docker exec argus-web-proxy nginx -t >/dev/null 2>&1 && docker exec argus-web-proxy nginx -s reload >/dev/null 2>&1 || true


@ -1,83 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
log() { echo -e "\033[0;34m[CHECK]\033[0m $*"; }
err() { echo -e "\033[0;31m[ERROR]\033[0m $*" >&2; }
ENV_FILE="$ROOT/compose/.env"; [[ -f "$ENV_FILE" ]] && set -a && source "$ENV_FILE" && set +a
wait_http() { local url="$1"; local attempts=${2:-120}; local i=1; while ((i<=attempts)); do curl -fsS "$url" >/dev/null 2>&1 && return 0; echo "[..] waiting $url ($i/$attempts)"; sleep 5; ((i++)); done; return 1; }
code_for() { curl -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
header_val() { curl -s -D - -o /dev/null "$@" | awk -F': ' 'BEGIN{IGNORECASE=1}$1=="Access-Control-Allow-Origin"{gsub("\r","",$2);print $2}'; }
LOG_DIR="$ROOT/logs"; mkdir -p "$LOG_DIR" || true
OUT_JSON="$LOG_DIR/selfcheck.json"; tmp=$(mktemp)
ok=1
log "checking overlay network"
net_ok=false
if docker network inspect "${ARGUS_OVERLAY_NET:-argus-sys-net}" >/dev/null 2>&1; then
if docker network inspect "${ARGUS_OVERLAY_NET:-argus-sys-net}" | grep -q '"Driver": "overlay"'; then net_ok=true; fi
fi
[[ "$net_ok" == true ]] || ok=0
log "checking Elasticsearch"
# Track each result so the JSON summary reports what was actually observed
es_ok=true; wait_http "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" 60 || { es_ok=false; ok=0; }
log "checking Kibana"
kb_code=$(code_for "http://localhost:${KIBANA_PORT:-5601}/api/status")
kb_ok=false
if [[ "$kb_code" == 200 ]]; then
body=$(curl -sS "http://localhost:${KIBANA_PORT:-5601}/api/status" || true)
echo "$body" | grep -q '"level"\s*:\s*"available"' && kb_ok=true
fi
[[ "$kb_ok" == true ]] || ok=0
log "checking Master"
master_ok=true; [[ $(code_for "http://localhost:${MASTER_PORT:-32300}/readyz") == 200 ]] || { master_ok=false; ok=0; }
log "checking Prometheus"
prom_ok=true; wait_http "http://localhost:${PROMETHEUS_PORT:-9090}/-/ready" 60 || { prom_ok=false; ok=0; }
log "checking Grafana"
gf_code=$(code_for "http://localhost:${GRAFANA_PORT:-3000}/api/health")
gf_ok=false; if [[ "$gf_code" == 200 ]]; then body=$(curl -sS "http://localhost:${GRAFANA_PORT:-3000}/api/health" || true); echo "$body" | grep -q '"database"\s*:\s*"ok"' && gf_ok=true; fi
[[ "$gf_ok" == true ]] || ok=0
log "checking Alertmanager"
am_ok=true; wait_http "http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status" 60 || { am_ok=false; ok=0; }
log "checking Web-Proxy (CORS)"
cors8084=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8084:-8084}/api/v2/status" || true)
cors8085=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8085:-8085}/api/v1/master/nodes" || true)
wp_ok=true
[[ -n "$cors8084" && -n "$cors8085" ]] || wp_ok=false
[[ "$wp_ok" == true ]] || ok=0
cat > "$tmp" <<JSON
{
"overlay_net": $net_ok,
"es": $es_ok,
"kibana": $kb_ok,
"master_readyz": $master_ok,
"prometheus": $prom_ok,
"grafana": $gf_ok,
"alertmanager": $am_ok,
"web_proxy_cors": $wp_ok,
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
JSON
mv "$tmp" "$OUT_JSON" 2>/dev/null || cp "$tmp" "$OUT_JSON"
if [[ "$ok" == 1 ]]; then
log "selfcheck OK -> $OUT_JSON"
exit 0
else
err "selfcheck FAILED -> $OUT_JSON"
exit 1
fi


@ -1,9 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps


@ -1,23 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
# load COMPOSE_PROJECT_NAME from env file if present
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
require_compose(){
if docker compose version >/dev/null 2>&1; then return 0; fi
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
}
require_compose
echo "[UNINSTALL] stopping compose (project=$PROJECT)"
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" down --remove-orphans || true
echo "[UNINSTALL] done"

Binary file not shown.


@ -37,11 +37,22 @@ _argus_is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
_argus_read_user_from_files() {
local uid_out_var="$1" gid_out_var="$2"; shift 2
local uid_val="$ARGUS_BUILD_UID_DEFAULT" gid_val="$ARGUS_BUILD_GID_DEFAULT"
local config
for config in "$@"; do
load_build_user() {
if [[ "$_ARGUS_BUILD_USER_LOADED" == "1" ]]; then
return 0
fi
local project_root config_files config uid gid
project_root="$(argus_project_root)"
config_files=(
"$project_root/configs/build_user.local.conf"
"$project_root/configs/build_user.conf"
)
uid="$ARGUS_BUILD_UID_DEFAULT"
gid="$ARGUS_BUILD_GID_DEFAULT"
for config in "${config_files[@]}"; do
if [[ -f "$config" ]]; then
while IFS= read -r raw_line || [[ -n "$raw_line" ]]; do
local line key value
@ -57,58 +68,42 @@ _argus_read_user_from_files() {
key="$(_argus_trim "$key")"
value="$(_argus_trim "$value")"
case "$key" in
UID) uid_val="$value" ;;
GID) gid_val="$value" ;;
*) echo "[ARGUS build_user] Unknown key '$key' in $config" >&2 ;;
UID)
uid="$value"
;;
GID)
gid="$value"
;;
*)
echo "[ARGUS build_user] Unknown key '$key' in $config" >&2
;;
esac
done < "$config"
break
fi
done
printf -v "$uid_out_var" '%s' "$uid_val"
printf -v "$gid_out_var" '%s' "$gid_val"
}
load_build_user_profile() {
local profile="${1:-default}"
if [[ "$_ARGUS_BUILD_USER_LOADED" == "1" ]]; then
return 0
if [[ -n "${ARGUS_BUILD_UID:-}" ]]; then
uid="$ARGUS_BUILD_UID"
fi
if [[ -n "${ARGUS_BUILD_GID:-}" ]]; then
gid="$ARGUS_BUILD_GID"
fi
local project_root uid gid
project_root="$(argus_project_root)"
case "$profile" in
pkg)
_argus_read_user_from_files uid gid \
"$project_root/configs/build_user.pkg.conf" \
"$project_root/configs/build_user.local.conf" \
"$project_root/configs/build_user.conf"
;;
default|*)
_argus_read_user_from_files uid gid \
"$project_root/configs/build_user.local.conf" \
"$project_root/configs/build_user.conf"
;;
esac
if [[ -n "${ARGUS_BUILD_UID:-}" ]]; then uid="$ARGUS_BUILD_UID"; fi
if [[ -n "${ARGUS_BUILD_GID:-}" ]]; then gid="$ARGUS_BUILD_GID"; fi
if ! _argus_is_number "$uid"; then
echo "[ARGUS build_user] Invalid UID '$uid'" >&2; return 1
echo "[ARGUS build_user] Invalid UID '$uid'" >&2
return 1
fi
if ! _argus_is_number "$gid"; then
echo "[ARGUS build_user] Invalid GID '$gid'" >&2; return 1
echo "[ARGUS build_user] Invalid GID '$gid'" >&2
return 1
fi
export ARGUS_BUILD_UID="$uid"
export ARGUS_BUILD_GID="$gid"
_ARGUS_BUILD_USER_LOADED=1
}
load_build_user() {
local profile="${ARGUS_BUILD_PROFILE:-default}"
load_build_user_profile "$profile"
}
argus_build_user_args() {
load_build_user
printf '%s' "--build-arg ARGUS_BUILD_UID=${ARGUS_BUILD_UID} --build-arg ARGUS_BUILD_GID=${ARGUS_BUILD_GID}"


@ -3,4 +3,3 @@ build/
__pycache__/
.env
dist/


@ -4,7 +4,6 @@ import os
import re
import socket
import subprocess
import ipaddress
from pathlib import Path
from typing import Any, Dict
@ -17,47 +16,11 @@ _HOSTNAME_PATTERN = re.compile(r"^([^-]+)-([^-]+)-([^-]+)-.*$")
def collect_metadata(config: AgentConfig) -> Dict[str, Any]:
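"""Collect the static information needed for node registration, with smarter IP selection.
Rules (highest priority first):
1) the address given by AGENT_PUBLISH_IP
2) hostname A records, if one falls within a preferred network
3) interface scan (excluding AGENT_EXCLUDE_IFACES, preferring AGENT_PREFER_NET_CIDRS)
4) default-route fallback (UDP socket trick)
Also published: overlay_ip / gwbridge_ip / interfaces, for use by the Master and for diagnostics.
"""
"""Collect the static information needed for node registration."""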
hostname = config.hostname
prefer_cidrs = _read_cidrs_env(
os.environ.get("AGENT_PREFER_NET_CIDRS", "10.0.0.0/8,172.31.0.0/16")
)
exclude_ifaces = _read_csv_env(
os.environ.get("AGENT_EXCLUDE_IFACES", "docker_gwbridge,lo")
)
# interface inventory
interfaces = _list_global_ipv4_addrs()
if exclude_ifaces:
interfaces = [it for it in interfaces if it[0] not in set(exclude_ifaces)]
# resolve hostname candidates
host_ips = _resolve_hostname_ips(hostname)
selected_ip, overlay_ip, gwbridge_ip = _select_publish_ips(
interfaces=interfaces,
host_ips=host_ips,
prefer_cidrs=prefer_cidrs,
)
meta: Dict[str, Any] = {
meta = {
"hostname": hostname,
"ip": os.environ.get("AGENT_PUBLISH_IP", selected_ip), # keep required field
"overlay_ip": overlay_ip or selected_ip,
"gwbridge_ip": gwbridge_ip,
"interfaces": [
{"iface": name, "ip": ip} for name, ip in interfaces
],
"ip": _detect_ip_address(),
"env": config.environment,
"user": config.user,
"instance": config.instance,
@ -133,7 +96,7 @@ def _detect_gpu_count() -> int:
def _detect_ip_address() -> str:
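"""Kept as the legacy interface and final fallback: default-route source address → hostname resolution → 127.0.0.1"""
"""Try to obtain the container's egress IP via a UDP socket; on failure, fall back to resolving the hostname."""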
try:
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
sock.connect(("8.8.8.8", 80))
@ -145,118 +108,3 @@ def _detect_ip_address() -> str:
except OSError:
LOGGER.warning("Unable to resolve hostname to IP; defaulting to 127.0.0.1")
return "127.0.0.1"
def _read_csv_env(raw: str | None) -> list[str]:
if not raw:
return []
return [x.strip() for x in raw.split(",") if x.strip()]
def _read_cidrs_env(raw: str | None) -> list[ipaddress.IPv4Network]:
cidrs: list[ipaddress.IPv4Network] = []
for item in _read_csv_env(raw):
try:
net = ipaddress.ip_network(item, strict=False)
if isinstance(net, (ipaddress.IPv4Network,)):
cidrs.append(net)
except ValueError:
LOGGER.warning("Ignoring invalid CIDR in AGENT_PREFER_NET_CIDRS", extra={"cidr": item})
return cidrs
def _list_global_ipv4_addrs() -> list[tuple[str, str]]:
"""List global IPv4 addresses as (iface, ip) pairs.
Relies on iproute2: ip -4 -o addr show scope global
"""
results: list[tuple[str, str]] = []
try:
proc = subprocess.run(
["sh", "-lc", "ip -4 -o addr show scope global | awk '{print $2, $4}'"],
check=False,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
timeout=3,
)
if proc.returncode == 0:
for line in proc.stdout.splitlines():
line = line.strip()
if not line:
continue
parts = line.split()
if len(parts) != 2:
continue
iface, cidr = parts
ip = cidr.split("/")[0]
try:
ipaddress.IPv4Address(ip)
except ValueError:
continue
results.append((iface, ip))
except Exception as exc: # pragma: no cover - defensive
LOGGER.debug("Failed to list interfaces", extra={"error": str(exc)})
return results
def _resolve_hostname_ips(name: str) -> list[str]:
ips: list[str] = []
try:
infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
for info in infos:
ip = info[4][0]
if ip not in ips:
ips.append(ip)
except OSError:
pass
return ips
def _pick_by_cidrs(candidates: list[str], prefer_cidrs: list[ipaddress.IPv4Network]) -> str | None:
for net in prefer_cidrs:
for ip in candidates:
try:
if ipaddress.ip_address(ip) in net:
return ip
except ValueError:
continue
return None
def _select_publish_ips(
*,
interfaces: list[tuple[str, str]],
host_ips: list[str],
prefer_cidrs: list[ipaddress.IPv4Network],
) -> tuple[str, str | None, str | None]:
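"""Return (selected_ip, overlay_ip, gwbridge_ip).
- overlay_ip: the first interface address matching prefer_cidrs (10.0/8 before 172.31/16)
- gwbridge_ip: recorded if an address in 172.22/16 exists
- selected_ip: AGENT_PUBLISH_IP first, then overlay_ip, then a preferred hostname A record, then the default-route fallback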
"""
# detect gwbridge (172.22/16)
gwbridge_net = ipaddress.ip_network("172.22.0.0/16")
gwbridge_ip = None
for _, ip in interfaces:
try:
if ipaddress.ip_address(ip) in gwbridge_net:
gwbridge_ip = ip
break
except ValueError:
continue
# overlay candidate from interfaces by prefer cidrs
iface_ips = [ip for _, ip in interfaces]
overlay_ip = _pick_by_cidrs(iface_ips, prefer_cidrs)
# hostname A records filtered by prefer cidrs
host_pref = _pick_by_cidrs(host_ips, prefer_cidrs)
env_ip = os.environ.get("AGENT_PUBLISH_IP")
if env_ip:
selected = env_ip
else:
selected = overlay_ip or host_pref or _detect_ip_address()
return selected, overlay_ip, gwbridge_ip

BIN
src/agent/dist/argus-agent vendored Executable file

Binary file not shown.


@ -31,31 +31,26 @@ RUN mkdir -p /usr/share/alertmanager && \
rm -rf /alertmanager && \
ln -s ${ALERTMANAGER_BASE_PATH} /alertmanager
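# Ensure the ubuntu account exists and uses ARGUS_BUILD_UID/GID
# Create the alertmanager user (UID/GID customizable)
# Create the alertmanager group
RUN set -eux; \
# Ensure a group with the target GID exists; if not, first try to change the ubuntu group to that GID, otherwise create a new group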
if getent group "${ARGUS_BUILD_GID}" >/dev/null; then \
:; \
else \
if getent group ubuntu >/dev/null; then \
groupmod -g "${ARGUS_BUILD_GID}" ubuntu || true; \
else \
groupadd -g "${ARGUS_BUILD_GID}" ubuntu || groupadd -g "${ARGUS_BUILD_GID}" argus || true; \
fi; \
# Ensure the target GID exists; if it is already taken, use that GID as-is (any group name)\
if ! getent group "${ARGUS_BUILD_GID}" >/dev/null; then \
groupadd -g "${ARGUS_BUILD_GID}" alertmanager || true; \
fi; \
# Create or adjust the ubuntu user
if id ubuntu >/dev/null 2>&1; then \
# Set the primary group to the target GID (a numeric GID is accepted)
usermod -g "${ARGUS_BUILD_GID}" ubuntu || true; \
# If the target UID is not taken, update the ubuntu user's UID
if [ "$(id -u ubuntu)" != "${ARGUS_BUILD_UID}" ] && ! getent passwd "${ARGUS_BUILD_UID}" >/dev/null; then \
usermod -u "${ARGUS_BUILD_UID}" ubuntu || true; \
fi; \
# Ensure the alertmanager user exists; if the UID is already taken, skip it and keep the user that owns that UID
if ! id alertmanager >/dev/null 2>&1; then \
if getent passwd "${ARGUS_BUILD_UID}" >/dev/null; then \
# UID already taken: create the user without specifying a UID (avoids the conflict; only guarantees the user exists)
useradd -M -s /usr/sbin/nologin -g "${ARGUS_BUILD_GID}" alertmanager || true; \
else \
useradd -M -s /usr/sbin/nologin -u "${ARGUS_BUILD_UID}" -g "${ARGUS_BUILD_GID}" alertmanager || true; \
fi; \
else \
useradd -m -s /bin/bash -u "${ARGUS_BUILD_UID}" -g "${ARGUS_BUILD_GID}" ubuntu || true; \
fi; \
# Change ownership of the key directories to the ubuntu UID/GID
chown -R "${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}" /usr/share/alertmanager /alertmanager ${ALERTMANAGER_BASE_PATH} /private/argus/etc /usr/local/bin || true
usermod -g "${ARGUS_BUILD_GID}" alertmanager || true; \
fi
RUN chown -R "${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}" /usr/share/alertmanager /alertmanager ${ALERTMANAGER_BASE_PATH} /private/argus/etc /usr/local/bin || true
# Configure the intranet apt source (if the intranet option is set)
RUN if [ "$USE_INTRANET" = "true" ]; then \


@ -1,5 +1,5 @@
DATA_ROOT=/home/argus/tmp/private/argus
ARGUS_BUILD_UID=1048
ARGUS_BUILD_GID=1048
ARGUS_UID=1048
ARGUS_GID=1048
USE_INTRANET=false
USE_INTRANET=false

View File

@ -0,0 +1,19 @@
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'instance']  # Grouping: alerts with the same alertname + instance are merged
  group_wait: 30s       # After the first alert, wait 30s for other alerts in the same group
  group_interval: 5m    # After a group changes, wait at least 5 minutes before sending again
  repeat_interval: 3h   # Re-send the same alert every 3 hours
  receiver: 'null'
receivers:
  - name: 'null'
inhibit_rules:
  - source_match:
      severity: 'critical'   # While a critical alert is present
    target_match:
      severity: 'warning'    # suppress warning alerts for the same instance
    equal: ['instance']
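
As a rough illustration of what the inhibit rule above expresses, the following Python sketch filters a list of alerts: a warning is dropped when a critical alert with the same `instance` label exists. The function name `apply_inhibition` and the flat alert dictionaries are hypothetical. This is a simplified model of the intended behavior, not Alertmanager's implementation (for example, it does not model alerts that lack the `instance` label).

```python
from typing import Dict, List

Alert = Dict[str, str]  # hypothetical: a flat mapping of label name -> value


def apply_inhibition(alerts: List[Alert]) -> List[Alert]:
    """Drop warning alerts whose instance also has a critical alert."""
    # Instances that currently have at least one critical alert (the "source").
    critical_instances = {
        a["instance"]
        for a in alerts
        if a.get("severity") == "critical" and "instance" in a
    }
    kept = []
    for a in alerts:
        is_target = a.get("severity") == "warning"
        if is_target and a.get("instance") in critical_instances:
            continue  # inhibited: the same instance already has a critical alert
        kept.append(a)
    return kept


alerts = [
    {"alertname": "NodeDown", "instance": "node-1", "severity": "critical"},
    {"alertname": "HighCPU", "instance": "node-1", "severity": "warning"},
    {"alertname": "HighCPU", "instance": "node-2", "severity": "warning"},
]
remaining = apply_inhibition(alerts)
print([(a["alertname"], a["instance"]) for a in remaining])
```

The sample labels mirror the NodeDown/HighCPU test alerts sent by the scripts in this change; the node-2 entry is added only to show a warning that is not suppressed.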

View File

@ -0,0 +1 @@
172.18.0.2

View File

@ -0,0 +1,19 @@
#!/usr/bin/env bash
set -euo pipefail
root="$(cd "$(dirname "${BASH_SOURCE[0]}")/../" && pwd)"
project_root="$(cd "$root/../../.." && pwd)"
source "$project_root/scripts/common/build_user.sh"
load_build_user
# Create the new private directory structure (based on the argus layout)
echo "[INFO] Creating private directory structure for supervisor-based containers..."
mkdir -p "$root/private/argus/alert/alertmanager"
mkdir -p "$root/private/argus/etc/"
# Set permissions on the data directories
echo "[INFO] Setting permissions for data directories..."
chown -R "${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}" "$root/private/argus/alert/alertmanager" 2>/dev/null || true
chown -R "${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}" "$root/private/argus/etc" 2>/dev/null || true
echo "[INFO] Supervisor-based containers will manage their own scripts and configurations"

View File

@ -0,0 +1,10 @@
#!/usr/bin/env bash
set -euo pipefail
cd "$(dirname "$0")/.."
compose_cmd="docker compose"
if ! $compose_cmd version >/dev/null 2>&1; then
if command -v docker-compose >/dev/null 2>&1; then compose_cmd="docker-compose"; else
echo "Docker Compose is required; please install it and retry" >&2; exit 1; fi
fi
$compose_cmd -p alert-mvp up -d --remove-orphans
echo "[OK] Services started. Alertmanager: http://localhost:9093"

View File

@ -0,0 +1,106 @@
#!/bin/bash
set -euo pipefail
# ==========================================================
# Alertmanager test script
# ==========================================================
ALERTMANAGER_URL="http://localhost:9093"
TEST_ALERT_NAME_CRITICAL="NodeDown"
TEST_ALERT_NAME_WARNING="HighCPU"
TMP_LOG="/tmp/test-alertmanager.log"
# Wait parameters
am_wait_attempts=30
am_wait_interval=2
GREEN="\033[1;32m"
RED="\033[1;31m"
YELLOW="\033[1;33m"
RESET="\033[0m"
# ==========================================================
# Function definitions
# ==========================================================
wait_for_alertmanager() {
local attempt=1
echo "[INFO] Waiting for Alertmanager to start..."
while (( attempt <= am_wait_attempts )); do
if curl -fsS "${ALERTMANAGER_URL}/api/v2/status" >/dev/null 2>&1; then
echo -e "${GREEN}[OK] Alertmanager is ready (attempt=${attempt}/${am_wait_attempts})${RESET}"
return 0
fi
echo "[..] Alertmanager not ready yet (${attempt}/${am_wait_attempts})"
sleep "${am_wait_interval}"
(( attempt++ ))
done
echo -e "${RED}[ERROR] Alertmanager still not ready after ${am_wait_attempts} attempts${RESET}"
return 1
}
log_step() {
echo -e "${YELLOW}==== $1 ====${RESET}"
}
# ==========================================================
# Main flow
# ==========================================================
log_step "Alertmanager test started"
echo "[INFO] Alertmanager URL: $ALERTMANAGER_URL"
# Step 1: wait for Alertmanager to start
wait_for_alertmanager
# Step 2: fire a critical test alert
echo "[INFO] Sending critical test alert..."
# Test curl directly: under `set -e` a separate `$?` check would never run on failure
if curl -fsS -X POST "${ALERTMANAGER_URL}/api/v2/alerts" \
-H "Content-Type: application/json" \
-d '[
{
"labels": {
"alertname": "'"${TEST_ALERT_NAME_CRITICAL}"'",
"instance": "node-1",
"severity": "critical"
},
"annotations": {
"summary": "Node node-1 is down"
}
}
]' \
-o "$TMP_LOG"; then
echo -e "${GREEN}[OK] Critical test alert sent successfully${RESET}"
else
echo -e "${RED}[ERROR] Failed to send critical alert${RESET}"
cat "$TMP_LOG" 2>/dev/null || true
exit 1
fi
# Step 3: fire a warning test alert
echo "[INFO] Sending warning test alert..."
# Test curl directly: under `set -e` a separate `$?` check would never run on failure
if curl -fsS -X POST "${ALERTMANAGER_URL}/api/v2/alerts" \
-H "Content-Type: application/json" \
-d '[
{
"labels": {
"alertname": "'"${TEST_ALERT_NAME_WARNING}"'",
"instance": "node-1",
"severity": "warning"
},
"annotations": {
"summary": "CPU usage on node node-1 is too high"
}
}
]' \
-o "$TMP_LOG"; then
echo -e "${GREEN}[OK] Warning test alert sent successfully${RESET}"
else
echo -e "${RED}[ERROR] Failed to send warning alert${RESET}"
cat "$TMP_LOG" 2>/dev/null || true
exit 1
fi

View File

@ -0,0 +1,71 @@
#!/bin/bash
set -euo pipefail
# ==========================================================
# Alertmanager test script (with startup wait)
# ==========================================================
ALERTMANAGER_URL="http://localhost:9093"
TEST_ALERT_NAME_CRITICAL="NodeDown"
TEST_ALERT_NAME_WARNING="HighCPU"
TMP_LOG="/tmp/test-alertmanager.log"
# Wait parameters
am_wait_attempts=30
am_wait_interval=2
GREEN="\033[1;32m"
RED="\033[1;31m"
YELLOW="\033[1;33m"
RESET="\033[0m"
# ==========================================================
# Function definitions
# ==========================================================
wait_for_alertmanager() {
local attempt=1
echo "[INFO] Waiting for Alertmanager to start..."
while (( attempt <= am_wait_attempts )); do
if curl -fsS "${ALERTMANAGER_URL}/api/v2/status" >/dev/null 2>&1; then
echo -e "${GREEN}[OK] Alertmanager is ready (attempt=${attempt}/${am_wait_attempts})${RESET}"
return 0
fi
echo "[..] Alertmanager not ready yet (${attempt}/${am_wait_attempts})"
sleep "${am_wait_interval}"
(( attempt++ ))
done
echo -e "${RED}[ERROR] Alertmanager still not ready after ${am_wait_attempts} attempts${RESET}"
return 1
}
log_step() {
echo -e "${YELLOW}==== $1 ====${RESET}"
}
# ==========================================================
# Main flow
# ==========================================================
log_step "Querying current Alertmanager alerts"
echo "[INFO] Alertmanager URL: $ALERTMANAGER_URL"
# Step 1: wait for Alertmanager to start
wait_for_alertmanager
# Step 2: query the current alert list
echo "[INFO] Querying current alerts..."
sleep 1
curl -fsS "${ALERTMANAGER_URL}/api/v2/alerts" | jq '.' || {
echo -e "${RED}[WARN] Could not parse the returned JSON; check that jq is installed${RESET}"
curl -s "${ALERTMANAGER_URL}/api/v2/alerts"
}
# Step 3: check that the alerts include NodeDown
if curl -fsS "${ALERTMANAGER_URL}/api/v2/alerts" | grep -q "${TEST_ALERT_NAME_CRITICAL}"; then
echo -e "${GREEN}✅ Test passed: Alertmanager received alert ${TEST_ALERT_NAME_CRITICAL}${RESET}"
else
echo -e "${RED}❌ Test failed: alert ${TEST_ALERT_NAME_CRITICAL} not found${RESET}"
# Return non-zero so callers (e.g. the end-to-end script) notice the failure
exit 1
fi
log_step "Test finished"
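
The two test scripts share the same pattern: poll until the service answers, then check that a named alert appears in the response. A minimal Python sketch of that pattern follows, assuming the probe is injected as a callable so it can run without a live Alertmanager. The names `wait_until` and `has_alert` are hypothetical, and `has_alert` assumes the response is a JSON array of objects that each carry a `labels` mapping, rather than matching on raw text as the `grep` above does.

```python
import time
from typing import Callable, Dict, List


def wait_until(
    probe: Callable[[], bool],
    attempts: int = 30,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe up to `attempts` times, sleeping `interval` between failures."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(interval)
    return False


def has_alert(alerts: List[Dict], name: str) -> bool:
    """True if any alert in the list has labels.alertname equal to `name`."""
    return any(a.get("labels", {}).get("alertname") == name for a in alerts)


# Stand-in probe that succeeds on the third call (no network involved).
calls = {"n": 0}


def fake_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until(fake_probe, attempts=5, interval=0.0, sleep=lambda _s: None)
sample = [{"labels": {"alertname": "NodeDown", "instance": "node-1", "severity": "critical"}}]
found = has_alert(sample, "NodeDown")
print(ready, calls["n"], found)
```

Checking the parsed `labels` avoids a false positive when the alert name happens to appear in an annotation or in another alert's text.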

View File

@ -0,0 +1,21 @@
#!/usr/bin/env bash
set -euo pipefail
cd "$(dirname "$0")/.."
compose_cmd="docker compose"
if ! $compose_cmd version >/dev/null 2>&1; then
if command -v docker-compose >/dev/null 2>&1; then compose_cmd="docker-compose"; else
echo "Docker Compose is required; please install it and retry" >&2; exit 1; fi
fi
$compose_cmd -p alert-mvp down
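echo "[OK] All containers stopped"
# Clean up the private directory
echo "[INFO] Cleaning up the private directory..."
# Already in the module root from the cd at the top of the script; repeating
# the relative cd could land in the wrong directory before rm -rf
if [ -d "private" ]; then
# Remove the private directory and everything in it
rm -rf private
echo "[OK] private directory removed"
else
echo "[INFO] private directory does not exist; nothing to clean"
fi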

View File

@ -0,0 +1,105 @@
#!/usr/bin/env bash
set -euo pipefail
echo "======================================="
echo "ARGUS Alert System End-to-End Test"
echo "======================================="
echo ""
# Record the test start time
test_start_time=$(date +%s)
# Function: wait for services to be ready
wait_for_services() {
echo "[INFO] Waiting for all services to be ready..."
local max_attempts=${SERVICE_WAIT_ATTEMPTS:-120}
local attempt=1
while [ $attempt -le $max_attempts ]; do
if curl -fs http://localhost:9093/api/v2/status >/dev/null 2>&1; then
echo "[OK] All services are ready!"
return 0
fi
echo " Waiting for services... ($attempt/$max_attempts)"
sleep 5
((attempt++))
done
echo "[ERROR] Services not ready after $max_attempts attempts"
return 1
}
# Function: show a test step
show_step() {
echo ""
echo "🔄 Step $1: $2"
echo "----------------------------------------"
}
# Function: verify the result of a step
verify_step() {
if [ $? -eq 0 ]; then
echo "$1 - SUCCESS"
else
echo "$1 - FAILED"
exit 1
fi
}
# Start the end-to-end test
show_step "1" "Bootstrap - Initialize environment"
./scripts/01_bootstrap.sh
verify_step "Bootstrap"
show_step "2" "Startup - Start all services"
./scripts/02_up.sh
verify_step "Service startup"
# Wait until the services are fully ready
wait_for_services || exit 1
# Send alert data
show_step "3" "Add alerts - Send test alerts to Alertmanager"
./scripts/03_alertmanager_add_alert.sh
verify_step "Send test alerts"
# Query alert data
show_step "4" "Verify data - Query Alertmanager"
./scripts/04_query_alerts.sh
verify_step "Data verification"
# Check service health
show_step "Health" "Check service health"
echo "[INFO] Checking service health..."
# Check Alertmanager status
if curl -fs "http://localhost:9093/api/v2/status" >/dev/null 2>&1; then
am_status="available"
echo "✅ Alertmanager status: $am_status"
else
am_status="unavailable"
echo "⚠️ Alertmanager status: $am_status"
fi
verify_step "Service health check"
# Clean up the environment
show_step "5" "Cleanup - Stop all services"
./scripts/05_down.sh
verify_step "Service cleanup"
# Compute the total test time
test_end_time=$(date +%s)
total_time=$((test_end_time - test_start_time))
echo ""
echo "======================================="
echo "🎉 END-TO-END TEST COMPLETED SUCCESSFULLY!"
echo "======================================="
echo "📊 Test Summary:"
echo " • Total time: ${total_time}s"
echo " • Alertmanager status: $am_status"
echo " • All services started and stopped successfully"
echo ""
echo "✅ The ARGUS Alert system is working correctly!"
echo ""

View File

@ -1,113 +0,0 @@
#!/bin/bash
# verify_alertmanager.sh
# Verifies after deployment that the Prometheus-to-Alertmanager communication path works
set -euo pipefail
#=============================
# Basic configuration
#=============================
PROM_URL="${PROM_URL:-http://prom.metric.argus.com:9090}"
ALERT_URL="${ALERT_URL:-http://alertmanager.alert.argus.com:9093}"
# TODO: adjust the rules directory to match the actual deployment environment
DATA_ROOT="${DATA_ROOT:-/private/argus}"
RULE_DIR="$DATA_ROOT/metric/prometheus/rules"
TMP_RULE="/tmp/test_rule.yml"
#=============================
# Helper functions
#=============================
GREEN="\033[32m"; RED="\033[31m"; YELLOW="\033[33m"; RESET="\033[0m"
log_info() { echo -e "${YELLOW}[INFO]${RESET} $1"; }
log_success() { echo -e "${GREEN}[OK]${RESET} $1"; }
log_error() { echo -e "${RED}[ERROR]${RESET} $1"; }
fail_exit() { log_error "$1"; exit 1; }
#=============================
# Step 1: check that Alertmanager is reachable
#=============================
log_info "Checking Alertmanager status..."
if curl -sSf "${ALERT_URL}/api/v2/status" >/dev/null 2>&1; then
log_success "Alertmanager is up (${ALERT_URL})"
else
fail_exit "Cannot reach Alertmanager; check the port mapping and container status."
fi
#=============================
# Step 2: send a test alert manually
#=============================
log_info "Sending manual test alert..."
curl -s -XPOST "${ALERT_URL}/api/v2/alerts" -H "Content-Type: application/json" -d '[
{
"labels": {
"alertname": "ManualTestAlert",
"severity": "info"
},
"annotations": {
"summary": "This is a test alert from deploy verification"
},
"startsAt": "'$(date -Iseconds)'"
}
]' >/dev/null && log_success "Test alert sent to Alertmanager"
#=============================
# Step 3: check that the Prometheus config includes Alertmanager
#=============================
log_info "Checking whether Prometheus is configured with Alertmanager..."
if curl -s "${PROM_URL}/api/v1/status/config" | grep -q "alertmanagers"; then
log_success "Prometheus has an Alertmanager target configured"
else
fail_exit "Prometheus has no Alertmanager configured; check prometheus.yml"
fi
#=============================
# Step 4: create and load a test alert rule
#=============================
log_info "Creating temporary test rule ${TMP_RULE} ..."
cat <<EOF > "${TMP_RULE}"
groups:
  - name: deploy-verify-group
    rules:
      - alert: DeployVerifyAlert
        expr: vector(1)
        labels:
          severity: warning
        annotations:
          summary: "Deployment verification alert"
EOF
mkdir -p "${RULE_DIR}"
cp "${TMP_RULE}" "${RULE_DIR}/test_rule.yml"
log_info "Reloading Prometheus to load the new rule..."
if curl -s -X POST "${PROM_URL}/-/reload" >/dev/null; then
log_success "Prometheus reloaded its rules"
else
fail_exit "Prometheus reload failed; check that the API is reachable."
fi
#=============================
# Step 5: wait and verify that Alertmanager received the alert
#=============================
log_info "Waiting for the alert to fire (about 5 seconds)..."
sleep 5
if curl -s "${ALERT_URL}/api/v2/alerts" | grep -q "DeployVerifyAlert"; then
log_success "Prometheus → Alertmanager alert path verified"
else
fail_exit "DeployVerifyAlert not found in Alertmanager; check the network or configuration."
fi
#=============================
# Step 6: clean up the test rule
#=============================
log_info "Cleaning up the temporary test rule..."
rm -f "${RULE_DIR}/test_rule.yml" "${TMP_RULE}"
curl -s -X POST "${PROM_URL}/-/reload" >/dev/null \
&& log_success "Prometheus verification rule removed" \
|| log_error "Prometheus reload after cleanup failed; please confirm manually."
log_success "Deployment verification passed: Prometheus ↔ Alertmanager communication is working."

View File

@ -1 +0,0 @@
.build*/

View File

@ -1,33 +0,0 @@
FROM ubuntu:22.04
ARG ARGUS_BUILD_UID=2133
ARG ARGUS_BUILD_GID=2015
ENV DEBIAN_FRONTEND=noninteractive \
TZ=Asia/Shanghai \
ARGUS_LOGS_WORLD_WRITABLE=1
RUN set -eux; \
apt-get update; \
apt-get install -y --no-install-recommends \
ca-certificates curl wget iproute2 iputils-ping net-tools jq tzdata \
cron procps supervisor vim less tar gzip python3; \
rm -rf /var/lib/apt/lists/*; \
ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
WORKDIR /
# Offline fluent-bit assets and bundle tarball are staged by the build script
COPY node-bootstrap.sh /usr/local/bin/node-bootstrap.sh
COPY health-watcher.sh /usr/local/bin/health-watcher.sh
COPY private/start-fluent-bit.sh /private/start-fluent-bit.sh
COPY private/etc /private/etc
COPY private/packages /private/packages
COPY bundle/ /bundle/
RUN chmod +x /usr/local/bin/node-bootstrap.sh /usr/local/bin/health-watcher.sh /private/start-fluent-bit.sh || true; \
mkdir -p /logs/train /logs/infer /buffers /opt/argus-metric; \
if [ "${ARGUS_LOGS_WORLD_WRITABLE}" = "1" ]; then chmod 1777 /logs/train /logs/infer || true; else chmod 755 /logs/train /logs/infer || true; fi; \
chmod 770 /buffers || true
ENTRYPOINT ["/usr/local/bin/node-bootstrap.sh"]

View File

@ -1,59 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# health-watcher.sh (CPU node bundle)
# Periodically runs check_health.sh and restart_unhealthy.sh for self-healing inside the node container.
INSTALL_ROOT="/opt/argus-metric"
INTERVAL="${HEALTH_WATCH_INTERVAL:-60}"
VER_DIR="${1:-}"
log(){ echo "[HEALTH-WATCHER] $*"; }
resolve_ver_dir() {
local dir=""
if [[ -n "${VER_DIR:-}" && -d "$VER_DIR" ]]; then
dir="$VER_DIR"
elif [[ -L "$INSTALL_ROOT/current" ]]; then
dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
fi
if [[ -z "$dir" ]]; then
dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
fi
echo "$dir"
}
main() {
log "starting with interval=${INTERVAL}s"
local dir
dir="$(resolve_ver_dir)"
if [[ -z "$dir" || ! -d "$dir" ]]; then
log "no valid install dir found under $INSTALL_ROOT; exiting"
exit 0
fi
local chk="$dir/check_health.sh"
local rst="$dir/restart_unhealthy.sh"
if [[ ! -x "$chk" && ! -x "$rst" ]]; then
log "neither check_health.sh nor restart_unhealthy.sh is executable under $dir; exiting"
exit 0
fi
log "watching install dir: $dir"
while :; do
if [[ -x "$chk" ]]; then
log "running check_health.sh"
"$chk" >> "$dir/.health_check.watch.log" 2>&1 || log "check_health.sh reported issues (see .health_check.watch.log)"
fi
if [[ -x "$rst" ]]; then
log "running restart_unhealthy.sh"
"$rst" >> "$dir/.restart.watch.log" 2>&1 || log "restart_unhealthy.sh reported issues (see .restart.watch.log)"
fi
sleep "$INTERVAL"
done
}
main "$@"
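
The `resolve_ver_dir` function above picks the install directory in a fixed order: an explicit argument, then the `current` symlink, then the highest entry under `versions/`. The Python sketch below models that order under stated assumptions. The `version_key` helper is a hypothetical natural-order key that only approximates `sort -V`, and, unlike the shell glob, `os.listdir` also returns dot-prefixed entries.

```python
import os
import re
import tempfile
from typing import List, Optional, Tuple


def version_key(name: str) -> List[Tuple[int, int, str]]:
    """Natural-order key: digit runs compare numerically, other runs as text."""
    parts = re.findall(r"\d+|\D+", name)
    return [(0, int(p), "") if p.isdigit() else (1, 0, p) for p in parts]


def resolve_ver_dir(install_root: str, explicit: str = "") -> Optional[str]:
    # 1) an explicit directory argument, if it exists
    if explicit and os.path.isdir(explicit):
        return explicit
    # 2) the "current" symlink, resolved to its target
    current = os.path.join(install_root, "current")
    if os.path.islink(current):
        return os.path.realpath(current)
    # 3) the highest entry under versions/, by natural version order
    versions = os.path.join(install_root, "versions")
    if os.path.isdir(versions):
        names = sorted(os.listdir(versions), key=version_key)
        if names:
            return os.path.join(versions, names[-1])
    return None


with tempfile.TemporaryDirectory() as root:
    for v in ("1.9.0", "1.10.0", "1.2.0"):
        os.makedirs(os.path.join(root, "versions", v))
    latest = resolve_ver_dir(root)
    missing = resolve_ver_dir(os.path.join(root, "does-not-exist"))

print(os.path.basename(latest), missing)
```

A plain lexical sort would rank 1.9.0 above 1.10.0, which is why the script uses `sort -V` and the sketch uses a numeric-aware key.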

View File

@ -1,131 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
echo "[BOOT] CPU node bundle starting"
INSTALL_ROOT="/opt/argus-metric"
BUNDLE_DIR="/bundle"
STATE_DIR_BASE="/private/argus/agent"
mkdir -p "$INSTALL_ROOT" "$STATE_DIR_BASE" /logs/train /logs/infer /buffers || true
# Ensure world-writable logs dir with sticky bit (align with deployment_new policy)
if [[ "${ARGUS_LOGS_WORLD_WRITABLE:-1}" == "1" ]]; then
chmod 1777 /logs/train /logs/infer || true
else
chmod 755 /logs/train /logs/infer || true
fi
chmod 770 /buffers || true
installed_ok=0
# 1) already installed?
if [[ -L "$INSTALL_ROOT/current" && -d "$INSTALL_ROOT/current" ]]; then
echo "[BOOT] client already installed at $INSTALL_ROOT/current"
else
# 2) try local bundle first (argus-metric_*.tar.gz)
tarball=$(ls -1 "$BUNDLE_DIR"/argus-metric_*.tar.gz 2>/dev/null | head -1 || true)
if [[ -n "${tarball:-}" ]]; then
echo "[BOOT] installing from local bundle: $(basename "$tarball")"
tmp=$(mktemp -d)
tar -xzf "$tarball" -C "$tmp"
# locate root containing version.json
root="$tmp"
if [[ ! -f "$root/version.json" ]]; then
sub=$(find "$tmp" -mindepth 1 -maxdepth 1 -type d | head -n1 || true)
[[ -n "$sub" && -f "$sub/version.json" ]] && root="$sub"
fi
if [[ ! -f "$root/version.json" ]]; then
echo "[BOOT][WARN] version.json not found in bundle; fallback to FTP"
else
ver=$(sed -n 's/.*"version"\s*:\s*"\([^"]\+\)".*/\1/p' "$root/version.json" | head -n1)
if [[ -z "$ver" ]]; then
echo "[BOOT][WARN] failed to parse version from version.json; fallback to FTP"
else
target_root="$INSTALL_ROOT"
version_dir="$target_root/versions/$ver"
mkdir -p "$version_dir"
shopt -s dotglob
mv "$root"/* "$version_dir/" 2>/dev/null || true
shopt -u dotglob
if [[ -f "$version_dir/install.sh" ]]; then
chmod +x "$version_dir/install.sh" 2>/dev/null || true
(
export AUTO_START_DCGM="0" # N/A on CPU
cd "$version_dir" && ./install.sh "$version_dir"
)
echo "$ver" > "$target_root/LATEST_VERSION" 2>/dev/null || true
ln -sfn "$version_dir" "$target_root/current" 2>/dev/null || true
if [[ -L "$target_root/current" && -d "$target_root/current" ]]; then
installed_ok=1
echo "[BOOT] local bundle install OK: version=$ver"
else
echo "[BOOT][WARN] current symlink not present after install; will rely on healthcheck to confirm"
fi
else
echo "[BOOT][WARN] install.sh missing under $version_dir; fallback to FTP"
fi
fi
fi
fi
# 3) fallback: use FTP setup if not installed
if [[ ! -L "$INSTALL_ROOT/current" && "$installed_ok" -eq 0 ]]; then
echo "[BOOT] fallback to FTP setup"
if [[ -z "${FTPIP:-}" || -z "${FTP_USER:-}" || -z "${FTP_PASSWORD:-}" ]]; then
echo "[BOOT][ERROR] FTP variables not set (FTPIP/FTP_USER/FTP_PASSWORD)" >&2
exit 1
fi
curl -u "$FTP_USER:$FTP_PASSWORD" -fsSL "ftp://$FTPIP:21/setup.sh" -o /tmp/setup.sh
chmod +x /tmp/setup.sh
/tmp/setup.sh --server "$FTPIP" --user "$FTP_USER" --password "$FTP_PASSWORD" --port 21
fi
fi
# 4) ensure argus-agent is running (best-effort)
if ! pgrep -x argus-agent >/dev/null 2>&1; then
echo "[BOOT] starting argus-agent (not detected)"
setsid /usr/local/bin/argus-agent >/var/log/argus-agent.log 2>&1 < /dev/null &
fi
# 5) post-install selfcheck and state
ver_dir=""
if [[ -L "$INSTALL_ROOT/current" ]]; then
ver_dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
fi
if [[ -z "$ver_dir" ]]; then
ver_dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
fi
if [[ -n "$ver_dir" && -x "$ver_dir/check_health.sh" ]]; then
echo "[BOOT] running initial health check: $ver_dir/check_health.sh"
if "$ver_dir/check_health.sh" >> "$ver_dir/.health_check.init.log" 2>&1; then
echo "[BOOT] initial health check completed (see $ver_dir/.health_check.init.log)"
else
echo "[BOOT][WARN] initial health check reported issues (see $ver_dir/.health_check.init.log)"
fi
else
echo "[BOOT][WARN] initial health check skipped (script missing: $ver_dir/check_health.sh)"
fi
host="$(hostname)"
state_dir="$STATE_DIR_BASE/${host}"
mkdir -p "$state_dir" 2>/dev/null || true
for i in {1..60}; do
if [[ -s "$state_dir/node.json" ]]; then
echo "[BOOT] node state present: $state_dir/node.json"
break
fi
sleep 2
done
# 6) spawn health watcher (best-effort, non-blocking)
if command -v /usr/local/bin/health-watcher.sh >/dev/null 2>&1; then
echo "[BOOT] starting health watcher for $ver_dir"
setsid /usr/local/bin/health-watcher.sh "${ver_dir:-}" >/var/log/health-watcher.log 2>&1 < /dev/null || true &
else
echo "[BOOT][WARN] health-watcher.sh not found; skip health watcher"
fi
echo "[BOOT] ready; entering sleep"
exec sleep infinity

View File

@ -1 +0,0 @@
.build*/

View File

@ -1,44 +0,0 @@
ARG CUDA_VER=12.2.2
FROM nvidia/cuda:${CUDA_VER}-runtime-ubuntu22.04
ARG CLIENT_VER=0.0.0
ARG BUNDLE_DATE=00000000
LABEL org.opencontainers.image.title="argus-sys-metric-test-node-bundle-gpu" \
org.opencontainers.image.description="GPU node bundle with embedded Argus client artifact" \
org.opencontainers.image.version="${CLIENT_VER}" \
org.opencontainers.image.revision_date="${BUNDLE_DATE}" \
maintainer="Argus"
ENV DEBIAN_FRONTEND=noninteractive \
TZ=Asia/Shanghai \
ARGUS_LOGS_WORLD_WRITABLE=1 \
ES_HOST=es.log.argus.com \
ES_PORT=9200 \
CLUSTER=local \
RACK=dev
RUN set -eux; \
apt-get update; \
apt-get install -y --no-install-recommends \
ca-certificates curl wget iproute2 iputils-ping net-tools jq tzdata cron procps vim less \
tar gzip; \
rm -rf /var/lib/apt/lists/*; \
ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
WORKDIR /
# Expect staged build context to provide these directories/files
COPY bundle/ /bundle/
COPY node-bootstrap.sh /usr/local/bin/node-bootstrap.sh
COPY health-watcher.sh /usr/local/bin/health-watcher.sh
COPY private/start-fluent-bit.sh /private/start-fluent-bit.sh
COPY private/etc /private/etc
COPY private/packages /private/packages
RUN chmod +x /usr/local/bin/node-bootstrap.sh /usr/local/bin/health-watcher.sh /private/start-fluent-bit.sh || true; \
mkdir -p /logs/train /logs/infer /buffers /opt/argus-metric; \
chmod 1777 /logs/train /logs/infer || true; \
chmod 770 /buffers || true
ENTRYPOINT ["/usr/local/bin/node-bootstrap.sh"]

View File

@ -1,59 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# health-watcher.sh (GPU bundle)
# Periodically runs check_health.sh and restart_unhealthy.sh for self-healing inside the GPU node container.
INSTALL_ROOT="/opt/argus-metric"
INTERVAL="${HEALTH_WATCH_INTERVAL:-60}"
VER_DIR="${1:-}"
log(){ echo "[HEALTH-WATCHER] $*"; }
resolve_ver_dir() {
local dir=""
if [[ -n "${VER_DIR:-}" && -d "$VER_DIR" ]]; then
dir="$VER_DIR"
elif [[ -L "$INSTALL_ROOT/current" ]]; then
dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
fi
if [[ -z "$dir" ]]; then
dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
fi
echo "$dir"
}
main() {
log "starting with interval=${INTERVAL}s"
local dir
dir="$(resolve_ver_dir)"
if [[ -z "$dir" || ! -d "$dir" ]]; then
log "no valid install dir found under $INSTALL_ROOT; exiting"
exit 0
fi
local chk="$dir/check_health.sh"
local rst="$dir/restart_unhealthy.sh"
if [[ ! -x "$chk" && ! -x "$rst" ]]; then
log "neither check_health.sh nor restart_unhealthy.sh is executable under $dir; exiting"
exit 0
fi
log "watching install dir: $dir"
while :; do
if [[ -x "$chk" ]]; then
log "running check_health.sh"
"$chk" >> "$dir/.health_check.watch.log" 2>&1 || log "check_health.sh reported issues (see .health_check.watch.log)"
fi
if [[ -x "$rst" ]]; then
log "running restart_unhealthy.sh"
"$rst" >> "$dir/.restart.watch.log" 2>&1 || log "restart_unhealthy.sh reported issues (see .restart.watch.log)"
fi
sleep "$INTERVAL"
done
}
main "$@"

View File

@ -1,135 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
echo "[BOOT] GPU node bundle starting"
INSTALL_ROOT="/opt/argus-metric"
BUNDLE_DIR="/bundle"
STATE_DIR_BASE="/private/argus/agent"
mkdir -p "$INSTALL_ROOT" "$STATE_DIR_BASE" /logs/train /logs/infer /buffers || true
# Ensure world-writable logs dir with sticky bit (align with deployment_new policy)
if [[ "${ARGUS_LOGS_WORLD_WRITABLE:-1}" == "1" ]]; then
chmod 1777 /logs/train /logs/infer || true
else
chmod 755 /logs/train /logs/infer || true
fi
chmod 770 /buffers || true
installed_ok=0
# 1) already installed?
if [[ -L "$INSTALL_ROOT/current" && -d "$INSTALL_ROOT/current" ]]; then
echo "[BOOT] client already installed at $INSTALL_ROOT/current"
else
# 2) try local bundle first (argus-metric_*.tar.gz)
tarball=$(ls -1 "$BUNDLE_DIR"/argus-metric_*.tar.gz 2>/dev/null | head -1 || true)
if [[ -n "${tarball:-}" ]]; then
echo "[BOOT] installing from local bundle: $(basename "$tarball")"
tmp=$(mktemp -d)
tar -xzf "$tarball" -C "$tmp"
# locate root containing version.json
root="$tmp"
if [[ ! -f "$root/version.json" ]]; then
sub=$(find "$tmp" -mindepth 1 -maxdepth 1 -type d | head -n1 || true)
[[ -n "$sub" && -f "$sub/version.json" ]] && root="$sub"
fi
if [[ ! -f "$root/version.json" ]]; then
echo "[BOOT][WARN] version.json not found in bundle; fallback to FTP"
else
ver=$(sed -n 's/.*"version"\s*:\s*"\([^"]\+\)".*/\1/p' "$root/version.json" | head -n1)
if [[ -z "$ver" ]]; then
echo "[BOOT][WARN] failed to parse version from version.json; fallback to FTP"
else
target_root="$INSTALL_ROOT"
version_dir="$target_root/versions/$ver"
mkdir -p "$version_dir"
shopt -s dotglob
mv "$root"/* "$version_dir/" 2>/dev/null || true
shopt -u dotglob
if [[ -f "$version_dir/install.sh" ]]; then
chmod +x "$version_dir/install.sh" 2>/dev/null || true
(
export AUTO_START_DCGM="${AUTO_START_DCGM:-1}"
export DCGM_EXPORTER_DISABLE_PROFILING="${DCGM_EXPORTER_DISABLE_PROFILING:-1}"
export DCGM_EXPORTER_LISTEN="${DCGM_EXPORTER_LISTEN:-:9400}"
cd "$version_dir" && ./install.sh "$version_dir"
)
echo "$ver" > "$target_root/LATEST_VERSION" 2>/dev/null || true
ln -sfn "$version_dir" "$target_root/current" 2>/dev/null || true
if [[ -L "$target_root/current" && -d "$target_root/current" ]]; then
installed_ok=1
echo "[BOOT] local bundle install OK: version=$ver"
else
echo "[BOOT][WARN] current symlink not present after install; will rely on healthcheck to confirm"
fi
else
echo "[BOOT][WARN] install.sh missing under $version_dir; fallback to FTP"
fi
fi
fi
fi
# 3) fallback: use FTP setup if not installed
if [[ ! -L "$INSTALL_ROOT/current" && "$installed_ok" -eq 0 ]]; then
echo "[BOOT] fallback to FTP setup"
if [[ -z "${FTPIP:-}" || -z "${FTP_USER:-}" || -z "${FTP_PASSWORD:-}" ]]; then
echo "[BOOT][ERROR] FTP variables not set (FTPIP/FTP_USER/FTP_PASSWORD)" >&2
exit 1
fi
curl -u "$FTP_USER:$FTP_PASSWORD" -fsSL "ftp://$FTPIP:21/setup.sh" -o /tmp/setup.sh
chmod +x /tmp/setup.sh
/tmp/setup.sh --server "$FTPIP" --user "$FTP_USER" --password "$FTP_PASSWORD" --port 21
fi
fi
# 4) ensure argus-agent is running (best-effort)
if ! pgrep -x argus-agent >/dev/null 2>&1; then
echo "[BOOT] starting argus-agent (not detected)"
setsid /usr/local/bin/argus-agent >/var/log/argus-agent.log 2>&1 < /dev/null &
fi
# 5) post-install selfcheck (run once) and state
# prefer the current version dir; fall back to the latest version under /opt/argus-metric/versions
ver_dir=""
if [[ -L "$INSTALL_ROOT/current" ]]; then
ver_dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
fi
if [[ -z "$ver_dir" ]]; then
# pick the latest by name (semver-like); best-effort
ver_dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
fi
if [[ -n "$ver_dir" && -x "$ver_dir/check_health.sh" ]]; then
echo "[BOOT] running initial health check: $ver_dir/check_health.sh"
if "$ver_dir/check_health.sh" >> "$ver_dir/.health_check.init.log" 2>&1; then
echo "[BOOT] initial health check completed (see $ver_dir/.health_check.init.log)"
else
echo "[BOOT][WARN] initial health check reported issues (see $ver_dir/.health_check.init.log)"
fi
else
echo "[BOOT][WARN] initial health check skipped (script missing: $ver_dir/check_health.sh)"
fi
host="$(hostname)"
state_dir="$STATE_DIR_BASE/${host}"
mkdir -p "$state_dir" 2>/dev/null || true
for i in {1..60}; do
if [[ -s "$state_dir/node.json" ]]; then
echo "[BOOT] node state present: $state_dir/node.json"
break
fi
sleep 2
done
# 6) spawn health watcher (best-effort, non-blocking)
if command -v /usr/local/bin/health-watcher.sh >/dev/null 2>&1; then
echo "[BOOT] starting health watcher for $ver_dir"
setsid /usr/local/bin/health-watcher.sh "${ver_dir:-}" >/var/log/health-watcher.log 2>&1 < /dev/null || true &
else
echo "[BOOT][WARN] health-watcher.sh not found; skip health watcher"
fi
echo "[BOOT] ready; entering sleep"
exec sleep infinity

View File

@ -22,7 +22,8 @@
[PARSER]
Name timestamp_parser
Format regex
Regex ^(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:Z|[+-]\d{2}:?\d{2}))\s+(?<level>\w+)\s+(?<message>.*)$
Regex ^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?<level>\w+)\s+(?<message>.*)$
Time_Key timestamp
Time_Format %Y-%m-%dT%H:%M:%S%z
Time_Format %Y-%m-%d %H:%M:%S
Time_Offset +0800
Time_Keep On
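
To make the two parser variants in this hunk easier to compare, here is a small Python sketch of the space-separated variant: the regex with `YYYY-MM-DD HH:MM:SS`, the matching time format, and a fixed +08:00 offset. It is only a model of the parsing rules, not of Fluent Bit itself. Python writes named groups as `(?P<name>...)` where the parser config uses `(?<name>...)`, and the function name `parse_line` and the sample log lines are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Same pattern as the space-separated Regex line, in Python's named-group syntax.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?P<level>\w+)\s+(?P<message>.*)$"
)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
OFFSET = timezone(timedelta(hours=8))  # mirrors Time_Offset +0800


def parse_line(line: str):
    """Return a dict with time/level/message, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("timestamp"), TIME_FORMAT).replace(tzinfo=OFFSET)
    return {"time": ts, "level": m.group("level"), "message": m.group("message")}


record = parse_line("2025-10-28 17:35:02 INFO [host01] training step=1 loss=1.23 model=bert")
iso_style = parse_line("2025-10-28T09:35:02Z INFO [host01] training step=1 loss=1.23 model=bert")
print(record["time"].isoformat(), record["level"], iso_style)
```

Because the timestamp text carries no zone, the offset has to come from configuration; a line written in the ISO `T...Z` form does not match this pattern at all, which is why the log-writing test scripts and the parser have to change together.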

View File

@ -77,20 +77,7 @@ cp -r /tmp/flb/etc/* /etc/fluent-bit/
# Create logs/buffers dirs
mkdir -p /logs/train /logs/infer /buffers
# Control log directory permissions: host bind-mount directories default to 1777; this can be turned off via an environment variable
: "${ARGUS_LOGS_WORLD_WRITABLE:=1}"
if [[ "${ARGUS_LOGS_WORLD_WRITABLE}" == "1" ]]; then
chmod 1777 /logs/train /logs/infer || true
else
chmod 755 /logs/train /logs/infer || true
fi
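# The buffer directory is for the process only and is not writable by others
chmod 770 /buffers || true
# Set directory ownership to fluent-bit (does not affect the 1777 sticky bit)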
chown -R fluent-bit:fluent-bit /logs /buffers 2>/dev/null || true
chmod 755 /logs/train /logs/infer /buffers
# Wait for Elasticsearch via bash /dev/tcp to avoid curl dependency
echo "[INFO] Waiting for Elasticsearch to be ready (tcp ${ES_HOST}:${ES_PORT})..."

View File

@ -28,11 +28,11 @@ fi
docker exec "$container_name" mkdir -p /logs/train /logs/infer
# Write training logs (host01)
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=1 loss=1.23 model=bert\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=2 loss=1.15 model=bert\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=1 loss=1.23 model=bert\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=2 loss=1.15 model=bert\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
# Write inference logs (host01)
docker exec "$container_name" sh -c "printf '%s ERROR [host01] inference failed on batch=1\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "printf '%s ERROR [host01] inference failed on batch=1\n' \"\$(date '+%F %T')\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "cat <<'STACK' >> /logs/infer/infer-demo.log
Traceback (most recent call last):
File \"inference.py\", line 15, in <module>

View File

@ -28,13 +28,13 @@ fi
docker exec "$container_name" mkdir -p /logs/train /logs/infer
# Write training logs (host02)
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=1 loss=1.45 model=gpt\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=2 loss=1.38 model=gpt\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=3 loss=1.32 model=gpt\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=1 loss=1.45 model=gpt\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=2 loss=1.38 model=gpt\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=3 loss=1.32 model=gpt\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
# Write inference logs (host02)
docker exec "$container_name" sh -c "printf '%s WARN [host02] inference slow on batch=5 latency=2.3s\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] inference completed batch=6 latency=0.8s\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "printf '%s WARN [host02] inference slow on batch=5 latency=2.3s\n' \"\$(date '+%F %T')\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] inference completed batch=6 latency=0.8s\n' \"\$(date '+%F %T')\" >> /logs/infer/infer-demo.log"
echo "[OK] Test logs written into the host02 container via docker exec:"
echo " - /logs/train/train-demo.log"

View File

@ -13,8 +13,6 @@ class AppConfig:
scheduler_interval_seconds: int
node_id_prefix: str
auth_mode: str
target_prefer_net_cidrs: str
target_reachability_check: bool
def _get_int_env(name: str, default: int) -> int:
@ -29,12 +27,6 @@ def _get_int_env(name: str, default: int) -> int:
def load_config() -> AppConfig:
"""Read environment variables into a config object so runtime parameters are managed in one place."""
def _bool_env(name: str, default: bool) -> bool:
raw = os.environ.get(name)
if raw is None or raw.strip() == "":
return default
return raw.strip().lower() in ("1", "true", "yes", "on")
return AppConfig(
db_path=os.environ.get("DB_PATH", "/private/argus/master/db.sqlite3"),
metric_nodes_json_path=os.environ.get(
@ -45,6 +37,4 @@ def load_config() -> AppConfig:
scheduler_interval_seconds=_get_int_env("SCHEDULER_INTERVAL_SECONDS", 30),
node_id_prefix=os.environ.get("NODE_ID_PREFIX", "A"),
auth_mode=os.environ.get("AUTH_MODE", "disabled"),
target_prefer_net_cidrs=os.environ.get("TARGET_PREFER_NET_CIDRS", "10.0.0.0/8,172.31.0.0/16"),
target_reachability_check=_bool_env("TARGET_REACHABILITY_CHECK", False),
)

View File

@ -1,10 +1,8 @@
from __future__ import annotations
import ipaddress
import logging
import socket
import threading
from typing import Optional, Iterable, Dict, Any, List
from typing import Optional
from .config import AppConfig
from .storage import Storage
@ -36,117 +34,10 @@ class StatusScheduler:
        self._pending_nodes_json.set()
    def generate_nodes_json(self) -> None:
        """Generate Prometheus scrape targets from online nodes, preferring the overlay IP.
        Candidate order: meta.overlay_ip > hostname A record matching a preferred CIDR > meta.ip.
        Optional reachability check: with TARGET_REACHABILITY_CHECK=true, try one 1s TCP connect to 9100/9400.
        Pick the first reachable candidate; if all fail, take the first candidate in order and log it.
        """
        with self._nodes_json_lock:
            rows = self._storage.get_online_nodes_meta()
            prefer_cidrs = self._parse_cidrs(self._config.target_prefer_net_cidrs)
            reachability = self._config.target_reachability_check
            result: List[Dict[str, Any]] = []
            for row in rows:
                meta = row.get("meta", {})
                hostname = meta.get("hostname") or row.get("name")
                labels = row.get("labels") or []
                overlay_ip = meta.get("overlay_ip")
                legacy_ip = meta.get("ip")
                host_candidates = self._resolve_host_ips(hostname)
                host_pref = self._pick_by_cidrs(host_candidates, prefer_cidrs)
                candidates: List[str] = []
                for ip in [overlay_ip, host_pref, legacy_ip]:
                    if ip and ip not in candidates:
                        candidates.append(ip)
                chosen = None
                if reachability:
                    ports = [9100]
                    try:
                        if int(meta.get("gpu_number", 0)) > 0:
                            ports.append(9400)
                    except Exception:
                        pass
                    for ip in candidates:
                        if any(self._reachable(ip, p, 1.0) for p in ports):
                            chosen = ip
                            break
                if not chosen:
                    chosen = candidates[0] if candidates else legacy_ip
                if not chosen:
                    # ultimate fallback: 127.0.0.1 (should not happen)
                    chosen = "127.0.0.1"
                    self._logger.warning("No candidate IPs for node; falling back", extra={"node": row.get("node_id")})
                if chosen and ipaddress.ip_address(chosen) in ipaddress.ip_network("172.22.0.0/16"):
                    self._logger.warning(
                        "Prometheus target uses docker_gwbridge address; prefer overlay",
                        extra={"node": row.get("node_id"), "ip": chosen},
                    )
                result.append(
                    {
                        "node_id": row.get("node_id"),
                        "user_id": meta.get("user"),
                        "ip": chosen,
                        "hostname": hostname,
                        "labels": labels if isinstance(labels, list) else [],
                    }
                )
            atomic_write_json(self._config.metric_nodes_json_path, result)
            self._logger.info("nodes.json updated", extra={"count": len(result)})
    # ---------------------------- helpers ----------------------------
    @staticmethod
    def _parse_cidrs(raw: str) -> List[ipaddress.IPv4Network]:
        nets: List[ipaddress.IPv4Network] = []
        for item in (x.strip() for x in (raw or "").split(",")):
            if not item:
                continue
            try:
                net = ipaddress.ip_network(item, strict=False)
                if isinstance(net, ipaddress.IPv4Network):
                    nets.append(net)
            except ValueError:
                continue
        return nets
    @staticmethod
    def _resolve_host_ips(hostname: str) -> List[str]:
        ips: List[str] = []
        try:
            infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
            for info in infos:
                ip = info[4][0]
                if ip not in ips:
                    ips.append(ip)
        except OSError:
            pass
        return ips
    @staticmethod
    def _pick_by_cidrs(candidates: Iterable[str], prefer: List[ipaddress.IPv4Network]) -> str | None:
        for net in prefer:
            for ip in candidates:
                try:
                    if ipaddress.ip_address(ip) in net:
                        return ip
                except ValueError:
                    continue
        return None
    @staticmethod
    def _reachable(ip: str, port: int, timeout: float) -> bool:
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return True
        except OSError:
            return False
            online_nodes = self._storage.get_online_nodes()
            atomic_write_json(self._config.metric_nodes_json_path, online_nodes)
            self._logger.info("nodes.json updated", extra={"count": len(online_nodes)})
# ------------------------------------------------------------------
# internal loop

View File

@ -324,35 +324,9 @@ class Storage:
                {
                    "node_id": row["id"],
                    "user_id": meta.get("user"),
                    "ip": meta.get("ip"), # kept for backward-compat; preferred IP selection handled in scheduler
                    "ip": meta.get("ip"),
                    "hostname": meta.get("hostname", row["name"]),
                    "labels": labels if isinstance(labels, list) else [],
                }
            )
        return result
    def get_online_nodes_meta(self) -> List[Dict[str, Any]]:
        """Return the raw meta, name, and labels of online nodes so the caller can choose the target IP.
        Each item contains: { node_id, name, meta, labels }
        """
        with self._lock:
            cur = self._conn.execute(
                "SELECT id, name, meta_json, labels_json FROM nodes WHERE status = ? ORDER BY id ASC",
                ("online",),
            )
            rows = cur.fetchall()
        result: List[Dict[str, Any]] = []
        for row in rows:
            meta = json.loads(row["meta_json"]) if row["meta_json"] else {}
            labels = json.loads(row["labels_json"]) if row["labels_json"] else []
            result.append(
                {
                    "node_id": row["id"],
                    "name": row["name"],
                    "meta": meta if isinstance(meta, dict) else {},
                    "labels": labels if isinstance(labels, list) else [],
                }
            )
        return result

View File

@ -1 +1 @@
1.44.0
1.35.0

Binary file not shown.

View File

@ -14,16 +14,6 @@ log_info() {
echo -e "${BLUE}[INFO]${NC} $1"
}
# Runtime switches (can be overridden via environment variables)
# 1) Whether to auto-start nv-hostengine (containers usually have no systemd)
AUTO_START_DCGM="${AUTO_START_DCGM:-1}"
# 2) Whether to disable Profiling metrics by default (avoids DCGM Profiling crashes in some environments)
DCGM_EXPORTER_DISABLE_PROFILING="${DCGM_EXPORTER_DISABLE_PROFILING:-1}"
# 3) Custom collectors file; if empty and Profiling is disabled, a no-prof list is generated automatically
DCGM_EXPORTER_COLLECTORS="${DCGM_EXPORTER_COLLECTORS:-}"
# 4) Listen address
DCGM_EXPORTER_LISTEN="${DCGM_EXPORTER_LISTEN:-:9400}"
log_success() {
echo -e "${GREEN}[SUCCESS]${NC} $1"
}
@ -170,21 +160,10 @@ check_dcgm_service() {
elif pgrep -f nv-hostengine > /dev/null; then
log_success "nv-hostengine 进程已在运行"
else
log_warning "DCGM 服务未运行"
if [[ "${AUTO_START_DCGM}" == "1" ]]; then
log_info "尝试自动启动 nv-hostengine容器内无 systemd 场景)..."
nohup nv-hostengine > /var/log/nv-hostengine.log 2>&1 &
sleep 2
if pgrep -f nv-hostengine >/dev/null; then
log_success "nv-hostengine 已启动"
else
log_error "nv-hostengine 启动失败,请手动检查 /var/log/nv-hostengine.log"
fi
else
log_info "启动 DCGM 服务的方法:"
log_info " 1. 使用 systemd: sudo systemctl start dcgm"
log_info " 2. 手动启动: nohup nv-hostengine > /var/log/nv-hostengine.log 2>&1 &"
fi
log_warning "DCGM 服务未运行,需要手动启动"
log_info "启动 DCGM 服务的方法:"
log_info " 1. 使用 systemd: sudo systemctl start dcgm"
log_info " 2. 手动启动: nohup nv-hostengine > /var/log/nv-hostengine.log 2>&1 &"
fi
# 测试 DCGM 连接
@ -193,7 +172,7 @@ check_dcgm_service() {
if dcgmi discovery -l > /dev/null 2>&1; then
log_success "DCGM 连接测试成功"
else
log_warning "DCGM 连接测试失败,请检查服务状态(驱动/权限/设备可见性)"
log_warning "DCGM 连接测试失败,请检查服务状态"
fi
fi
}
@ -290,7 +269,6 @@ start_dcgm_exporter() {
local binary_path="/usr/local/bin/dcgm-exporter"
local log_file="/var/log/dcgm-exporter.log"
local pid_file="/var/run/dcgm-exporter.pid"
local collectors_arg=()  # array: stays empty when no collectors file applies, so no empty argument is passed
# 检查服务是否已经在运行
if [[ -f "$pid_file" ]]; then
@ -304,48 +282,15 @@ start_dcgm_exporter() {
fi
fi
# 计算 collectors 参数
if [[ -n "${DCGM_EXPORTER_COLLECTORS}" ]]; then
if [[ -f "${DCGM_EXPORTER_COLLECTORS}" ]]; then
collectors_arg=(--collectors "${DCGM_EXPORTER_COLLECTORS}")
log_info "使用自定义 collectors: ${DCGM_EXPORTER_COLLECTORS}"
else
log_warning "指定的 DCGM_EXPORTER_COLLECTORS 文件不存在: ${DCGM_EXPORTER_COLLECTORS}(将忽略)"
fi
elif [[ "${DCGM_EXPORTER_DISABLE_PROFILING}" == "1" ]]; then
local cfg_dir="/etc/dcgm-exporter"
local default_cfg="${cfg_dir}/default-counters.csv"
local no_prof_cfg="${cfg_dir}/no-prof.csv"
mkdir -p "${cfg_dir}"
if [[ -f "${default_cfg}" ]]; then
grep -v 'DCGM_FI_PROF_' "${default_cfg}" > "${no_prof_cfg}" || true
collectors_arg=(--collectors "${no_prof_cfg}")
log_info "已生成无 Profiling 的 collectors: ${no_prof_cfg}"
else
log_warning "未找到默认 collectors 文件: ${default_cfg}"
fi
fi
# 检查端口是否被占用
if netstat -tuln 2>/dev/null | grep -q ":${DCGM_EXPORTER_LISTEN#:} "; then
if netstat -tuln 2>/dev/null | grep -q ":9400 "; then
log_warning "端口 9400 已被占用,请检查是否有其他服务在运行"
return 1
fi
# 启动前再校验一次 DCGM 主机引擎
if ! (systemctl is-active --quiet dcgm 2>/dev/null || pgrep -f nv-hostengine >/dev/null); then
log_warning "nv-hostengine 未运行,尝试自动启动"
nohup nv-hostengine > /var/log/nv-hostengine.log 2>&1 &
sleep 2
fi
# 启动服务
log_info "正在启动 DCGM Exporter..."
if [[ ${#collectors_arg[@]} -gt 0 ]]; then
nohup "$binary_path" --address="${DCGM_EXPORTER_LISTEN}" "${collectors_arg[@]}" > "$log_file" 2>&1 &
else
nohup "$binary_path" --address="${DCGM_EXPORTER_LISTEN}" > "$log_file" 2>&1 &
fi
nohup "$binary_path" --address=:9400 > "$log_file" 2>&1 &
local pid=$!
# 保存 PID
@ -365,20 +310,6 @@ start_dcgm_exporter() {
else
log_error "DCGM Exporter 服务启动失败"
rm -f "$pid_file"
# Failure fallback: if Profiling is not disabled and no collectors file is specified, retry once with the no-prof list
if [[ -z "${DCGM_EXPORTER_COLLECTORS}" && "${DCGM_EXPORTER_DISABLE_PROFILING}" != "1" ]]; then
log_warning "尝试以无 Profiling 清单回退启动"
local cfg_dir="/etc/dcgm-exporter"; local default_cfg="${cfg_dir}/default-counters.csv"; local no_prof_cfg="${cfg_dir}/no-prof.csv"
if [[ -f "${default_cfg}" ]]; then
grep -v 'DCGM_FI_PROF_' "${default_cfg}" > "${no_prof_cfg}" || true
nohup "$binary_path" --address="${DCGM_EXPORTER_LISTEN}" --collectors "${no_prof_cfg}" > "$log_file" 2>&1 &
sleep 2
if pgrep -f dcgm-exporter >/dev/null; then
log_success "DCGM Exporter 已用无 Profiling 清单启动"
return 0
fi
fi
fi
return 1
fi
}

View File

@ -48,15 +48,6 @@ if [[ ${#missing_files[@]} -gt 0 ]]; then
exit 1
fi
# Guard: prevent Git LFS pointer files from being packaged
for f in bin/dcgm-exporter bin/datacenter-gpu-manager_3.3.9_amd64.deb; do
if head -n1 "$f" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1$'; then
echo "[ERROR] $f 是 Git LFS 指针文件,未还原为真实制品"
echo " 请在仓库根目录执行: git lfs fetch --all && git lfs checkout"
exit 1
fi
done
log_success "所有必要文件检查完成"
# 创建临时目录

View File

@ -1,11 +1,9 @@
# Important: use Logstash_Format + Logstash_Prefix to generate the train-*/infer-* indices
# Note: Fluent Bit configuration only supports ${VAR} placeholders, not Bash's ${VAR:-default}
# Fixed-domain requirement: use es.log.argus.com and port 9200
[OUTPUT]
Name es
Match app.train
Host es.log.argus.com
Port 9200
Host ${ES_HOST:-localhost}
Port ${ES_PORT:-9200}
Logstash_Format On
Logstash_Prefix train
Replace_Dots On
@ -16,8 +14,8 @@
[OUTPUT]
Name es
Match app.infer
Host es.log.argus.com
Port 9200
Host ${ES_HOST:-localhost}
Port ${ES_PORT:-9200}
Logstash_Format On
Logstash_Prefix infer
Replace_Dots On

View File

@ -22,6 +22,6 @@
[PARSER]
Name timestamp_parser
Format regex
Regex ^(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:Z|[+-]\d{2}:?\d{2}))\s+(?<level>\w+)\s+(?<message>.*)$
Regex ^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?<level>\w+)\s+(?<message>.*)$
Time_Key timestamp
Time_Format %Y-%m-%dT%H:%M:%S%z
Time_Format %Y-%m-%d %H:%M:%S
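
The two timestamp formats in this hunk can be compared outside Fluent Bit. The sketch below is only an illustration, not part of the repository: it ports the two `Regex` values to Python's `re` syntax (Fluent Bit writes named groups as `(?<name>...)`, Python needs `(?P<name>...)`), and the sample log lines are made up to mirror the `date -u +%Y-%m-%dT%H:%M:%SZ` and `date '+%F %T'` commands used by the test scripts.

```python
import re

# Python ports of the two Regex values from the timestamp_parser hunk.
ISO_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:Z|[+-]\d{2}:?\d{2}))"
    r"\s+(?P<level>\w+)\s+(?P<message>.*)$"
)
PLAIN_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r"\s+(?P<level>\w+)\s+(?P<message>.*)$"
)

# Hypothetical sample lines, shaped like the output of the two date commands.
iso_line = "2025-10-28T09:35:02Z INFO [host02] training step=1 loss=1.45 model=gpt"
plain_line = "2025-10-28 17:35:02 INFO [host02] training step=1 loss=1.45 model=gpt"


def parse(line):
    """Return (format_name, fields) for the first pattern that matches, else None."""
    for name, pattern in (("iso", ISO_RE), ("plain", PLAIN_RE)):
        match = pattern.match(line)
        if match:
            return name, match.groupdict()
    return None


if __name__ == "__main__":
    for sample in (iso_line, plain_line):
        print(parse(sample))
```

Each pattern accepts only its own format, so the log writers and the parser have to change together, which is why this compare touches both the test scripts and the parser definition.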

View File

@ -171,16 +171,9 @@ fi
# 创建日志和缓冲区目录
log_info "Creating log and buffer directories..."
mkdir -p /logs/train /logs/infer /buffers
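# Use mode 1777 (with the sticky bit) on the shared log directories so any host account can create files/directories
if [[ "${ARGUS_LOGS_WORLD_WRITABLE:-1}" == "1" ]]; then
chmod 1777 /logs/train /logs/infer || true
else
chmod 755 /logs/train /logs/infer || true
fi
# The buffer directory is restricted to the process
chmod 770 /buffers || true
# Set directory ownership; this does not affect the 1777 sticky bit
chown -R fluent-bit:fluent-bit /logs /buffers 2>/dev/null || true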
chmod 755 /logs/train /logs/infer
chmod 770 /buffers
chown -R fluent-bit:fluent-bit /logs /buffers
# 启动 Fluent Bit
log_info "Starting Fluent Bit with configuration from /etc/fluent-bit/"
@ -213,8 +206,7 @@ export HOSTNAME
export CLUSTER="${CLUSTER:-local}"
export RACK="${RACK:-dev}"
# Default to the fixed domain (to meet the "fixed domain" requirement); if a value is passed in from outside, use that instead
export ES_HOST="${ES_HOST:-es.log.argus.com}"
export ES_HOST="${ES_HOST:-localhost}"
export ES_PORT="${ES_PORT:-9200}"
log_info "Environment variables:"

View File

@ -47,13 +47,6 @@ if [[ ${#missing_files[@]} -gt 0 ]]; then
exit 1
fi
# Guard: prevent Git LFS pointer files from being packaged
if head -n1 bin/node_exporter 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1$'; then
echo "[ERROR] bin/node_exporter 是 Git LFS 指针文件,未还原为真实二进制"
echo " 请在仓库根目录执行: git lfs fetch --all && git lfs checkout"
exit 1
fi
log_success "所有必要文件检查完成"
# 创建临时目录

View File

@ -274,33 +274,19 @@ verify_checksums() {
log_info "Artifact 目录: $artifact_dir"
failed_verification=0
# Try to parse install_order from version.json to pin exact file names and avoid ambiguity when several historical tars sit in the same directory
local order_file="$TEMP_DIR/install_order.txt"
if [[ -f "$TEMP_DIR/checksums.txt" ]]; then
while IFS= read -r line; do
component=$(echo "$line" | cut -d':' -f1)
expected_checksum=$(echo "$line" | cut -d':' -f2-)
# 优先从 install_order 中推导精确文件名
# 查找匹配的 tar 文件
actual_file=""
if [[ -f "$order_file" ]]; then
while IFS= read -r fname; do
if [[ "$fname" == ${component}-*.tar.gz && -f "$artifact_dir/$fname" ]]; then
actual_file="$artifact_dir/$fname"
break
fi
done < "$order_file"
fi
# 回退:按前缀匹配首个(不推荐,但保持兼容)
if [[ -z "$actual_file" ]]; then
for file in "$artifact_dir/${component}-"*.tar.gz; do
if [[ -f "$file" ]]; then
actual_file="$file"
break
fi
done
fi
for file in "$artifact_dir/${component}-"*.tar.gz; do
if [[ -f "$file" ]]; then
actual_file="$file"
break
fi
done
if [[ -z "$actual_file" ]]; then
log_error "找不到组件文件: $component"

View File

@ -59,12 +59,6 @@ ARTIFACT_DIR="artifact/$VERSION"
log_info "开始打包 AIOps All-in-One 安装包 v$VERSION"
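# When packaging is forced and the directory already exists, clean old artifacts first so leftover tar.gz files under the same version do not confuse checksum verification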
if [[ "$FORCE_PACKAGE" == "true" && -d "$ARTIFACT_DIR" ]]; then
log_info "--force: 清理旧的 $ARTIFACT_DIR 下的 tar 与元数据"
rm -rf "$ARTIFACT_DIR"
fi
# 检查必要文件
log_info "检查必要文件..."
if [[ ! -f "config/VERSION" ]]; then
@ -136,7 +130,7 @@ if [[ -d "$ARTIFACT_DIR" && "$FORCE_PACKAGE" == "false" ]]; then
fi
fi
# 创建 artifact 目录(清理后重建)
# 创建 artifact 目录
mkdir -p "$ARTIFACT_DIR"
log_info "创建输出目录: $ARTIFACT_DIR"
@ -222,36 +216,6 @@ if [[ ${#missing_components[@]} -gt 0 ]]; then
exit 1
fi
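# Extra check: prevent Git LFS pointer files from being packed into the installer
# Only files under each component's bin/ are checked (typically binaries or .deb/.tar.gz artifacts)
is_lfs_pointer() {
local f="$1"
# Read the first line to decide whether this is an LFS pointer (no dependency on the file command)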
head -n1 "$f" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1$'
}
log_info "检查组件二进制是否已从 LFS 拉取..."
while IFS= read -r component; do
component_path=$(grep "^$component:" "$TEMP_DIR/component_paths.txt" | cut -d':' -f2-)
bin_dir="$component_path/bin"
[[ -d "$bin_dir" ]] || continue
while IFS= read -r f; do
# 只检查常见可执行/包后缀;无后缀的也检查
case "$f" in
*.sh) continue;;
*) :;;
esac
if is_lfs_pointer "$f"; then
log_error "检测到 Git LFS 指针文件: $f"
log_error "请在仓库根目录执行: git lfs fetch --all && git lfs checkout"
log_error "或确保 CI 在打包前已还原 LFS 大文件。"
rm -rf "$TEMP_DIR"
exit 1
fi
done < <(find "$bin_dir" -maxdepth 1 -type f 2>/dev/null | sort)
done < "$COMPONENTS_FILE"
log_success "LFS 校验通过:未发现指针文件"
# 打包各个组件
log_info "开始打包组件..."
@ -270,19 +234,7 @@ while IFS= read -r component; do
# 进入组件目录
cd "$component_path"
# Second line of defense inside the component: if the package script lacks the LFS check, block again here
if [[ -d bin ]]; then
for f in bin/*; do
[[ -f "$f" ]] || continue
if head -n1 "$f" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1$'; then
log_error "组件 $component 含 LFS 指针文件: $f"
log_error "请执行: git lfs fetch --all && git lfs checkout"
cd "$CURRENT_DIR"; rm -rf "$TEMP_DIR"; exit 1
fi
done
fi
# 检查组件是否有 package.sh
if [[ ! -f "package.sh" ]]; then
log_error "$component 缺少 package.sh 文件"
@ -291,13 +243,10 @@ while IFS= read -r component; do
exit 1
fi
# Remove historical tar packages in the component directory so find does not pick an old file
rm -f ./*.tar.gz 2>/dev/null || true
# 执行组件的打包脚本
if ./package.sh; then
# 查找生成的 tar 包
tar_file=$(ls -1t ./*.tar.gz 2>/dev/null | head -1)
tar_file=$(find . -name "*.tar.gz" -type f | head -1)
if [[ -n "$tar_file" ]]; then
# 移动到 artifact 目录
mv "$tar_file" "$CURRENT_DIR/$ARTIFACT_DIR/"

View File

@ -130,40 +130,20 @@ fi
TEMP_PACKAGE_DIR="/tmp/argus-metric-package-$$"
mkdir -p "$TEMP_PACKAGE_DIR"
# Copy only the tar.gz files listed in install_order of version.json, so leftover files in the same version directory cannot cause checksum mismatches
log_info "准备 artifact 文件(按 install_order..."
# 复制所有 tar.gz 文件到临时目录
log_info "准备 artifact 文件..."
tar_files=$(find "$ARTIFACT_DIR" -name "*.tar.gz" -type f)
install_list_file="$TEMP_DIR/install_list.txt"
if command -v jq >/dev/null 2>&1; then
jq -r '.install_order[]' "$ARTIFACT_DIR/version.json" > "$install_list_file" 2>/dev/null || true
else
# 简易解析
grep -A 200 '"install_order"' "$ARTIFACT_DIR/version.json" | grep -E '".*"' | sed 's/.*"\([^"]*\)".*/\1/' > "$install_list_file" 2>/dev/null || true
if [[ -z "$tar_files" ]]; then
log_error "$ARTIFACT_DIR 中未找到 tar.gz 文件"
exit 1
fi
if [[ -s "$install_list_file" ]]; then
while IFS= read -r filename; do
src="$ARTIFACT_DIR/$filename"
if [[ -f "$src" ]]; then
log_info " 拷贝: $filename"
cp "$src" "$TEMP_PACKAGE_DIR/"
else
log_warning " 未找到: $filename(跳过)"
fi
done < "$install_list_file"
else
log_warning "未能解析 install_order将回退复制全部 tar.gz可能包含历史残留建议安装端使用严格校验"
tar_files=$(find "$ARTIFACT_DIR" -name "*.tar.gz" -type f)
if [[ -z "$tar_files" ]]; then
log_error "$ARTIFACT_DIR 中未找到 tar.gz 文件"
exit 1
fi
for file in $tar_files; do
filename=$(basename "$file")
log_info " 准备: $filename"
cp "$file" "$TEMP_PACKAGE_DIR/"
done
fi
for file in $tar_files; do
filename=$(basename "$file")
log_info " 准备: $filename"
cp "$file" "$TEMP_PACKAGE_DIR/"
done
# 复制版本信息文件
if [[ -f "$ARTIFACT_DIR/version.json" ]]; then

View File

@ -48,31 +48,6 @@ BACKUPS_DIR="$INSTALL_DIR/backups" # 备份目录
CURRENT_LINK="$INSTALL_DIR/current" # 当前版本软链接
LATEST_VERSION_FILE="$INSTALL_DIR/LATEST_VERSION" # 当前版本记录文件
# Pre-check: Agent metadata and hostname constraints
require_agent_metadata() {
local hn
hn="$(hostname)"
local ok=false
# 三元环境变量
if [[ -n "${AGENT_ENV:-}" && -n "${AGENT_USER:-}" && -n "${AGENT_INSTANCE:-}" ]]; then
ok=true
fi
# host 形如 env-user-instance-xxx
if [[ "$hn" =~ ^[^-]+-[^-]+-[^-]+-.*$ ]]; then
ok=true
fi
if [[ "$ok" == false ]]; then
log_error "检测到 hostname 与 Agent 元数据不完整:"
log_error " 当前 hostname: $hn"
log_error " AGENT_ENV='${AGENT_ENV:-}' AGENT_USER='${AGENT_USER:-}' AGENT_INSTANCE='${AGENT_INSTANCE:-}'"
echo
log_info "请满足以下其一后重试:"
log_info " 方式A设置 hostname 为 env-user-instance-任意,例如 dev-alice-node001-pod-0"
log_info " 方式B导出环境变量export AGENT_ENV=dev AGENT_USER=alice AGENT_INSTANCE=node001"
exit 1
fi
}
# 检查必需的FTP参数
check_ftp_params() {
local missing_params=()
@ -898,47 +873,6 @@ rollback_version() {
fi
}
# Self-check: wait until node.json is ready and healthy, and verify that last_report keeps updating
selfcheck_post_install() {
local hn="$(hostname)"
local node_file="/private/argus/agent/${AGENT_HOSTNAME:-$hn}/node.json"
local deadline=$(( $(date +%s) + 300 ))
local t1="" t2=""
while :; do
if [[ -f "$node_file" ]]; then
if command -v jq >/dev/null 2>&1; then
local ok_health lr
ok_health=$(jq -er '(.health["metric-argus-agent"].status=="healthy") and (.health["metric-node-exporter"].status=="healthy") and (.health["metric-fluent-bit"].status=="healthy") and (.health["metric-dcgm-exporter"].status=="healthy")' "$node_file" 2>/dev/null || echo false)
lr=$(jq -r '.last_report // ""' "$node_file" 2>/dev/null)
if [[ "$ok_health" == true && -n "$lr" ]]; then
if [[ -z "$t1" ]]; then
t1="$lr"
# The agent reports every 60s by default; wait 70s and check again
sleep 70
continue
fi
t2="$lr"
if [[ "$t2" != "$t1" ]]; then
return 0
fi
# 若未变化,再等待一会儿直到超时
sleep 10
fi
else
# 无 jq 时的宽松校验
if grep -q '"status"\s*:\s*"healthy"' "$node_file"; then
return 0
fi
fi
fi
if (( $(date +%s) >= deadline )); then
log_error "自检超时:未在 5 分钟内确认 last_report 持续更新 或 健康状态不满足(路径:$node_file"
return 1
fi
sleep 5
done
}
# 主函数
main() {
echo "=========================================="
@ -978,26 +912,17 @@ main() {
# return 0
# fi
check_ftp_params
check_system
require_agent_metadata
check_ftp_params
check_system
if [[ "$ACTION" == "uninstall" ]]; then
uninstall_argus_metric
else
install_argus_metric
fi
# 安装后自检:最多等待 5 分钟,确认 node.json 存在且健康
echo
log_info "开始安装后自检(最多等待 5 分钟)..."
selfcheck_post_install || {
log_error "安装后自检未通过,请查看 /var/log/argus-agent.log 以及 /opt/argus-metric/versions/*/.install.log"
exit 1
}
echo
log_success "全部自检通过,安装完成!"
log_info "操作完成!"
}
# 脚本入口

View File

@ -67,8 +67,7 @@ RUN chmod +x /usr/local/bin/start-ftp-supervised.sh
COPY vsftpd.conf /etc/vsftpd/vsftpd.conf
COPY dns-monitor.sh /usr/local/bin/dns-monitor.sh
COPY dns-publish.sh /usr/local/bin/dns-publish.sh
RUN chmod +x /usr/local/bin/dns-monitor.sh /usr/local/bin/dns-publish.sh
RUN chmod +x /usr/local/bin/dns-monitor.sh
USER root

View File

@ -66,17 +66,6 @@ ${FTP_BASE_PATH}/
/private/argus/etc/
└── ${DOMAIN} # 容器IP记录文件
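## DNS sync to the FTP share (runtime)
- At runtime, the latest DNS list is written by bind/master to the mount point `/private/argus/etc/dns.conf`.
- The FTP container ships a built-in `dns-publish` (supervised) that compares every 10s and atomically syncs this file to `${FTP_BASE_PATH}/share/dns.conf`, so the client download/install script can read it directly.
- Sync properties:
  - Atomic update: write `${DST}.tmp`, then overwrite with `mv -f`, so readers never see a half-written file.
  - Permissions: 0644, owner `${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}`.
  - Observability: log at `/var/log/supervisor/dns-publish.log`.
> Note: the build/publish stage may also copy the static `config/dns.conf` to share; once the FTP container is running, dns-publish overwrites that static file with the latest runtime file.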
```
## vsftpd 配置说明
@ -167,4 +156,4 @@ curl -fsS 'ftp://ftpuser:ZGClab1234!@177.177.70.200/setup.sh' -o setup.sh
# Run directly as root; non-root users need sudo
chmod +x setup.sh
bash setup.sh --server {$域名} --user ftpuser --password 'ZGClab1234!'
```
```

View File

@ -1,40 +0,0 @@
#!/bin/bash
set -uo pipefail
# Publish latest /private/argus/etc/dns.conf to ${FTP_BASE_PATH}/share/dns.conf
SRC="/private/argus/etc/dns.conf"
FTP_BASE_PATH="${FTP_BASE_PATH:-/private/argus/ftp}"
DST_DIR="${FTP_BASE_PATH}/share"
DST="${DST_DIR}/dns.conf"
UID_VAL="${ARGUS_BUILD_UID:-2133}"
GID_VAL="${ARGUS_BUILD_GID:-2015}"
INTERVAL="${DNS_PUBLISH_INTERVAL:-10}"
log() { echo "$(date '+%Y-%m-%d %H:%M:%S') [DNS-Publish] $*"; }
mkdir -p "$DST_DIR" 2>/dev/null || true
log "service start: SRC=$SRC DST=$DST interval=${INTERVAL}s"
while true; do
if [[ -f "$SRC" ]]; then
# Only sync when content differs
if ! cmp -s "$SRC" "$DST" 2>/dev/null; then
tmp="${DST}.tmp"
if cp "$SRC" "$tmp" 2>/dev/null; then
mv -f "$tmp" "$DST"
chown "$UID_VAL":"$GID_VAL" "$DST" 2>/dev/null || true
chmod 0644 "$DST" 2>/dev/null || true
ts_src=$(date -r "$SRC" '+%Y-%m-%dT%H:%M:%S%z' 2>/dev/null || echo "?")
log "synced dns.conf (src mtime=$ts_src) -> $DST"
else
log "ERROR: copy failed $SRC -> $tmp"
fi
fi
else
log "waiting for source $SRC"
fi
sleep "$INTERVAL"
done
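
The compare-then-rename step in the loop above can be illustrated in isolation. This is a minimal sketch under stated assumptions, not the script's implementation: `publish_if_changed` is a hypothetical helper name, the file contents in the demo are placeholders, the `chown` step is left out, and the mode is set on the temporary file before the rename rather than after it.

```python
import os
import shutil
import tempfile
from pathlib import Path


def publish_if_changed(src, dst, mode=0o644):
    """Copy src over dst via a temp file and rename; return True if dst was updated."""
    src, dst = Path(src), Path(dst)
    if not src.is_file():
        return False  # source not present yet; the shell loop simply waits
    if dst.is_file() and src.read_bytes() == dst.read_bytes():
        return False  # identical content, nothing to publish
    dst.parent.mkdir(parents=True, exist_ok=True)
    tmp = dst.with_name(dst.name + ".tmp")
    shutil.copyfile(src, tmp)
    os.chmod(tmp, mode)
    # Rename within the same directory, so readers see the old or the new file, never a partial one.
    os.replace(tmp, dst)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        source = Path(workdir, "etc", "dns.conf")
        target = Path(workdir, "ftp", "share", "dns.conf")
        source.parent.mkdir(parents=True)
        source.write_text("placeholder dns entry\n")
        print(publish_if_changed(source, target))
        print(publish_if_changed(source, target))
```

Writing the temporary file next to the destination keeps the rename on one filesystem, which is what makes the replacement atomic.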

View File

@ -28,18 +28,6 @@ stopwaitsecs=10
killasgroup=true
stopasgroup=true
[program:dns-publish]
command=/usr/local/bin/dns-publish.sh
user=root
stdout_logfile=/var/log/supervisor/dns-publish.log
stderr_logfile=/var/log/supervisor/dns-publish_error.log
autorestart=true
startretries=3
startsecs=5
stopwaitsecs=10
killasgroup=true
stopasgroup=true
[unix_http_server]
file=/var/run/supervisor.sock
chmod=0700

View File

@ -1 +0,0 @@
bundle/*.tar.gz

View File

@ -1,17 +0,0 @@
ARG BASE_IMAGE=argus-sys-metric-test-node:latest
FROM ${BASE_IMAGE}
ARG CLIENT_VER
LABEL org.opencontainers.image.title="argus-sys-metric-test-node-bundle" \
org.opencontainers.image.version="${CLIENT_VER}" \
org.opencontainers.image.description="Metric test node with embedded client package"
WORKDIR /
# bundle files are provided at build time into ./bundle in build context
COPY bundle/ /bundle/
COPY node-bootstrap.sh /usr/local/bin/node-bootstrap.sh
COPY health-watcher.sh /usr/local/bin/health-watcher.sh
RUN chmod +x /usr/local/bin/node-bootstrap.sh /usr/local/bin/health-watcher.sh
ENTRYPOINT ["/usr/local/bin/node-bootstrap.sh"]

File diff suppressed because it is too large

View File

@ -1,59 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# health-watcher.sh
# Periodically runs check_health.sh and restart_unhealthy.sh for in-container node self-healing.
INSTALL_ROOT="/opt/argus-metric"
INTERVAL="${HEALTH_WATCH_INTERVAL:-60}"
VER_DIR="${1:-}"
log(){ echo "[HEALTH-WATCHER] $*"; }
resolve_ver_dir() {
local dir=""
if [[ -n "${VER_DIR:-}" && -d "$VER_DIR" ]]; then
dir="$VER_DIR"
elif [[ -L "$INSTALL_ROOT/current" ]]; then
dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
fi
if [[ -z "$dir" ]]; then
dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
fi
echo "$dir"
}
main() {
log "starting with interval=${INTERVAL}s"
local dir
dir="$(resolve_ver_dir)"
if [[ -z "$dir" || ! -d "$dir" ]]; then
log "no valid install dir found under $INSTALL_ROOT; exiting"
exit 0
fi
local chk="$dir/check_health.sh"
local rst="$dir/restart_unhealthy.sh"
if [[ ! -x "$chk" && ! -x "$rst" ]]; then
log "neither check_health.sh nor restart_unhealthy.sh is executable under $dir; exiting"
exit 0
fi
log "watching install dir: $dir"
while :; do
if [[ -x "$chk" ]]; then
log "running check_health.sh"
"$chk" >> "$dir/.health_check.watch.log" 2>&1 || log "check_health.sh reported issues (see .health_check.watch.log)"
fi
if [[ -x "$rst" ]]; then
log "running restart_unhealthy.sh"
"$rst" >> "$dir/.restart.watch.log" 2>&1 || log "restart_unhealthy.sh reported issues (see .restart.watch.log)"
fi
sleep "$INTERVAL"
done
}
main "$@"

View File

@ -1,135 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
echo "[BOOT] node bundle starting"
INSTALL_DIR="/opt/argus-metric"
BUNDLE_DIR="/bundle"
installed_ok=0
# 1) already installed?
if [[ -L "$INSTALL_DIR/current" && -d "$INSTALL_DIR/current" ]]; then
echo "[BOOT] client already installed at $INSTALL_DIR/current"
else
# 2) try local bundle first (replicate setup.sh layout: move to /opt/argus-metric/versions/<ver> and run install.sh)
tarball=$(ls -1 "$BUNDLE_DIR"/argus-metric_*.tar.gz 2>/dev/null | head -1 || true)
if [[ -n "${tarball:-}" ]]; then
echo "[BOOT] installing from local bundle: $(basename "$tarball")"
tmp=$(mktemp -d)
tar -xzf "$tarball" -C "$tmp"
# locate root containing version.json
root="$tmp"
if [[ ! -f "$root/version.json" ]]; then
sub=$(find "$tmp" -mindepth 1 -maxdepth 1 -type d | head -n1 || true)
[[ -n "$sub" && -f "$sub/version.json" ]] && root="$sub"
fi
if [[ ! -f "$root/version.json" ]]; then
echo "[BOOT][WARN] version.json not found in bundle; fallback to FTP"
else
ver=$(sed -n 's/.*"version"\s*:\s*"\([^"]\+\)".*/\1/p' "$root/version.json" | head -n1)
if [[ -z "$ver" ]]; then
echo "[BOOT][WARN] failed to parse version from version.json; fallback to FTP"
else
target_root="/opt/argus-metric"
version_dir="$target_root/versions/$ver"
mkdir -p "$version_dir"
# move contents into version dir
shopt -s dotglob
mv "$root"/* "$version_dir/" 2>/dev/null || true
shopt -u dotglob
# run component installer within version dir
if [[ -f "$version_dir/install.sh" ]]; then
chmod +x "$version_dir/install.sh" 2>/dev/null || true
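# Pass runtime switches: inside the container AUTO_START_DCGM=1 is enabled and Profiling is disabled by default (both can be overridden via environment variables)
# Note: a `VAR=.. VAR2=.. (cmd)` prefix cannot be applied to a subshell; bash does not allow env assignments to directly modify a `(` compound command.
# So export inside the subshell and then run the command.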
(
export AUTO_START_DCGM="${AUTO_START_DCGM:-1}"
export DCGM_EXPORTER_DISABLE_PROFILING="${DCGM_EXPORTER_DISABLE_PROFILING:-1}"
export DCGM_EXPORTER_LISTEN="${DCGM_EXPORTER_LISTEN:-:9400}"
cd "$version_dir" && ./install.sh "$version_dir"
)
echo "$ver" > "$target_root/LATEST_VERSION" 2>/dev/null || true
ln -sfn "$version_dir" "$target_root/current" 2>/dev/null || true
if [[ -L "$target_root/current" && -d "$target_root/current" ]]; then
installed_ok=1
echo "[BOOT] local bundle install OK: version=$ver"
else
echo "[BOOT][WARN] current symlink not present after install; will rely on healthcheck to confirm"
fi
else
echo "[BOOT][WARN] install.sh missing under $version_dir; fallback to FTP"
fi
fi
fi
fi
# 3) fallback: use FTP setup if not installed
if [[ ! -L "$INSTALL_DIR/current" && "$installed_ok" -eq 0 ]]; then
echo "[BOOT] fallback to FTP setup"
if [[ -z "${FTPIP:-}" || -z "${FTP_USER:-}" || -z "${FTP_PASSWORD:-}" ]]; then
echo "[BOOT][ERROR] FTP variables not set (FTPIP/FTP_USER/FTP_PASSWORD)" >&2
exit 1
fi
curl -u "$FTP_USER:$FTP_PASSWORD" -fsSL "ftp://$FTPIP:21/setup.sh" -o /tmp/setup.sh
chmod +x /tmp/setup.sh
/tmp/setup.sh --server "$FTPIP" --user "$FTP_USER" --password "$FTP_PASSWORD" --port 21
fi
fi
# 4) ensure agent is running; start if needed (inherits env: MASTER_ENDPOINT/AGENT_*)
if ! pgrep -x argus-agent >/dev/null 2>&1; then
echo "[BOOT] starting argus-agent (not detected)"
setsid /usr/local/bin/argus-agent >/var/log/argus-agent.log 2>&1 < /dev/null &
fi
# 5) If dcgm-exporter is not listening (possibly because Profiling crashed), try a fallback start with the no-Profiling list
if ! ss -tlnp 2>/dev/null | grep -q ":9400 "; then
echo "[BOOT] dcgm-exporter not listening; trying no-prof fallback"
pgrep -f nv-hostengine >/dev/null || (nohup nv-hostengine >/var/log/nv-hostengine.log 2>&1 & sleep 2)
cfg_dir="/etc/dcgm-exporter"; default_cfg="$cfg_dir/default-counters.csv"; no_prof_cfg="$cfg_dir/no-prof.csv"
if [[ -f "$default_cfg" ]]; then
grep -v 'DCGM_FI_PROF_' "$default_cfg" > "$no_prof_cfg" || true
pkill -f dcgm-exporter >/dev/null 2>&1 || true
nohup /usr/local/bin/dcgm-exporter --address="${DCGM_EXPORTER_LISTEN:-:9400}" --collectors "$no_prof_cfg" >/var/log/dcgm-exporter.log 2>&1 &
fi
fi
# 6) post-install selfcheck (best-effort) and wait for node.json
for i in {1..30}; do
if compgen -G "$INSTALL_DIR/versions/*/check_health.sh" > /dev/null; then
bash "$INSTALL_DIR"/versions/*/check_health.sh || true
break
fi
sleep 2
done
host="$(hostname)"
state_dir="/private/argus/agent/${host}"
mkdir -p "$state_dir" 2>/dev/null || true
for i in {1..60}; do
if [[ -s "$state_dir/node.json" ]]; then
echo "[BOOT] node state present: $state_dir/node.json"
break
fi
sleep 2
done
# 7) spawn health watcher (best-effort, non-blocking)
ver_dir=""
if [[ -L "$INSTALL_DIR/current" ]]; then
ver_dir="$(readlink -f "$INSTALL_DIR/current" 2>/dev/null || true)"
fi
if [[ -z "$ver_dir" ]]; then
ver_dir="$(ls -d "$INSTALL_DIR"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
fi
if command -v /usr/local/bin/health-watcher.sh >/dev/null 2>&1; then
echo "[BOOT] starting health watcher for $ver_dir"
setsid /usr/local/bin/health-watcher.sh "${ver_dir:-}" >/var/log/health-watcher.log 2>&1 < /dev/null || true &
else
echo "[BOOT][WARN] health-watcher.sh not found; skip health watcher"
fi
echo "[BOOT] ready; entering sleep"
exec sleep infinity

View File

@ -26,9 +26,12 @@ service_id() {
send_logs() {
local sid="$1"; local hosttag="$2"
docker exec "$sid" sh -lc 'mkdir -p /logs/train /logs/infer'
docker exec "$sid" sh -lc "ts=\$(date -u +%Y-%m-%dT%H:%M:%SZ); echo \"\$ts INFO [$hosttag] training step=1 loss=1.23 model=bert\" >> /logs/train/train-demo.log"
docker exec "$sid" sh -lc "ts=\$(date -u +%Y-%m-%dT%H:%M:%SZ); echo \"\$ts INFO [$hosttag] training step=2 loss=1.10 model=bert\" >> /logs/train/train-demo.log"
docker exec "$sid" sh -lc "ts=\$(date -u +%Y-%m-%dT%H:%M:%SZ); echo \"\$ts WARN [$hosttag] inference slow on batch=2 latency=1.9s\" >> /logs/infer/infer-demo.log"
docker exec "$sid" sh -lc "ts=\
\$(date '+%F %T'); echo \"\$ts INFO [$hosttag] training step=1 loss=1.23 model=bert\" >> /logs/train/train-demo.log"
docker exec "$sid" sh -lc "ts=\
\$(date '+%F %T'); echo \"\$ts INFO [$hosttag] training step=2 loss=1.10 model=bert\" >> /logs/train/train-demo.log"
docker exec "$sid" sh -lc "ts=\
\$(date '+%F %T'); echo \"\$ts WARN [$hosttag] inference slow on batch=2 latency=1.9s\" >> /logs/infer/infer-demo.log"
}
CID_A="$(service_id node-a)"

View File

@ -1,24 +0,0 @@
SERVER_PROJECT=argus-swarm-server
NODES_PROJECT=argus-swarm-nodes
# Host ports for server compose
MASTER_PORT=32300
ES_HTTP_PORT=9200
KIBANA_PORT=5601
PROMETHEUS_PORT=9090
GRAFANA_PORT=3000
ALERTMANAGER_PORT=9093
WEB_PROXY_PORT_8080=8080
WEB_PROXY_PORT_8081=8081
WEB_PROXY_PORT_8082=8082
WEB_PROXY_PORT_8083=8083
WEB_PROXY_PORT_8084=8084
WEB_PROXY_PORT_8085=8085
# UID/GID for volume ownership in containers
ARGUS_BUILD_UID=2133
ARGUS_BUILD_GID=2015
# Node bundle images
NODE_BUNDLE_IMAGE_TAG=argus-sys-metric-test-node-bundle:latest
NODE_GPU_BUNDLE_IMAGE_TAG=argus-sys-metric-test-node-bundle-gpu:latest

View File

@ -1,10 +0,0 @@
BINDIP=10.0.4.25
FTPIP=10.0.4.29
MASTER_ENDPOINT=http://master.argus.com:3000
FTP_USER=ftpuser
FTP_PASSWORD=ZGClab1234!
AGENT_ENV=lm1
AGENT_USER=yuyr
AGENT_INSTANCE=node001sX
NODE_HOSTNAME=lm1
GPU_NODE_HOSTNAME=lm1

View File

@ -1,7 +0,0 @@
private-*/
tmp/
.env
.env.nodes

View File

@ -1,94 +0,0 @@
# Swarm Tests (argus-sys-net)
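Quickly verify an end-to-end "server + single node" deployment on the local machine using Docker Swarm and an overlay network. It stays compatible with `src/sys/tests` and does not affect the existing bridge-network tests.
## Prerequisites
- Docker Engine has Swarm enabled (the scripts run `swarm init` automatically in single-machine mode).
- The following images have been built and loaded: `argus-master:latest`, `argus-elasticsearch:latest`, `argus-kibana:latest`, `argus-metric-prometheus:latest`, `argus-metric-grafana:latest`, `argus-alertmanager:latest`, `argus-web-frontend:latest`, `argus-web-proxy:latest`, and the node image `argus-sys-metric-test-node-bundle:latest` (see below).
- The local `UID/GID` should preferably be set in `configs/build_user.local.conf`, which the scripts read:
  - `UID=1000`, `GID=1000` (example).
## Building the node bundle image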
```
./deployment/build/build_images.sh --with-node-bundle --client-version 20251106
```
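Note: `--client-version` accepts either a `YYYYMMDD` date package or a `1.xx.yy` component version. After packaging, the image `argus-sys-metric-test-node-bundle:latest` has `argus-metric_*.tar.gz` built in, and the container installs from the local bundle first at startup.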
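As an illustration of the two accepted formats, here is a minimal sketch of a validator. The function name `is_valid_client_version` is hypothetical and is not part of `build_images.sh`. The pattern assumes eight digits for a date package and a three-part numeric version with major version 1, which is only one reading of `YYYYMMDD` and `1.xx.yy`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): succeed when the argument looks like
# a YYYYMMDD date package or a 1.xx.yy component version.
# Assumption: no calendar validation is done on the date form.
is_valid_client_version() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{8}|1\.[0-9]+\.[0-9]+)$'
}
```

A caller could guard the build with `is_valid_client_version "$ver" || { echo "bad --client-version: $ver" >&2; exit 1; }`.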
## Steps
```
cd src/sys/swarm_tests
cp .env.example .env
bash scripts/00_bootstrap.sh
bash scripts/01_server_up.sh
bash scripts/02_wait_ready.sh # writes MASTER_ENDPOINT/AGENT_* to .env.nodes
bash scripts/03_nodes_up.sh
bash scripts/04_metric_verify.sh
```
Cleanup:
```
bash scripts/99_down.sh
```
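## Notes
- `00_bootstrap.sh`: loads `scripts/common/build_user.sh`, prints `ARGUS_BUILD_UID/GID` and writes them to `.env`, then prepares the `private-server/` and `private-nodes/` directories and `chown`s them to that UID/GID.
- `01_server_up.sh`: starts the server compose project. Set `SWARM_FIX_PERMS=1` to enable the fallback logic ("chmod inside the containers + supervisor restart"); it is off by default.
- `02_wait_ready.sh`: waits for Master/ES/Prometheus/Grafana to become ready (Kibana may lag behind), then writes `MASTER_ENDPOINT/AGENT_*` to `.env.nodes` for the node compose file. DNS is handled by Docker's built-in service, so BINDIP/FTPIP are no longer needed.
- `03_nodes_up.sh`: starts the single node container (bundle version). Inside the container, `node-bootstrap.sh` prefers the local install; on success it runs the health check and waits for `/private/argus/agent/<hostname>/node.json` to appear.
- `04_metric_verify.sh`: runs the detailed checks within this suite (it no longer calls the tests scripts directly):
  - Grafana `/api/health` (database=ok)
  - The Grafana datasource points to `prom.metric.argus.com:<port>` and the domain resolves inside the container
  - Prometheus `activeTargets` are all up
  - `nodes.json` does not contain `172.22/16` (docker_gwbridge)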
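The last check in the list can be sketched as a small function. The name `has_gwbridge_ip` is hypothetical. The `"ip"` field name follows what `04_metric_verify.sh` greps for, and the rest of the `nodes.json` layout is not assumed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when FILE contains an "ip" value inside
# 172.22.0.0/16, the range the README associates with docker_gwbridge.
has_gwbridge_ip() {
  grep -Eq '"ip"[[:space:]]*:[[:space:]]*"172\.22\.' "$1"
}
```

For example, `has_gwbridge_ip private-server/argus/metric/prometheus/nodes.json && echo "gwbridge address found" >&2` would flag the file the verify script inspects.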
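## Common issues
- Grafana/Kibana report permission errors at startup: check that the UID/GID in `configs/build_user.local.conf` matches the UID/GID printed by `00_bootstrap.sh`; if necessary, set `SWARM_FIX_PERMS=1` and rerun `01_server_up.sh`.
- The node container falls back to FTP: this is usually caused by a malformed bundle or a failed health check (earlier scripts ran the check under `sh`). The current `node-bootstrap.sh` runs the health check with `bash` and skips FTP once the local install succeeds.
- The proxy returns 502: inspect `/var/log/nginx/error.log` in the `argus-web-proxy` container and the `upstream check` lines in its startup log. If a backend is not ready yet (Kibana in particular), wait until `02_wait_ready.sh` passes before accessing it.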
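The UID/GID comparison from the first item could be automated roughly as follows. Both function names are hypothetical, and the sketch assumes plain `KEY=VALUE` lines in both files, as in the `UID=1000` / `GID=1000` example and the `ARGUS_BUILD_UID/GID` entries of `.env`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of the last KEY=VALUE line for KEY in
# FILE, or nothing if the key is absent.
conf_value() {
  sed -n "s/^$1=//p" "$2" | tail -n 1
}

# Hypothetical helper: succeed when UID/GID in the build-user config ($1)
# match ARGUS_BUILD_UID/ARGUS_BUILD_GID in the env file ($2).
# A missing UID in the config counts as a mismatch.
build_user_matches_env() {
  [ -n "$(conf_value UID "$1")" ] &&
    [ "$(conf_value UID "$1")" = "$(conf_value ARGUS_BUILD_UID "$2")" ] &&
    [ "$(conf_value GID "$1")" = "$(conf_value ARGUS_BUILD_GID "$2")" ]
}
```

Run from the suite directory, a call such as `build_user_matches_env ../../../configs/build_user.local.conf .env` would report a stale `.env` before the server is started; the relative path is an assumption about the repository layout.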
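### Network warmup for starting a GPU node with compose on a worker (overlay not found)
In a multi-host Swarm, running `05_gpu_node_up.sh` directly on a worker (e.g. `lm1`) may fail because `docker compose` pre-checks the external overlay `argus-sys-net` locally and reports `network ... not found`. This happens because the worker has not yet "joined" the overlay locally.
Workaround: first start a temporary container on the worker that attaches to the overlay to "warm up" the network, then run the GPU compose.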
```
# On the worker node (lm1)
cd src/sys/swarm_tests
set -a; source .env; source .env.nodes; set +a
# Warm up the overlay (exits automatically after the default 600s timeout; safe to rerun)
bash scripts/05a_net_warmup.sh
# Then start the GPU node
bash scripts/05_gpu_node_up.sh
```
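During cleanup, `scripts/99_down.sh` also removes the warmup container `argus-net-warmup`.
The preferred approach is to switch to `docker stack deploy` and let the manager schedule the GPU nodes (this supports incremental scale-out and node constraints); see `specs/issues/2025-11-07-swarm-compose-worker-overlay-network-not-found-lm1.md` for details.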
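The warmup workaround amounts to starting a container on the overlay and then polling until the local engine can see the network. A minimal sketch of the polling step follows, using a hypothetical helper named `retry_until` that is not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command up to MAX times, sleeping DELAY seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
# The command's own output is discarded.
retry_until() {
  local max="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$max" ]; do
    "$@" >/dev/null 2>&1 && return 0
    if [ "$i" -lt "$max" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
}
```

On a worker this could be called as `retry_until 60 1 docker network inspect argus-sys-net`, which mirrors the 60 one-second checks in `05a_net_warmup.sh`.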
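### Optional: deploy GPU nodes as a stack (run on the manager)
Prerequisites: `00_bootstrap.sh` and `01_server_up.sh` have been run on the manager (lm2), `.env.nodes` has been generated by `02_wait_ready.sh`, and the target GPU nodes carry the label `argus.gpu=true`.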
```
cd src/sys/swarm_tests
# Label the GPU node (example)
docker node update --label-add argus.gpu=true lm1
# Optionally override the mount path (the same path must exist on every GPU node)
export AGENT_VOLUME_PATH=/data1/yuyr/dev/argus/src/sys/swarm_tests/private-gpu-nodes/argus/agent
# Deploy from the manager (global mode: one replica is started on each labeled node)
bash scripts/05b_gpu_stack_deploy.sh
# Inspect
docker stack services argus-swarm-gpu
docker stack ps argus-swarm-gpu
```
Remove the stack: `docker stack rm argus-swarm-gpu` (this does not delete the overlay network or the data directories).

@ -1,33 +0,0 @@
version: "3.8"
networks:
argus-sys-net:
external: true
services:
metric-gpu-node:
image: ${NODE_GPU_BUNDLE_IMAGE_TAG:-argus-sys-metric-test-node-bundle-gpu:latest}
container_name: argus-metric-gpu-node-swarm
hostname: ${GPU_NODE_HOSTNAME:-swarm-metric-gpu-001}
restart: unless-stopped
privileged: true
runtime: nvidia
environment:
- TZ=Asia/Shanghai
- DEBIAN_FRONTEND=noninteractive
- MASTER_ENDPOINT=${MASTER_ENDPOINT:-http://master.argus.com:3000}
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
- AGENT_ENV=${AGENT_ENV:-dev2}
- AGENT_USER=${AGENT_USER:-yuyr}
- AGENT_INSTANCE=${AGENT_INSTANCE:-gpu001sX}
- NVIDIA_VISIBLE_DEVICES=all
- NVIDIA_DRIVER_CAPABILITIES=compute,utility
- GPU_MODE=gpu
networks:
argus-sys-net:
aliases:
- ${AGENT_INSTANCE}.node.argus.com
volumes:
- ./private-gpu-nodes/argus/agent:/private/argus/agent
command: ["sleep", "infinity"]

@ -1,31 +0,0 @@
version: "3.8"
networks:
argus-sys-net:
external: true
services:
metric-test-node:
image: ${NODE_BUNDLE_IMAGE_TAG:-argus-sys-metric-test-node-bundle:latest}
container_name: argus-metric-test-node-swarm
hostname: ${NODE_HOSTNAME:-swarm-metric-node-001}
restart: unless-stopped
environment:
- TZ=Asia/Shanghai
- DEBIAN_FRONTEND=noninteractive
- MASTER_ENDPOINT=${MASTER_ENDPOINT:-http://master.argus.com:3000}
- ES_HOST=es.log.argus.com
- ES_PORT=9200
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
- AGENT_ENV=${AGENT_ENV:-dev2}
- AGENT_USER=${AGENT_USER:-yuyr}
- AGENT_INSTANCE=${AGENT_INSTANCE:-node001sX}
- CLIENT_VERSION=${CLIENT_VERSION:-}
networks:
argus-sys-net:
aliases:
- ${AGENT_INSTANCE}.node.argus.com
volumes:
- ./private-nodes/argus/agent:/private/argus/agent
command: ["sleep", "infinity"]

@ -1,170 +0,0 @@
version: "3.8"
networks:
argus-sys-net:
external: true
services:
master:
image: ${MASTER_IMAGE_TAG:-argus-master:latest}
container_name: argus-master-sys
depends_on: []
environment:
- OFFLINE_THRESHOLD_SECONDS=6
- ONLINE_THRESHOLD_SECONDS=2
- SCHEDULER_INTERVAL_SECONDS=1
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
ports:
- "${MASTER_PORT:-32300}:3000"
volumes:
- ./private-server/argus/master:/private/argus/master
- ./private-server/argus/metric/prometheus:/private/argus/metric/prometheus
- ./private-server/argus/etc:/private/argus/etc
networks:
argus-sys-net:
aliases:
- master.argus.com
restart: unless-stopped
es:
image: ${ES_IMAGE_TAG:-argus-elasticsearch:latest}
container_name: argus-es-sys
environment:
- discovery.type=single-node
- xpack.security.enabled=false
- ES_JAVA_OPTS=-Xms512m -Xmx512m
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
volumes:
- ./private-server/argus/log/elasticsearch:/private/argus/log/elasticsearch
- ./private-server/argus/etc:/private/argus/etc
ports:
- "${ES_HTTP_PORT:-9200}:9200"
restart: unless-stopped
networks:
argus-sys-net:
aliases:
- es.log.argus.com
kibana:
image: ${KIBANA_IMAGE_TAG:-argus-kibana:latest}
container_name: argus-kibana-sys
environment:
- ELASTICSEARCH_HOSTS=http://es.log.argus.com:9200
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
volumes:
- ./private-server/argus/log/kibana:/private/argus/log/kibana
- ./private-server/argus/etc:/private/argus/etc
depends_on: [es]
ports:
- "${KIBANA_PORT:-5601}:5601"
restart: unless-stopped
networks:
argus-sys-net:
aliases:
- kibana.log.argus.com
prometheus:
image: ${PROM_IMAGE_TAG:-argus-metric-prometheus:latest}
container_name: argus-prometheus
restart: unless-stopped
environment:
- TZ=Asia/Shanghai
- PROMETHEUS_BASE_PATH=/private/argus/metric/prometheus
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
ports:
- "${PROMETHEUS_PORT:-9090}:9090"
volumes:
- ./private-server/argus/metric/prometheus:/private/argus/metric/prometheus
- ./private-server/argus/etc:/private/argus/etc
networks:
argus-sys-net:
aliases:
- prom.metric.argus.com
grafana:
image: ${GRAFANA_IMAGE_TAG:-argus-metric-grafana:latest}
container_name: argus-grafana
restart: unless-stopped
environment:
- TZ=Asia/Shanghai
- GRAFANA_BASE_PATH=/private/argus/metric/grafana
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
- GF_SERVER_HTTP_PORT=3000
- GF_LOG_LEVEL=warn
- GF_LOG_MODE=console
- GF_PATHS_PROVISIONING=/private/argus/metric/grafana/provisioning
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
ports:
- "${GRAFANA_PORT:-3000}:3000"
volumes:
- ./private-server/argus/metric/grafana:/private/argus/metric/grafana
- ./private-server/argus/etc:/private/argus/etc
depends_on: [prometheus]
networks:
argus-sys-net:
aliases:
- grafana.metric.argus.com
alertmanager:
image: ${ALERT_IMAGE_TAG:-argus-alertmanager:latest}
container_name: argus-alertmanager
environment:
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
volumes:
- ./private-server/argus/etc:/private/argus/etc
- ./private-server/argus/alert/alertmanager:/private/argus/alert/alertmanager
networks:
argus-sys-net:
aliases:
- alertmanager.alert.argus.com
ports:
- "${ALERTMANAGER_PORT:-9093}:9093"
restart: unless-stopped
web-frontend:
image: ${FRONT_IMAGE_TAG:-argus-web-frontend:latest}
container_name: argus-web-frontend
environment:
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
- EXTERNAL_MASTER_PORT=${WEB_PROXY_PORT_8085:-8085}
- EXTERNAL_ALERTMANAGER_PORT=${WEB_PROXY_PORT_8084:-8084}
- EXTERNAL_GRAFANA_PORT=${WEB_PROXY_PORT_8081:-8081}
- EXTERNAL_PROMETHEUS_PORT=${WEB_PROXY_PORT_8082:-8082}
- EXTERNAL_KIBANA_PORT=${WEB_PROXY_PORT_8083:-8083}
volumes:
- ./private-server/argus/etc:/private/argus/etc
networks:
argus-sys-net:
aliases:
- web.argus.com
restart: unless-stopped
web-proxy:
image: ${WEB_PROXY_IMAGE_TAG:-argus-web-proxy:latest}
container_name: argus-web-proxy
depends_on: [master, grafana, prometheus, kibana, alertmanager]
environment:
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
volumes:
- ./private-server/argus/etc:/private/argus/etc
networks:
argus-sys-net:
aliases:
- proxy.argus.com
ports:
- "${WEB_PROXY_PORT_8080:-8080}:8080"
- "${WEB_PROXY_PORT_8081:-8081}:8081"
- "${WEB_PROXY_PORT_8082:-8082}:8082"
- "${WEB_PROXY_PORT_8083:-8083}:8083"
- "${WEB_PROXY_PORT_8084:-8084}:8084"
- "${WEB_PROXY_PORT_8085:-8085}:8085"
restart: unless-stopped

@ -1,91 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
REPO_ROOT="$(cd "$ROOT/../../.." && pwd)"
ENV_FILE="$ROOT/.env"; [[ -f "$ENV_FILE" ]] || cp "$ROOT/.env.example" "$ENV_FILE"
# Load build user (UID/GID) from repo config to match container runtime users
if [[ -f "$REPO_ROOT/scripts/common/build_user.sh" ]]; then
# shellcheck disable=SC1091
source "$REPO_ROOT/scripts/common/build_user.sh" 2>/dev/null || true
if declare -f load_build_user >/dev/null 2>&1; then
load_build_user
fi
fi
# Capture resolved UID/GID from build_user before sourcing .env
uid_resolved="${ARGUS_BUILD_UID:-2133}"
gid_resolved="${ARGUS_BUILD_GID:-2015}"
echo "[BOOT] resolved build user: UID=${uid_resolved} GID=${gid_resolved} (from scripts/common/build_user.sh or env)"
# After resolving UID/GID, load .env for other settings; then we will overwrite UID/GID entries
set -a; source "$ENV_FILE"; set +a
echo "[BOOT] checking Docker Swarm"
if ! docker info 2>/dev/null | grep -q "Swarm: active"; then
echo "[BOOT] initializing swarm (single-node)"
docker swarm init >/dev/null 2>&1 || true
fi
NET_NAME=argus-sys-net
if docker network inspect "$NET_NAME" >/dev/null 2>&1; then
echo "[BOOT] overlay network exists: $NET_NAME"
else
echo "[BOOT] creating overlay network: $NET_NAME"
docker network create -d overlay --attachable "$NET_NAME"
fi
echo "[BOOT] preparing private directories (server/nodes)"
# Server-side dirs (align with sys/tests 01_bootstrap.sh)
mkdir -p \
"$ROOT/private-server/argus/etc" \
"$ROOT/private-server/argus/master" \
"$ROOT/private-server/argus/metric/prometheus" \
"$ROOT/private-server/argus/metric/prometheus/data" \
"$ROOT/private-server/argus/metric/prometheus/rules" \
"$ROOT/private-server/argus/metric/prometheus/targets" \
"$ROOT/private-server/argus/alert/alertmanager" \
"$ROOT/private-server/argus/metric/ftp/share" \
"$ROOT/private-server/argus/metric/grafana/data" \
"$ROOT/private-server/argus/metric/grafana/logs" \
"$ROOT/private-server/argus/metric/grafana/plugins" \
"$ROOT/private-server/argus/metric/grafana/provisioning/datasources" \
"$ROOT/private-server/argus/metric/grafana/provisioning/dashboards" \
"$ROOT/private-server/argus/metric/grafana/data/sessions" \
"$ROOT/private-server/argus/metric/grafana/data/dashboards" \
"$ROOT/private-server/argus/metric/grafana/config" \
"$ROOT/private-server/argus/agent" \
"$ROOT/private-server/argus/log/elasticsearch" \
"$ROOT/private-server/argus/log/kibana"
mkdir -p "$ROOT/private-nodes/argus/agent"
uid="$uid_resolved"; gid="$gid_resolved"
echo "[BOOT] chown -R ${uid}:${gid} for server core dirs (best-effort)"
chown -R "$uid":"$gid" \
"$ROOT/private-server/argus/log/elasticsearch" \
"$ROOT/private-server/argus/log/kibana" \
"$ROOT/private-server/argus/metric/grafana" \
"$ROOT/private-server/argus/metric/prometheus" \
"$ROOT/private-server/argus/alert" \
"$ROOT/private-server/argus/agent" \
"$ROOT/private-server/argus/etc" 2>/dev/null || true
chmod -R g+w "$ROOT/private-server/argus/alert" "$ROOT/private-server/argus/etc" 2>/dev/null || true
# ensure .env carries the resolved UID/GID for compose env interpolation
if grep -q '^ARGUS_BUILD_UID=' "$ENV_FILE"; then
sed -i "s/^ARGUS_BUILD_UID=.*/ARGUS_BUILD_UID=${uid}/" "$ENV_FILE"
else
echo "ARGUS_BUILD_UID=${uid}" >> "$ENV_FILE"
fi
if grep -q '^ARGUS_BUILD_GID=' "$ENV_FILE"; then
sed -i "s/^ARGUS_BUILD_GID=.*/ARGUS_BUILD_GID=${gid}/" "$ENV_FILE"
else
echo "ARGUS_BUILD_GID=${gid}" >> "$ENV_FILE"
fi
echo "[BOOT] done"

@ -1,39 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
REPO_ROOT="$(cd "$ROOT/../../.." && pwd)"
ENV_FILE="$ROOT/.env"
# load UID/GID from repo config first (so they take precedence over any stale .env values)
if [[ -f "$REPO_ROOT/scripts/common/build_user.sh" ]]; then
# shellcheck disable=SC1091
source "$REPO_ROOT/scripts/common/build_user.sh" 2>/dev/null || true
if declare -f load_build_user >/dev/null 2>&1; then
load_build_user
fi
fi
set -a; source "$ENV_FILE"; set +a
PROJECT="${SERVER_PROJECT:-argus-swarm-server}"
COMPOSE_FILE="$ROOT/docker-compose.server.yml"
echo "[SERVER] starting compose project: $PROJECT"
docker compose -p "$PROJECT" -f "$COMPOSE_FILE" up -d
echo "[SERVER] containers:"; docker compose -p "$PROJECT" -f "$COMPOSE_FILE" ps
# Optional post-start permission alignment (disabled by default). Enable with SWARM_FIX_PERMS=1
if [[ "${SWARM_FIX_PERMS:-0}" == "1" ]]; then
echo "[SERVER] aligning permissions in containers (best-effort)"
for c in argus-master-sys argus-prometheus argus-grafana argus-ftp argus-es-sys argus-kibana-sys argus-web-frontend argus-web-proxy argus-alertmanager; do
docker exec "$c" sh -lc 'mkdir -p /private/argus && chmod -R 777 /private/argus' 2>/dev/null || true
done
echo "[SERVER] restarting selected supervised programs to pick up new permissions"
docker exec argus-prometheus sh -lc 'supervisorctl restart prometheus targets-updater >/dev/null 2>&1 || true' || true
docker exec argus-grafana sh -lc 'rm -f /private/argus/etc/grafana.metric.argus.com 2>/dev/null || true; supervisorctl restart grafana >/dev/null 2>&1 || true' || true
docker exec argus-es-sys sh -lc 'supervisorctl restart elasticsearch >/dev/null 2>&1 || true' || true
docker exec argus-kibana-sys sh -lc 'supervisorctl restart kibana >/dev/null 2>&1 || true' || true
fi
echo "[SERVER] done"

@ -1,47 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
ENV_FILE="$ROOT/.env"; set -a; source "$ENV_FILE"; set +a
PROJECT="${SERVER_PROJECT:-argus-swarm-server}"
RETRIES=${RETRIES:-60}
SLEEP=${SLEEP:-5}
code() { curl -4 -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
prom_ok() {
# Consider ready if TCP:9090 is accepting on localhost (host side)
(exec 3<>/dev/tcp/127.0.0.1/${PROMETHEUS_PORT:-9090}) >/dev/null 2>&1 && return 0
return 1
}
echo "[READY] waiting services (max $((RETRIES*SLEEP))s)"
for i in $(seq 1 "$RETRIES"); do
e1=$(code "http://127.0.0.1:${MASTER_PORT:-32300}/readyz")
e2=$(code "http://127.0.0.1:${ES_HTTP_PORT:-9200}/_cluster/health")
e3=000
if prom_ok; then e3=200; fi
e4=$(code "http://127.0.0.1:${GRAFANA_PORT:-3000}/api/health")
e5=$(code "http://127.0.0.1:${KIBANA_PORT:-5601}/api/status")
ok=0
[[ "$e1" == 200 ]] && ok=$((ok+1))
[[ "$e2" == 200 ]] && ok=$((ok+1))
[[ "$e3" == 200 ]] && ok=$((ok+1))
[[ "$e4" == 200 ]] && ok=$((ok+1))
# Kibana is allowed to lag; the other four checks are sufficient
if [[ $ok -ge 4 ]]; then echo "[READY] base services OK"; break; fi
echo "[..] waiting ($i/$RETRIES): master=$e1 es=$e2 prom=$e3 graf=$e4 kibana=$e5"; sleep "$SLEEP"
done
if [[ $ok -lt 4 ]]; then echo "[ERROR] services not ready" >&2; exit 1; fi
ENV_NODES="$ROOT/.env.nodes"
cat > "$ENV_NODES" <<EOF
MASTER_ENDPOINT=http://master.argus.com:3000
AGENT_ENV=dev2
AGENT_USER=yuyr
AGENT_INSTANCE=node001sX
EOF
echo "[READY] wrote $ENV_NODES (MASTER_ENDPOINT/AGENT_* only)"

@ -1,16 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
ENV_FILE="$ROOT/.env"; set -a; source "$ENV_FILE"; set +a
ENV_NODES_FILE="$ROOT/.env.nodes"; set -a; source "$ENV_NODES_FILE"; set +a
PROJECT="${NODES_PROJECT:-argus-swarm-nodes}"
COMPOSE_FILE="$ROOT/docker-compose.nodes.yml"
echo "[NODES] starting compose project: $PROJECT"
docker compose -p "$PROJECT" --env-file "$ENV_NODES_FILE" -f "$COMPOSE_FILE" up -d
docker compose -p "$PROJECT" -f "$COMPOSE_FILE" ps
echo "[NODES] done"

@ -1,268 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
ENV_FILE="$ROOT/.env"; [[ -f "$ENV_FILE" ]] && { set -a; source "$ENV_FILE"; set +a; }
PROM_PORT="${PROMETHEUS_PORT:-9090}"
GRAF_PORT="${GRAFANA_PORT:-3000}"
GRAF_URL="http://127.0.0.1:${GRAF_PORT}"
PROM_DOMAIN="prom.metric.argus.com:${PROM_PORT}"
NODE_CONT="${SWARM_NODE_CNAME:-argus-metric-test-node-swarm}"
err() { echo "[ERR] $*" >&2; }
ok() { echo "[OK] $*"; }
info(){ echo "[INFO] $*"; }
fail() { err "$*"; exit 1; }
# Ensure fluent-bit is installed, configured and running to ship logs to ES
# Best-effort remediation for swarm_tests only (does not change repo sources)
ensure_fluentbit() {
local cname="$1"
# 1) ensure process exists or try local bundle installer
if ! docker exec "$cname" pgrep -x fluent-bit >/dev/null 2>&1; then
docker exec "$cname" bash -lc '
set -e
root=/opt/argus-metric/versions
ver=$(ls -1 "$root" 2>/dev/null | sort -Vr | head -1 || true)
[[ -z "$ver" ]] && ver=1.42.0
verdir="$root/$ver"
tb=$(ls -1 "$verdir"/fluent-bit-*.tar.gz 2>/dev/null | head -1 || true)
if [ -n "$tb" ]; then tmp=$(mktemp -d); tar -xzf "$tb" -C "$tmp"; sub=$(find "$tmp" -mindepth 1 -maxdepth 1 -type d | head -n1 || true); [ -n "$sub" ] && (cd "$sub" && ./install.sh "$verdir") || true; fi
' >/dev/null 2>&1 || true
fi
# 2) patch configs using literal placeholders with safe delimiter
docker exec "$cname" bash -lc '
set -e
f=/etc/fluent-bit/fluent-bit.conf
o=/etc/fluent-bit/outputs.d/10-es.conf
LCL="\${CLUSTER}"; LRA="\${RACK}"; LHN="\${HOSTNAME}"; EH="\${ES_HOST:-localhost}"; EP="\${ES_PORT:-9200}"
# record_modifier placeholders
if grep -q "Record cluster $LCL" "$f"; then sed -i "s|Record cluster $LCL|Record cluster local|" "$f"; fi
if grep -q "Record rack $LRA" "$f"; then sed -i "s|Record rack $LRA|Record rack dev|" "$f"; fi
if grep -q "Record host $LHN" "$f"; then hn=$(hostname); sed -i "s|Record host $LHN|Record host ${hn}|" "$f"; fi
# outputs placeholders
if [ -f "$o" ] && (grep -q "$EH" "$o" || grep -q "$EP" "$o"); then
sed -i "s|Host $EH|Host es.log.argus.com|g; s|Port $EP|Port 9200|g" "$o"
fi
# ensure parser supports ISO8601 with timezone
p=/etc/fluent-bit/parsers.conf
if [ -f "$p" ]; then
if grep -q "Time_Format %Y-%m-%d %H:%M:%S" "$p"; then
sed -i "s|Time_Format %Y-%m-%d %H:%M:%S|Time_Format %Y-%m-%dT%H:%M:%S%z|" "$p"
fi
if grep -q "Regex ^(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})\\s+" "$p"; then
sed -i "s|Regex ^(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})\\s+|Regex ^(?<timestamp>\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}(?:Z|[+-]\\d{2}:?\\d{2}))\\s+|" "$p"
fi
fi
' >/dev/null 2>&1 || true
# 3) restart fluent-bit (best-effort) and wait
docker exec "$cname" bash -lc 'pkill -x fluent-bit >/dev/null 2>&1 || true; sleep 1; setsid su -s /bin/bash fluent-bit -c "/opt/fluent-bit/bin/fluent-bit --config=/etc/fluent-bit/fluent-bit.conf >> /var/log/fluent-bit.log 2>&1" &>/dev/null & echo ok' >/dev/null 2>&1 || true
for i in {1..10}; do if docker exec "$cname" pgrep -x fluent-bit >/dev/null 2>&1; then return 0; fi; sleep 1; done
echo "[WARN] fluent-bit not confirmed running; log pipeline may not ingest" >&2
}
# ---- Grafana /api/health ----
info "Grafana /api/health"
HEALTH_JSON="$ROOT/tmp/metric-verify/graf_health.json"
mkdir -p "$(dirname "$HEALTH_JSON")"
code=$(curl -fsS -o "$HEALTH_JSON" -w '%{http_code}' --max-time 10 "$GRAF_URL/api/health" || true)
[[ "$code" == 200 ]] || fail "/api/health HTTP $code"
if grep -q '"database"\s*:\s*"ok"' "$HEALTH_JSON"; then ok "grafana health database=ok"; else fail "grafana health not ok: $(cat "$HEALTH_JSON")"; fi
# ---- Grafana datasource points to prom domain ----
info "Grafana datasource URL uses domain: $PROM_DOMAIN"
DS_FILE="/private/argus/metric/grafana/provisioning/datasources/datasources.yml"
if ! docker exec argus-grafana sh -lc "test -f $DS_FILE" >/dev/null 2>&1; then
DS_FILE="/etc/grafana/provisioning/datasources/datasources.yml"
fi
docker exec argus-grafana sh -lc "grep -E 'url:\s*http://$PROM_DOMAIN' '$DS_FILE'" >/dev/null 2>&1 || fail "datasource not pointing to $PROM_DOMAIN"
ok "datasource points to domain"
# ---- DNS resolution inside grafana (via Docker DNS + FQDN alias) ----
info "FQDN resolution inside grafana (Docker DNS)"
tries=0
until docker exec argus-grafana getent hosts prom.metric.argus.com >/dev/null 2>&1; do
tries=$((tries+1)); (( tries > 24 )) && fail "grafana cannot resolve prom.metric.argus.com"
echo "[..] waiting DNS propagation in grafana ($tries/24)"; sleep 5
done
ok "domain resolves"
# ---- Prometheus activeTargets down check ----
info "Prometheus activeTargets health"
targets_json="$ROOT/tmp/metric-verify/prom_targets.json"
curl -fsS "http://127.0.0.1:${PROM_PORT}/api/v1/targets" -o "$targets_json" || { echo "[WARN] fetch targets failed" >&2; }
down_all=""
if command -v jq >/dev/null 2>&1; then
down_all=$(jq -r '.data.activeTargets[] | select(.health=="down") | .scrapeUrl' "$targets_json" 2>/dev/null || true)
else
down_all=$(grep -o '"scrapeUrl":"[^"]\+"' "$targets_json" | sed 's/"scrapeUrl":"\(.*\)"/\1/' | paste -sd '\n' - | grep -v '^$' || true)
grep -q '"health":"down"' "$targets_json" && [ -z "$down_all" ] && down_all="(one or more targets down)"
fi
# ignore dcgm-exporter(9400) and tolerate node-exporter(9100) in swarm tests
down_filtered=$(echo "$down_all" | grep -Ev ':(9400|9100)/' || true)
if [[ -n "$down_filtered" ]]; then
err "prometheus down targets (filtered):"; echo "$down_filtered" >&2
else
ok "prometheus targets up (ignoring :9100 and :9400)"
fi
# ---- nodes.json sanity: avoid 172.22/16 (gwbridge) ----
nodes_json="$ROOT/private-server/argus/metric/prometheus/nodes.json"
if [[ -f "$nodes_json" ]] && grep -q '"ip"\s*:\s*"172\.22\.' "$nodes_json"; then
fail "nodes.json contains 172.22/16 addresses (gwbridge)"
fi
ok "nodes.json IPs look fine"
echo "[DONE] metric verify"
# ---- Log pipeline smoke test (adapted from sys/tests 07) ----
info "Log pipeline: send logs in node container and assert ES counts"
ES_PORT="${ES_HTTP_PORT:-9200}"
KIBANA_PORT="${KIBANA_PORT:-5601}"
get_count() {
local idx="$1"; local tmp; tmp=$(mktemp)
local code
code=$(curl -s -o "$tmp" -w "%{http_code}" "http://127.0.0.1:${ES_PORT}/${idx}/_count?ignore_unavailable=true&allow_no_indices=true" || true)
if [[ "$code" == "200" ]]; then
local val
val=$(jq -r '(.count // 0) | tonumber? // 0' "$tmp" 2>/dev/null || echo 0)
echo "$val"
else
echo 0
fi
rm -f "$tmp"
}
train0=$(get_count "train-*")
infer0=$(get_count "infer-*")
base=$((train0 + infer0))
info "initial ES counts: train=${train0} infer=${infer0} total=${base}"
send_logs() {
local cname="$1"; local hosttag="$2"
docker exec "$cname" sh -lc 'mkdir -p /logs/train /logs/infer'
docker exec "$cname" sh -lc "ts=\$(date -u +%Y-%m-%dT%H:%M:%SZ); echo \"\$ts INFO [$hosttag] training step=1 loss=1.23 model=bert\" >> /logs/train/train-demo.log"
docker exec "$cname" sh -lc "ts=\$(date -u +%Y-%m-%dT%H:%M:%SZ); echo \"\$ts INFO [$hosttag] training step=2 loss=1.10 model=bert\" >> /logs/train/train-demo.log"
docker exec "$cname" sh -lc "ts=\$(date -u +%Y-%m-%dT%H:%M:%SZ); echo \"\$ts WARN [$hosttag] inference slow on batch=2 latency=1.9s\" >> /logs/infer/infer-demo.log"
}
ensure_fluentbit "$NODE_CONT"
# ensure fluent-bit process is really up before sending logs,
# to avoid dropping lines when tail starts after we write test logs
FLUENT_WAIT_RETRIES="${FLUENT_WAIT_RETRIES:-120}"
FLUENT_WAIT_SLEEP="${FLUENT_WAIT_SLEEP:-2}"
fluent_ok=0
for i in $(seq 1 "$FLUENT_WAIT_RETRIES"); do
if docker exec "$NODE_CONT" pgrep -x fluent-bit >/dev/null 2>&1; then
fluent_ok=1
break
fi
echo "[..] waiting fluent-bit process up in node ($i/$FLUENT_WAIT_RETRIES)"
sleep "$FLUENT_WAIT_SLEEP"
done
if [[ "$fluent_ok" -ne 1 ]]; then
fail "fluent-bit not running in node after waiting $((FLUENT_WAIT_RETRIES * FLUENT_WAIT_SLEEP))s"
fi
send_logs "$NODE_CONT" "swarm-node"
info "waiting for ES to ingest..."
curl -s -X POST "http://127.0.0.1:${ES_PORT}/train-*/_refresh" >/dev/null 2>&1 || true
curl -s -X POST "http://127.0.0.1:${ES_PORT}/infer-*/_refresh" >/dev/null 2>&1 || true
final=0; threshold=3
for attempt in {1..60}; do
train1=$(get_count "train-*"); infer1=$(get_count "infer-*"); final=$((train1 + infer1))
if (( final > base && final >= threshold )); then break; fi
echo "[..] waiting ES counts increase to >=${threshold} ($attempt/60) current=${final} base=${base}"; \
curl -s -X POST "http://127.0.0.1:${ES_PORT}/train-*/_refresh" >/dev/null 2>&1 || true; \
curl -s -X POST "http://127.0.0.1:${ES_PORT}/infer-*/_refresh" >/dev/null 2>&1 || true; \
sleep 2
done
info "final ES counts: train=${train1} infer=${infer1} total=${final}"
(( final > base )) || fail "ES total did not increase (${base} -> ${final})"
(( final >= threshold )) || fail "ES total below expected threshold: ${final} < ${threshold}"
es_health=$(curl -s "http://127.0.0.1:${ES_PORT}/_cluster/health" | grep -o '"status":"[^\"]*"' | cut -d'"' -f4)
[[ "$es_health" == green || "$es_health" == yellow ]] || fail "ES health not green/yellow: $es_health"
if ! curl -fs "http://127.0.0.1:${KIBANA_PORT}/api/status" >/dev/null 2>&1; then
echo "[WARN] Kibana status endpoint not available" >&2
fi
ok "log pipeline verified"
# ---- Node status and health (node.json + metric-*) ----
info "Node status and health (node.json + metric components)"
NODE_HEALTH_RETRIES="${NODE_HEALTH_RETRIES:-5}"
NODE_HEALTH_SLEEP="${NODE_HEALTH_SLEEP:-5}"
if ! command -v jq >/dev/null 2>&1; then
fail "node health: jq not available on host; cannot parse node.json"
fi
node_health_ok=0
for attempt in $(seq 1 "$NODE_HEALTH_RETRIES"); do
tmp_node_json="$(mktemp)"
if ! docker exec "$NODE_CONT" sh -lc '
set -e
host="$(hostname)"
f="/private/argus/agent/${host}/node.json"
if [ ! -s "$f" ]; then
echo "[ERR] node.json missing or empty: $f" >&2
exit 1
fi
cat "$f"
' > "$tmp_node_json" 2>/dev/null; then
rm -f "$tmp_node_json"
info "node health: node.json not ready (attempt $attempt/$NODE_HEALTH_RETRIES)"
else
node_name="$(jq -r '.name // ""' "$tmp_node_json")"
node_status="$(jq -r '.status // ""' "$tmp_node_json")"
node_type="$(jq -r '.type // ""' "$tmp_node_json")"
if [[ -z "$node_name" || -z "$node_status" || -z "$node_type" ]]; then
info "node health: missing required fields in node.json (attempt $attempt/$NODE_HEALTH_RETRIES)"
elif [[ "$node_status" != "online" || "$node_type" != "agent" ]]; then
info "node health: status/type not ready yet (status=$node_status type=$node_type name=$node_name attempt $attempt/$NODE_HEALTH_RETRIES)"
else
all_ok=1
for comp in metric-argus-agent metric-node-exporter metric-dcgm-exporter metric-fluent-bit; do
cstatus="$(jq -r --arg c "$comp" '.health[$c].status // ""' "$tmp_node_json")"
cerror="$(jq -r --arg c "$comp" '.health[$c].error // ""' "$tmp_node_json")"
if [[ "$cstatus" != "healthy" ]]; then
info "node health: $comp status=$cstatus (attempt $attempt/$NODE_HEALTH_RETRIES)"
all_ok=0
break
fi
if [[ -n "$cerror" && "$cerror" != "null" ]]; then
info "node health: $comp error=$cerror (attempt $attempt/$NODE_HEALTH_RETRIES)"
all_ok=0
break
fi
done
if [[ "$all_ok" -eq 1 ]]; then
node_health_ok=1
rm -f "$tmp_node_json"
break
fi
fi
rm -f "$tmp_node_json"
fi
if [[ "$attempt" -lt "$NODE_HEALTH_RETRIES" ]]; then
sleep "$NODE_HEALTH_SLEEP"
fi
done
if [[ "$node_health_ok" -ne 1 ]]; then
fail "node health: node.json or metric components not healthy after ${NODE_HEALTH_RETRIES} attempts"
fi
ok "node status online and metric components healthy"

@ -1,48 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
ENV_FILE="$ROOT/.env"; set -a; source "$ENV_FILE"; set +a
ENV_NODES_FILE="$ROOT/.env.nodes"; set -a; source "$ENV_NODES_FILE"; set +a
PROJECT="${NODES_PROJECT:-argus-swarm-nodes}"
COMPOSE_FILE="$ROOT/docker-compose.nodes.yml"
NODE_CONT="${SWARM_NODE_CNAME:-argus-metric-test-node-swarm}"
echo "[RESTART] restarting node compose project: $PROJECT"
docker compose -p "$PROJECT" -f "$COMPOSE_FILE" restart
echo "[RESTART] waiting node container up: $NODE_CONT"
for i in {1..30}; do
state=$(docker ps --format '{{.Names}} {{.Status}}' | awk -v c="$NODE_CONT" '$1==c{print $2}' || true)
if [[ "$state" == Up* ]]; then
echo "[RESTART] node container is up"
break
fi
echo "[..] waiting node container up ($i/30)"
sleep 2
done
NODE_HEALTH_WAIT="${NODE_HEALTH_WAIT:-300}"
attempts=$(( NODE_HEALTH_WAIT / 30 ))
(( attempts < 1 )) && attempts=1
echo "[RESTART] waiting node health to recover (timeout=${NODE_HEALTH_WAIT}s)"
ok_flag=0
for i in $(seq 1 "$attempts"); do
if bash "$SCRIPT_DIR/04_metric_verify.sh"; then
echo "[RESTART] node restart verify passed on attempt $i/$attempts"
ok_flag=1
break
fi
echo "[..] 04_metric_verify failed after node restart; retrying ($i/$attempts)"
sleep 30
done
if [[ "$ok_flag" -ne 1 ]]; then
echo "[ERR] node restart: 04_metric_verify did not pass within ${NODE_HEALTH_WAIT}s" >&2
exit 1
fi

@ -1,22 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
ENV_FILE="$ROOT/.env"; set -a; source "$ENV_FILE"; set +a
PROJECT="${SERVER_PROJECT:-argus-swarm-server}"
COMPOSE_FILE="$ROOT/docker-compose.server.yml"
echo "[RESTART] restarting server compose project: $PROJECT"
docker compose -p "$PROJECT" -f "$COMPOSE_FILE" restart
echo "[RESTART] waiting server ready after restart"
bash "$SCRIPT_DIR/02_wait_ready.sh"
echo "[RESTART] running 04_metric_verify after server restart"
bash "$SCRIPT_DIR/04_metric_verify.sh"
echo "[RESTART] server restart + verify passed"

@ -1,33 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
ENV_FILE="$ROOT/.env"; [[ -f "$ENV_FILE" ]] && { set -a; source "$ENV_FILE"; set +a; }
ENV_NODES_FILE="$ROOT/.env.nodes"; [[ -f "$ENV_NODES_FILE" ]] && { set -a; source "$ENV_NODES_FILE"; set +a; }
PROJECT="${GPU_PROJECT:-argus-swarm-gpu}"
COMPOSE_FILE="$ROOT/docker-compose.gpu-node.yml"
# Prepare private dir
mkdir -p "$ROOT/private-gpu-nodes/argus/agent"
echo "[GPU] checking host NVIDIA driver/runtime"
if ! command -v nvidia-smi >/dev/null 2>&1; then
echo "[ERR] nvidia-smi not found on host; install NVIDIA driver/runtime first" >&2
exit 1
fi
echo "[GPU] starting compose project: $PROJECT"
docker compose -p "$PROJECT" --env-file "$ENV_NODES_FILE" -f "$COMPOSE_FILE" up -d
docker compose -p "$PROJECT" -f "$COMPOSE_FILE" ps
echo "[GPU] container GPU visibility"
if ! docker exec argus-metric-gpu-node-swarm nvidia-smi -L >/dev/null 2>&1; then
echo "[WARN] nvidia-smi failed inside container; check --gpus/runtime/driver" >&2
else
docker exec argus-metric-gpu-node-swarm nvidia-smi -L || true
fi
echo "[GPU] done"


@ -1,44 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
ENV_FILE="$ROOT/.env"; [[ -f "$ENV_FILE" ]] && { set -a; source "$ENV_FILE"; set +a; }
ENV_NODES_FILE="$ROOT/.env.nodes"; [[ -f "$ENV_NODES_FILE" ]] && { set -a; source "$ENV_NODES_FILE"; set +a; }
NET_NAME="${NET_NAME:-argus-sys-net}"
WARMUP_NAME="${WARMUP_NAME:-argus-net-warmup}"
WARMUP_IMAGE="${WARMUP_IMAGE:-busybox:latest}"
WARMUP_SECONDS="${WARMUP_SECONDS:-600}"
echo "[NET] warming up overlay network on worker: ${NET_NAME}"
if docker ps --format '{{.Names}}' | grep -q "^${WARMUP_NAME}$"; then
echo "[NET] warmup container already running: ${WARMUP_NAME}"
else
docker image inspect "$WARMUP_IMAGE" >/dev/null 2>&1 || docker pull "$WARMUP_IMAGE"
set +e
docker run -d --rm \
--name "$WARMUP_NAME" \
--network "$NET_NAME" \
"$WARMUP_IMAGE" sleep "$WARMUP_SECONDS"
rc=$?
set -e
if [[ $rc -ne 0 ]]; then
echo "[ERR] failed to start warmup container on network ${NET_NAME}. Is the overlay created with --attachable on manager?" >&2
exit 1
fi
fi
echo "[NET] waiting for local engine to see network (${NET_NAME})"
for i in {1..60}; do
if docker network inspect "$NET_NAME" >/dev/null 2>&1; then
echo "[NET] overlay visible locally now. You can run GPU compose."
docker network ls | grep -E "\b${NET_NAME}\b" || true
exit 0
fi
sleep 1
done
echo "[WARN] network still not inspectable locally after 60s, but warmup container is running. Compose may still pass; proceed to run GPU compose and retry if needed." >&2
exit 0


@ -1,73 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
ENV_FILE="$ROOT/.env"; [[ -f "$ENV_FILE" ]] && { set -a; source "$ENV_FILE"; set +a; }
PROM_PORT="${PROMETHEUS_PORT:-9090}"
GRAF_PORT="${GRAFANA_PORT:-3000}"
ok(){ echo "[OK] $*"; }
warn(){ echo "[WARN] $*"; }
err(){ echo "[ERR] $*" >&2; }
fail(){ err "$*"; exit 1; }
GPU_HOST="${GPU_NODE_HOSTNAME:-swarm-metric-gpu-001}"
# 1) nodes.json contains gpu node hostname
NODES_JSON="$ROOT/private-server/argus/metric/prometheus/nodes.json"
if [[ ! -f "$NODES_JSON" ]]; then
warn "nodes.json not found at $NODES_JSON"
else
if jq -e --arg h "$GPU_HOST" '.[] | select(.hostname==$h)' "$NODES_JSON" >/dev/null 2>&1; then
ok "nodes.json contains $GPU_HOST"
else
warn "nodes.json does not list $GPU_HOST"
fi
fi
# 2) Prometheus targets health for :9100 (must) and :9400 (optional)
targets_json="$ROOT/tmp/gpu-verify/targets.json"; mkdir -p "$(dirname "$targets_json")"
if ! curl -fsS "http://127.0.0.1:${PROM_PORT}/api/v1/targets" -o "$targets_json"; then
fail "failed to fetch Prometheus targets"
fi
# derive gpu node overlay IP
GPU_IP=$(docker inspect -f '{{ (index .NetworkSettings.Networks "argus-sys-net").IPAddress }}' argus-metric-gpu-node-swarm 2>/dev/null || true)
must_ok=false
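# An empty GPU_IP would reduce the filter to ":9100" and match any node, so require a resolved IP first.
if [[ -n "$GPU_IP" ]] && jq -e --arg ip "$GPU_IP" '.data.activeTargets[] | select(.scrapeUrl | contains($ip+":9100")) | select(.health=="up")' "$targets_json" >/dev/null 2>&1; then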
ok "node-exporter 9100 up for GPU node ($GPU_IP)"
must_ok=true
else
# fallback: any 9100 up
if jq -e '.data.activeTargets[] | select(.scrapeUrl | test(":9100")) | select(.health=="up")' "$targets_json" >/dev/null 2>&1; then
ok "node-exporter 9100 has at least one up target (fallback)"
must_ok=true
else
fail "node-exporter 9100 has no up targets"
fi
fi
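# Same guard as for :9100: skip the per-node match when GPU_IP could not be resolved.
if [[ -n "$GPU_IP" ]] && jq -e --arg ip "$GPU_IP" '.data.activeTargets[] | select(.scrapeUrl | contains($ip+":9400")) | select(.health=="up")' "$targets_json" >/dev/null 2>&1; then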
ok "dcgm-exporter 9400 up for GPU node"
else
if jq -e '.data.activeTargets[] | select(.scrapeUrl | test(":9400")) | select(.health=="up")' "$targets_json" >/dev/null 2>&1; then
ok "dcgm-exporter 9400 has up target (not necessarily GPU node)"
else
warn "dcgm-exporter 9400 down or missing (acceptable in some envs)"
fi
fi
# 3) Quick PromQL sample for DCGM metric (optional)
if curl -fsS "http://127.0.0.1:${PROM_PORT}/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL" -o "$ROOT/tmp/gpu-verify/dcgm.json"; then
if jq -e '.data.result | length > 0' "$ROOT/tmp/gpu-verify/dcgm.json" >/dev/null 2>&1; then
ok "DCGM_FI_DEV_GPU_UTIL has samples"
else
warn "no samples for DCGM_FI_DEV_GPU_UTIL (not blocking)"
fi
fi
echo "[DONE] gpu metric verify"


@ -1,46 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
echo "[E2E] starting full swarm_tests E2E (cleanup -> 00-04 -> restart server/node -> keep env)"
if [[ "${E2E_SKIP_CLEAN:-0}" != "1" ]]; then
echo "[E2E] cleaning previous environment via 99_down.sh"
bash "$SCRIPT_DIR/99_down.sh" || true
else
echo "[E2E] skipping cleanup (E2E_SKIP_CLEAN=1)"
fi
echo "[E2E] running 00_bootstrap"
bash "$SCRIPT_DIR/00_bootstrap.sh"
echo "[E2E] running 01_server_up"
bash "$SCRIPT_DIR/01_server_up.sh"
echo "[E2E] running 02_wait_ready"
bash "$SCRIPT_DIR/02_wait_ready.sh"
echo "[E2E] running 03_nodes_up"
bash "$SCRIPT_DIR/03_nodes_up.sh"
echo "[E2E] baseline 04_metric_verify"
bash "$SCRIPT_DIR/04_metric_verify.sh"
if [[ "${E2E_SKIP_SERVER_RESTART:-0}" != "1" ]]; then
echo "[E2E] server restart + verify"
bash "$SCRIPT_DIR/04_restart_server_and_verify.sh"
else
echo "[E2E] skipping server restart (E2E_SKIP_SERVER_RESTART=1)"
fi
if [[ "${E2E_SKIP_NODE_RESTART:-0}" != "1" ]]; then
echo "[E2E] node restart + verify"
bash "$SCRIPT_DIR/04_restart_node_and_verify.sh"
else
echo "[E2E] skipping node restart (E2E_SKIP_NODE_RESTART=1)"
fi
echo "[E2E] done; environment kept for inspection"
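
The script above repeats one pattern three times: check an `E2E_SKIP_*` variable, then either log a skip message or run the step. A minimal sketch of that pattern as a single function follows. `run_unless_skipped` is a hypothetical name, not part of the repository, and the log wording only approximates the original messages.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not from the repository): run a step unless the
# environment variable named in $1 is set to "1".
#   $1    name of the skip variable (e.g. E2E_SKIP_SERVER_RESTART)
#   $2    human-readable label used in log lines
#   $3... command to run
run_unless_skipped() {
  local skip_var="$1" label="$2"
  shift 2
  # ${!skip_var:-0} is bash indirect expansion with a default of 0.
  if [[ "${!skip_var:-0}" == "1" ]]; then
    echo "[E2E] skipping $label ($skip_var=1)"
    return 0
  fi
  echo "[E2E] $label"
  "$@"
}
```

For example, `run_unless_skipped E2E_SKIP_NODE_RESTART "node restart + verify" bash "$SCRIPT_DIR/04_restart_node_and_verify.sh"` would cover the last conditional. Because the function returns the step's exit status, `set -e` in the caller still aborts the run when a step fails.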

Some files were not shown because too many files have changed in this diff.