[#50 ] Build ARM images with buildx on an x86_64 host; rebuild the core/web/prom/grafana and CPU node images; get the swarm test passing on ARM; temporarily remove the dcgm and log-path features

[#49 ] Improve the swarm test to support automatic reboot and verify
[#49 ] The swarm test reboot test passes
2025-11-24 16:31:39 +08:00 · 2025-11-20 15:21:18 +08:00 · 2025-11-19 17:26:26 +08:00 · 2025-11-18 15:10:38 +08:00 · 2025-11-17 12:17:11 +08:00 · 2025-11-14 16:43:34 +08:00
494 changed files with 49335 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1 @@
 src/metric/client-plugins/all-in-one-full/plugins/*/bin/* filter=lfs diff=lfs merge=lfs -text
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
 .idea/
--- a/README.md
+++ b/README.md
@@ -5,3 +5,10 @@
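 Project documentation: [Tencent Docs] GPU Cluster Operations System
 https://docs.qq.com/doc/DQUxDdmhIZ1dpeERk
 ## Build account configuration
 The UID/GID of the account used to build and run the images can be configured in `configs/build_user.conf`; see `doc/build-user-config.md` for details.
 ## Local port conflicts
 To run the BIND module end-to-end tests when port 53 on the host is already in use, set the environment variable `HOST_DNS_PORT` (default 1053) to choose the externally mapped port, for example `HOST_DNS_PORT=12053 ./scripts/00_e2e_test.sh`.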
--- a/build/README.md
+++ b/build/README.md
@@ -0,0 +1,150 @@
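 # ARGUS unified build script guide (build/build_images.sh)
 This directory provides a single entry-point script, `build/build_images.sh`, which covers three common scenarios:
 - System integration tests (src/sys/tests)
 - Swarm system integration tests (src/sys/swarm_tests)
 - Building offline installation packages (deployment_new: Server/Client-GPU)
 This document also describes the UID/GID resolution rules, the image tag policy, the common options, and the retry mechanism.
 ## Prerequisites
 - Docker Engine ≥ 20.10 (≥ 23.x/24.x recommended)
 - Docker Compose v2 (the `docker compose` subcommand)
 - Optional: an intranet build mirror (`--intranet`)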
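 ## UID/GID rules (for the in-container user and volume ownership)
 - Non-pkg builds (core/master/metric/web/alert/sys/gpu_bundle/cpu_bundle):
  - Read `configs/build_user.local.conf` → `configs/build_user.conf`, in that order of priority;
  - Can be overridden by the environment variables `ARGUS_BUILD_UID` and `ARGUS_BUILD_GID`;
 - pkg builds (`--only server_pkg`, `--only client_pkg`):
  - Read `configs/build_user.pkg.conf` (preferred) → `build_user.local.conf` → `build_user.conf`;
  - Can be overridden by the same environment variables;
 - The CPU bundle explicitly follows the non-pkg chain (it does not read `build_user.pkg.conf`).
 - Note: Docker layers that depend only on UID/GID are rebuilt automatically when those arguments change, so switching between build profiles does not produce a package with the wrong ownership.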
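The lookup order above can be sketched as a small shell function. This is a minimal illustration that assumes each arrow means "use the first file that exists"; `resolve_build_user_conf` is a hypothetical name, and the real logic lives in `scripts/common/build_user.sh`, which may differ.

```shell
# Hypothetical sketch: print the first build-user config file that exists.
# profile: "pkg" for server_pkg/client_pkg builds, anything else for non-pkg builds.
# The function body runs in a subshell, so its variables do not leak to the caller.
resolve_build_user_conf() (
  profile=$1
  config_dir=$2
  # Non-pkg chain: local override first, then the shared default.
  candidates="build_user.local.conf build_user.conf"
  if [ "$profile" = "pkg" ]; then
    # pkg builds consult build_user.pkg.conf before the common chain.
    candidates="build_user.pkg.conf $candidates"
  fi
  for name in $candidates; do
    if [ -f "$config_dir/$name" ]; then
      printf '%s\n' "$config_dir/$name"
      exit 0
    fi
  done
  echo "no build user config found in $config_dir" >&2
  exit 1
)
```

For example, `resolve_build_user_conf pkg configs` prints `configs/build_user.pkg.conf` when that file exists. Applying the `ARGUS_BUILD_UID`/`ARGUS_BUILD_GID` overrides is left out of the sketch.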
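 ## Image tag policy
 - Non-pkg builds: output `:latest` by default.
 - `--only server_pkg`: every image is tagged `:<VERSION>` directly (`:latest` is not overwritten).
 - `--only client_pkg`: the GPU bundle is tagged `:<VERSION>` only (`:latest` is not overwritten).
 - `--only cpu_bundle`: tagged `:<VERSION>` only by default; add `--tag-latest` to also tag `:latest` for compatibility with the default swarm_tests compose file.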
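As a rough illustration of the policy above, the following sketch maps a build target to the tags it would produce. `tags_for_target` is a hypothetical helper, not part of `build_images.sh`, and it only covers the targets listed in this section.

```shell
# Hypothetical sketch: print one tag per line for a given build target.
# Usage: tags_for_target <target> <version> [tag_latest=true|false]
tags_for_target() (
  target=$1
  version=$2
  tag_latest=${3:-false}
  case "$target" in
    core|master|metric|web|alert|sys)
      # Non-pkg builds output :latest.
      echo "latest"
      ;;
    server_pkg|client_pkg)
      # pkg builds output only the version tag and leave :latest alone.
      echo "$version"
      ;;
    cpu_bundle)
      echo "$version"
      if [ "$tag_latest" = "true" ]; then
        echo "latest"
      fi
      ;;
    *)
      echo "unknown target: $target" >&2
      exit 1
      ;;
  esac
)
```

Calling `tags_for_target cpu_bundle 20251114 true` prints both `20251114` and `latest`, which corresponds to passing `--tag-latest` on the command line.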
 ## Default build targets without --only
 When `--only` is not given, the script builds the base image set (no bundles and no installation packages):
 - core: `argus-elasticsearch:latest`, `argus-kibana:latest`, `argus-bind9:latest`
 - master: `argus-master:latest` (non-offline)
 - metric: `argus-metric-ftp:latest`, `argus-metric-prometheus:latest`, `argus-metric-grafana:latest`
 - web: `argus-web-frontend:latest`, `argus-web-proxy:latest`
 - alert: `argus-alertmanager:latest`
 - sys: `argus-sys-node:latest`, `argus-sys-metric-test-node:latest`, `argus-sys-metric-test-gpu-node:latest`
 Note: the default tag is `:latest`; UID/GID follow the non-pkg chain (`build_user.local.conf → build_user.conf`, overridable via environment variables).
 ## Common options
 - `--intranet`: use the intranet build arguments (enabled per Dockerfile as needed).
 - `--no-cache`: disable the Docker layer cache.
 - `--only <list>`: comma-separated targets, e.g. `--only core,master,metric,web,alert`.
 - `--version YYYYMMDD`: date tag for bundles/pkgs (required for cpu_bundle/gpu_bundle/server_pkg/client_pkg).
 - `--client-semver X.Y.Z`: semantic version of the all-in-one-full client (optional).
 - `--cuda VER`: CUDA base image version for the GPU bundle (default 12.2.2).
 - `--tag-latest`: also tag `:latest` when building the CPU bundle.
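 ## Automatic retries
 - A failed single-image build is retried automatically (up to 3 attempts in total by default, 5s apart).
 - The final attempt automatically runs with `DOCKER_BUILDKIT=0` to mitigate "failed to receive status: context canceled".
 - Tunable via the `ARGUS_BUILD_RETRIES` and `ARGUS_BUILD_RETRY_DELAY` environment variables.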
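The retry behaviour described above could look roughly like the sketch below. `retry_build` is a hypothetical wrapper around an arbitrary command; it is not the script's `build_image` function, only an illustration of the same idea: retry with a delay, and run the last attempt with `DOCKER_BUILDKIT=0`.

```shell
# Hypothetical sketch: run a command up to ARGUS_BUILD_RETRIES times.
# The last attempt is executed with DOCKER_BUILDKIT=0 in its environment.
# The function body runs in a subshell, so its variables do not leak to the caller.
retry_build() (
  tries=${ARGUS_BUILD_RETRIES:-3}
  delay=${ARGUS_BUILD_RETRY_DELAY:-5}
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if [ "$attempt" -eq "$tries" ]; then
      # Final attempt: disable BuildKit for this one invocation only.
      if env DOCKER_BUILDKIT=0 "$@"; then
        exit 0
      fi
    else
      if "$@"; then
        exit 0
      fi
    fi
    # Wait between attempts, but not after the last one.
    if [ "$attempt" -lt "$tries" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  exit 1
)
```

A call such as `retry_build docker build -t argus-bind9:latest -f src/bind/build/Dockerfile .` would then follow the same attempt pattern. Passing the variable through `env` keeps the override scoped to the final attempt and avoids re-parsing the command line.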
 ---
 ## Scenario 1: system integration tests (src/sys/tests)
 Builds the images used for system-level end-to-end tests (default `:latest`).
 Example:
 ```
 # Build the core images and supporting services
 ./build/build_images.sh --only core,master,metric,web,alert,sys
 ```
 Output:
 - Local images: `argus-elasticsearch:latest`, `argus-kibana:latest`, `argus-master:latest`, `argus-metric-ftp:latest`, `argus-metric-prometheus:latest`, `argus-metric-grafana:latest`, `argus-alertmanager:latest`, `argus-web-frontend:latest`, `argus-web-proxy:latest`, `argus-sys-node:latest`, and others.
 Notes:
 - UID/GID are read from `build_user.local.conf → build_user.conf` (or overridden by environment variables).
 - See `src/sys/tests/README.md` for how to run sys/tests.
 ---
 ## Scenario 2: Swarm system integration tests (src/sys/swarm_tests)
 Requires the server images plus the CPU node bundle image.
 Steps:
 1) Build the server images (default `:latest`)
 ```
 ./build/build_images.sh --only core,master,metric,web,alert
 ```
 2) Build the CPU bundle (directly FROM ubuntu:22.04)
 ```
 # Version tag only
 ./build/build_images.sh --only cpu_bundle --version 20251114
 # For compatibility with the swarm_tests default of latest:
 ./build/build_images.sh --only cpu_bundle --version 20251114 --tag-latest
 ```
 3) Run the Swarm tests
 ```
 cd src/sys/swarm_tests
 # If latest was not tagged, set the image first:
 export NODE_BUNDLE_IMAGE_TAG=argus-sys-metric-test-node-bundle:20251114
 ./scripts/01_server_up.sh
 ./scripts/02_wait_ready.sh
 ./scripts/03_nodes_up.sh
 ./scripts/04_metric_verify.sh   # Verify Prometheus/Grafana/nodes.json and the log pipeline
 ./scripts/99_down.sh            # Tear down
 ```
 Output:
 - Local images: `argus-*:latest` and `argus-sys-metric-test-node-bundle:20251114` (or latest).
 - `swarm_tests/private-*`: persisted runtime files.
 Notes:
 - The CPU bundle build user follows the non-pkg chain (local.conf → conf).
 - `04_metric_verify.sh` has built-in Fluent Bit startup and config-fix logic; if the stack is occasionally not ready, rerunning the script once is normally enough.
 ---
 ## Scenario 3: building offline installation packages (deployment_new)
 Both the Server and Client-GPU packages are built version-direct: only the `:<VERSION>` tag is produced and `:latest` is left untouched.
 1) Server package
 ```
 ./build/build_images.sh --only server_pkg --version 20251114
 ```
 Output:
 - Local images: `argus-<module>:20251114` (latest is not touched).
 - Package: `deployment_new/artifact/server/20251114/` and `server_20251114.tar.gz`
 - Package contents: per-image tar.gz files, compose/.env.example, scripts (config/install/selfcheck/diagnose, etc.), docs, manifest/checksums.
 2) Client-GPU package
 ```
 # Also builds the GPU bundle (:<VERSION> only, latest untouched) and generates the client package
 ./build/build_images.sh --only client_pkg --version 20251114 \
  --client-semver 1.44.0 --cuda 12.2.2
 ```
 Output:
 - Local image: `argus-sys-metric-test-node-bundle-gpu:20251114`
 - Package: `deployment_new/artifact/client_gpu/20251114/` and `client_gpu_20251114.tar.gz`
 - Package contents: GPU bundle image tar.gz, busybox.tar, compose/.env.example, scripts (config/install/uninstall), docs, manifest/checksums.
 Notes:
 - pkg builds use the UID/GID from `configs/build_user.pkg.conf` (overridable via environment variables).
 - `PKG_VERSION=<VERSION>` in the packaged `.env.example` matches the image tag exactly.
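One way to check the last point by hand is a small shell helper that compares `PKG_VERSION` in an extracted `.env.example` with the expected version. `check_pkg_version` and the file path passed to it are assumptions for illustration; the packaging scripts may already perform an equivalent check.

```shell
# Hypothetical sketch: succeed only if PKG_VERSION in the env file equals the expected tag.
# Usage: check_pkg_version <path-to-.env.example> <version>
check_pkg_version() (
  env_file=$1
  expected=$2
  # Take the value of the first PKG_VERSION= line, if any.
  actual=$(sed -n 's/^PKG_VERSION=//p' "$env_file" | head -n 1)
  if [ "$actual" = "$expected" ]; then
    exit 0
  fi
  echo "PKG_VERSION mismatch: expected '$expected', found '$actual'" >&2
  exit 1
)
```

The helper returns a non-zero status on mismatch, so it can gate a later step, for instance before loading the image tarballs from the package.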
 ---
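 ## FAQ
 - The build fails with `failed to receive status: context canceled`?
  - Per-image retries are built in, and the final attempt disables BuildKit; try again with `--intranet` and `--no-cache`, or run `docker builder prune -f` first.
 - If I run a non-pkg build (latest) and then a pkg build (version), can the package end up with the wrong contents?
  - No. Layers that involve UID/GID are rebuilt because the arguments change, other layers are reused from the cache, and the final pkg artifacts get their owner and runtime account from `build_user.pkg.conf`.
 - swarm_tests pulls `:latest` by default, but I only built a `:<VERSION>` CPU bundle. What now?
  - Run `export NODE_BUNDLE_IMAGE_TAG=argus-sys-metric-test-node-bundle:<VERSION>` before starting, or add `--tag-latest` when building.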
 ---
 Further automation, for example generating a BUILD_SUMMARY.txt that records image digests and build parameters, could be added at the pkg output stage.
--- a/build/build_images.sh
+++ b/build/build_images.sh
@@ -0,0 +1,870 @@
 #!/usr/bin/env bash
 set -euo pipefail
 show_help() {
  cat <<EOF
 ARGUS Unified Build System - Image Build Tool
 Usage: $0 [OPTIONS]
 Options:
  --intranet        Use intranet mirror for log/bind builds
  --master-offline  Build master offline image (requires src/master/offline_wheels.tar.gz)
  --metric          Build metric module images (ftp, prometheus, grafana, test nodes)
  --no-cache        Build all images without using Docker layer cache
  --only LIST       Comma-separated targets to build: core,master,metric,web,alert,sys,gpu_bundle,cpu_bundle,server_pkg,client_pkg,all
  --version DATE    Date tag (YYYYMMDD) required by cpu_bundle/gpu_bundle/server_pkg/client_pkg (e.g. 20251112)
  --client-semver X.Y.Z  Override client semver used in all-in-one-full artifact (optional)
  --cuda VER        CUDA runtime version for NVIDIA base (default: 12.2.2)
  --tag-latest      Also tag bundle image as :latest (for cpu_bundle only; default off)
  -h, --help        Show this help message
 Examples:
  $0                             # Build with default sources
  $0 --intranet                  # Build with intranet mirror
  $0 --master-offline            # Additionally build argus-master:offline
  $0 --metric                    # Additionally build metric module images
  $0 --intranet --master-offline --metric
 EOF
 }
 use_intranet=false
 build_core=true
 build_master=true
 build_master_offline=false
 build_metric=true
 build_web=true
 build_alert=true
 build_sys=true
 build_gpu_bundle=false
 build_cpu_bundle=false
 build_server_pkg=false
 build_client_pkg=false
 no_cache=false
 bundle_date=""
 client_semver=""
 cuda_ver="12.2.2"
 DEFAULT_IMAGE_TAG="latest"
 tag_latest=false
 while [[ $# -gt 0 ]]; do
  case $1 in
    --intranet)
      use_intranet=true
      shift
      ;;
    --master)
      build_master=true
      shift
      ;;
    --master-offline)
      build_master=true
      build_master_offline=true
      shift
      ;;
    --metric)
      build_metric=true
      shift
      ;;
    --no-cache)
      no_cache=true
      shift
      ;;
    --only)
      if [[ -z ${2:-} ]]; then
        echo "--only requires a target list" >&2; exit 1
      fi
      sel="$2"; shift 2
      # reset all, then enable selected
      build_core=false; build_master=false; build_metric=false; build_web=false; build_alert=false; build_sys=false; build_gpu_bundle=false; build_cpu_bundle=false; build_server_pkg=false; build_client_pkg=false
      IFS=',' read -ra parts <<< "$sel"
      for p in "${parts[@]}"; do
        case "$p" in
          core) build_core=true ;;
          master) build_master=true ;;
          metric) build_metric=true ;;
          web) build_web=true ;;
          alert) build_alert=true ;;
          sys) build_sys=true ;;
          gpu_bundle) build_gpu_bundle=true ;;
          cpu_bundle) build_cpu_bundle=true ;;
          server_pkg) build_server_pkg=true; build_core=true; build_master=true; build_metric=true; build_web=true; build_alert=true ;;
          client_pkg) build_client_pkg=true ;;
          all) build_core=true; build_master=true; build_metric=true; build_web=true; build_alert=true; build_sys=true ;;
          *) echo "Unknown --only target: $p" >&2; exit 1 ;;
        esac
      done
      ;;
    --version)
      if [[ -z ${2:-} ]]; then echo "--version requires a value like 20251112" >&2; exit 1; fi
      bundle_date="$2"; shift 2
      ;;
    --client-semver)
      if [[ -z ${2:-} ]]; then echo "--client-semver requires a value like 1.43.0" >&2; exit 1; fi
      client_semver="$2"; shift 2
      ;;
    --cuda)
      if [[ -z ${2:-} ]]; then echo "--cuda requires a value like 12.2.2" >&2; exit 1; fi
      cuda_ver="$2"; shift 2
      ;;
    --tag-latest)
      tag_latest=true
      shift
      ;;
    -h|--help)
      show_help
      exit 0
      ;;
    *)
      echo "Unknown option: $1" >&2
      show_help
      exit 1
      ;;
  esac
 done
 root="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 . "$root/scripts/common/build_user.sh"
 declare -a build_args=()
 if [[ "$use_intranet" == true ]]; then
  build_args+=("--build-arg" "USE_INTRANET=true")
 fi
 cd "$root"
 # Set default image tag policy before building
 if [[ "$build_server_pkg" == true ]]; then
  DEFAULT_IMAGE_TAG="${bundle_date:-latest}"
 fi
 # Select build user profile for pkg vs default
 if [[ "$build_server_pkg" == true || "$build_client_pkg" == true ]]; then
  export ARGUS_BUILD_PROFILE=pkg
 fi
 load_build_user
 build_args+=("--build-arg" "ARGUS_BUILD_UID=${ARGUS_BUILD_UID}" "--build-arg" "ARGUS_BUILD_GID=${ARGUS_BUILD_GID}")
 if [[ "$no_cache" == true ]]; then
  build_args+=("--no-cache")
 fi
 master_root="$root/src/master"
 master_offline_tar="$master_root/offline_wheels.tar.gz"
 master_offline_dir="$master_root/offline_wheels"
 if [[ "$build_master_offline" == true ]]; then
  if [[ ! -f "$master_offline_tar" ]]; then
    echo "❌ offline wheels tar not found: $master_offline_tar" >&2
    echo "   Prepare offline_wheels.tar.gz before running --master-offline" >&2
    exit 1
  fi
  echo "📦 Preparing offline wheels for master (extracting $master_offline_tar)"
  rm -rf "$master_offline_dir"
  mkdir -p "$master_offline_dir"
  tar -xzf "$master_offline_tar" -C "$master_root"
  has_wheel=$(find "$master_offline_dir" -maxdepth 1 -type f -name '*.whl' -print -quit)
  if [[ -z "$has_wheel" ]]; then
    echo "❌ offline_wheels extraction failed or no wheel found: $master_offline_dir" >&2
    exit 1
  fi
 fi
 echo "======================================="
 echo "ARGUS Unified Build System"
 echo "======================================="
 if [[ "$use_intranet" == true ]]; then
  echo "🌐 Mode: Intranet (Using internal mirror: 10.68.64.1)"
 else
  echo "🌐 Mode: Public (Using default package sources)"
 fi
 echo "👤 Build user UID:GID -> ${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}"
 echo "📁 Build context: $root"
 echo ""
 build_image() {
  local image_name=$1
  local dockerfile_path=$2
  local tag=$3
  local context="."
  shift 3
  if [[ $# -gt 0 ]]; then
    context=$1
    shift
  fi
  local extra_args=("$@")
  echo "🔄 Building $image_name image..."
  echo "   Dockerfile: $dockerfile_path"
  echo "   Tag: $tag"
  echo "   Context: $context"
  local tries=${ARGUS_BUILD_RETRIES:-3}
  local delay=${ARGUS_BUILD_RETRY_DELAY:-5}
  local attempt=1
  while (( attempt <= tries )); do
    local prefix=""
    if (( attempt == tries )); then
      # final attempt: disable BuildKit to avoid docker/dockerfile front-end pulls
      prefix="DOCKER_BUILDKIT=0"
      echo "   Attempt ${attempt}/${tries} (fallback: DOCKER_BUILDKIT=0)"
    else
      echo "   Attempt ${attempt}/${tries}"
    fi
    # Run via env instead of eval so arguments containing spaces are not re-parsed
    if env ${prefix:+"$prefix"} docker build "${build_args[@]}" "${extra_args[@]}" -f "$dockerfile_path" -t "$tag" "$context"; then
      echo "✅ $image_name image built successfully"
      return 0
    fi
    echo "⚠️  Build failed for $image_name (attempt ${attempt}/${tries})."
    if (( attempt < tries )); then
      echo "   Retrying in ${delay}s..."
      sleep "$delay"
    fi
    attempt=$((attempt+1))
  done
  echo "❌ Failed to build $image_name image after ${tries} attempts"
  return 1
 }
 pull_base_image() {
  local image_ref=$1
  local attempts=${2:-3}
  local delay=${3:-5}
  # If the image already exists locally, skip pulling.
  if docker image inspect "$image_ref" >/dev/null 2>&1; then
    echo "   Local image present; skip pull: $image_ref"
    return 0
  fi
  for ((i=1; i<=attempts; i++)); do
    echo "   Pulling base image ($i/$attempts): $image_ref"
    if docker pull "$image_ref" >/dev/null; then
      echo "   Base image ready: $image_ref"
      return 0
    fi
    echo "   Pull failed: $image_ref"
    if (( i < attempts )); then
      echo "   Retrying in ${delay}s..."
      sleep "$delay"
    fi
  done
  echo "❌ Unable to pull base image after ${attempts} attempts: $image_ref"
  return 1
 }
 images_built=()
 build_failed=false
 build_gpu_bundle_image() {
  local date_tag="$1"      # e.g. 20251112
  local cuda_ver_local="$2" # e.g. 12.2.2
  local client_ver="$3"     # semver like 1.43.0
  if [[ -z "$date_tag" ]]; then
    echo "❌ gpu_bundle requires --version YYYYMMDD (e.g. 20251112)" >&2
    return 1
  fi
  # sanitize cuda version (trim trailing dots like '12.2.')
  while [[ "$cuda_ver_local" == *"." ]]; do cuda_ver_local="${cuda_ver_local%.}"; done
  # Resolve effective CUDA base tag
  local resolve_cuda_base_tag
  resolve_cuda_base_tag() {
    local want="$1" # can be 12, 12.2 or 12.2.2
    local major minor patch
    if [[ "$want" =~ ^([0-9]+)\.([0-9]+)\.([0-9]+)$ ]]; then
      major="${BASH_REMATCH[1]}"; minor="${BASH_REMATCH[2]}"; patch="${BASH_REMATCH[3]}"
      echo "nvidia/cuda:${major}.${minor}.${patch}-runtime-ubuntu22.04"; return 0
    elif [[ "$want" =~ ^([0-9]+)\.([0-9]+)$ ]]; then
      major="${BASH_REMATCH[1]}"; minor="${BASH_REMATCH[2]}"
      # try to find best local patch for major.minor
      local best
      best=$(docker images --format '{{.Repository}}:{{.Tag}}' nvidia/cuda 2>/dev/null | \
             grep -E "^nvidia/cuda:${major}\.${minor}\\.[0-9]+-runtime-ubuntu22\.04$" | \
             sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.)([0-9]+)-runtime-ubuntu22\.04$#\1\2#g' | \
             sort -V | tail -n1 || true)
      if [[ -n "$best" ]]; then
        echo "nvidia/cuda:${best}-runtime-ubuntu22.04"; return 0
      fi
      # fallback patch if none local
      echo "nvidia/cuda:${major}.${minor}.2-runtime-ubuntu22.04"; return 0
    elif [[ "$want" =~ ^([0-9]+)$ ]]; then
      major="${BASH_REMATCH[1]}"
      # try to find best local for this major
      local best
      best=$(docker images --format '{{.Repository}}:{{.Tag}}' nvidia/cuda 2>/dev/null | \
             grep -E "^nvidia/cuda:${major}\\.[0-9]+\\.[0-9]+-runtime-ubuntu22\.04$" | \
             sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.[0-9]+)-runtime-ubuntu22\.04$#\1#g' | \
             sort -V | tail -n1 || true)
      if [[ -n "$best" ]]; then
        echo "nvidia/cuda:${best}-runtime-ubuntu22.04"; return 0
      fi
      echo "nvidia/cuda:${major}.2.2-runtime-ubuntu22.04"; return 0
    else
      # invalid format, fallback to default
      echo "nvidia/cuda:12.2.2-runtime-ubuntu22.04"; return 0
    fi
  }
  local base_image
  base_image=$(resolve_cuda_base_tag "$cuda_ver_local")
  echo
  echo "🔧 Preparing one-click GPU bundle build"
  echo "   CUDA runtime base: ${base_image}"
  echo "   Bundle tag       : ${date_tag}"
  # 1) Ensure NVIDIA base image (skip pull if local)
  if ! pull_base_image "$base_image"; then
    # try once more with default if resolution failed
    if ! pull_base_image "nvidia/cuda:12.2.2-runtime-ubuntu22.04"; then
      return 1
    else
      base_image="nvidia/cuda:12.2.2-runtime-ubuntu22.04"
    fi
  fi
  # 2) Build latest argus-agent from source
  echo
  echo "🛠  Building argus-agent from src/agent"
  pushd "$root/src/agent" >/dev/null
  if ! bash scripts/build_binary.sh; then
    echo "❌ argus-agent build failed" >&2
    popd >/dev/null
    return 1
  fi
  if [[ ! -f "dist/argus-agent" ]]; then
    echo "❌ argus-agent binary missing after build" >&2
    popd >/dev/null
    return 1
  fi
  popd >/dev/null
  # 3) Inject agent into all-in-one-full plugin and package artifact
  local aio_root="$root/src/metric/client-plugins/all-in-one-full"
  local agent_bin_src="$root/src/agent/dist/argus-agent"
  local agent_bin_dst="$aio_root/plugins/argus-agent/bin/argus-agent"
  echo
  echo "📦 Updating all-in-one-full agent binary → $agent_bin_dst"
  cp -f "$agent_bin_src" "$agent_bin_dst"
  chmod +x "$agent_bin_dst" || true
  pushd "$aio_root" >/dev/null
  local prev_version
  prev_version="$(cat config/VERSION 2>/dev/null || echo "1.0.0")"
  local use_version="$prev_version"
  if [[ -n "$client_semver" ]]; then
    echo "${client_semver}" > config/VERSION
    use_version="$client_semver"
  fi
  echo "   Packaging all-in-one-full artifact version: $use_version"
  if ! bash scripts/package_artifact.sh --force; then
    echo "❌ package_artifact.sh failed" >&2
    # restore VERSION if changed
    if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
    popd >/dev/null
    return 1
  fi
  local artifact_dir="$aio_root/artifact/$use_version"
  local artifact_tar
  artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  if [[ -z "$artifact_tar" ]]; then
    echo "   No argus-metric_*.tar.gz found; invoking publish_artifact.sh to assemble..."
    local owner="$(id -u):$(id -g)"
    if ! bash scripts/publish_artifact.sh "$use_version" --output-dir "$artifact_dir" --owner "$owner"; then
      echo "❌ publish_artifact.sh failed" >&2
      if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
      popd >/dev/null
      return 1
    fi
    artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  fi
  if [[ -z "$artifact_tar" ]]; then
    echo "❌ artifact tar not found under $artifact_dir" >&2
    if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
    popd >/dev/null
    return 1
  fi
  # restore VERSION if changed (keep filesystem clean)
  if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
  popd >/dev/null
  # 4) Stage docker build context
  local bundle_ctx="$root/src/bundle/gpu-node-bundle/.build-$date_tag"
  echo
  echo "🧰 Staging docker build context: $bundle_ctx"
  rm -rf "$bundle_ctx"
  mkdir -p "$bundle_ctx/bundle" "$bundle_ctx/private"
  cp "$root/src/bundle/gpu-node-bundle/Dockerfile" "$bundle_ctx/"
  cp "$root/src/bundle/gpu-node-bundle/node-bootstrap.sh" "$bundle_ctx/"
  cp "$root/src/bundle/gpu-node-bundle/health-watcher.sh" "$bundle_ctx/"
  # bundle tar
  cp "$artifact_tar" "$bundle_ctx/bundle/"
  # offline fluent-bit assets (optional but useful)
  if [[ -d "$root/src/log/fluent-bit/build/etc" ]]; then
    cp -r "$root/src/log/fluent-bit/build/etc" "$bundle_ctx/private/"
  fi
  if [[ -d "$root/src/log/fluent-bit/build/packages" ]]; then
    cp -r "$root/src/log/fluent-bit/build/packages" "$bundle_ctx/private/"
  fi
  if [[ -f "$root/src/log/fluent-bit/build/start-fluent-bit.sh" ]]; then
    cp "$root/src/log/fluent-bit/build/start-fluent-bit.sh" "$bundle_ctx/private/"
  fi
  # 5) Build the final bundle image (directly from NVIDIA base)
  local image_tag="argus-sys-metric-test-node-bundle-gpu:${date_tag}"
  echo
  echo "🔄 Building GPU Bundle image"
  if build_image "GPU Bundle" "$bundle_ctx/Dockerfile" "$image_tag" "$bundle_ctx" \
      --build-arg CUDA_VER="$(echo "$base_image" | sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.[0-9]+)-runtime-ubuntu22\.04$#\1#')" \
      --build-arg CLIENT_VER="$use_version" \
      --build-arg BUNDLE_DATE="$date_tag"; then
    images_built+=("$image_tag")
    # In non-pkg mode, also tag latest for convenience
    if [[ "${ARGUS_PKG_BUILD:-0}" != "1" ]]; then
      docker tag "$image_tag" argus-sys-metric-test-node-bundle-gpu:latest >/dev/null 2>&1 || true
    fi
    return 0
  else
    return 1
  fi
 }
 # Tag helper: ensure :<date_tag> exists for a list of repos
 ensure_version_tags() {
  local date_tag="$1"; shift
  local repos=("$@")
  for repo in "${repos[@]}"; do
    if docker image inspect "$repo:$date_tag" >/dev/null 2>&1; then
      :
    elif docker image inspect "$repo:latest" >/dev/null 2>&1; then
      docker tag "$repo:latest" "$repo:$date_tag" || true
    else
      echo "❌ missing image for tagging: $repo (need :latest or :$date_tag)" >&2
      return 1
    fi
  done
  return 0
 }
 # Build server package after images are built
 build_server_pkg_bundle() {
  local date_tag="$1"
  if [[ -z "$date_tag" ]]; then
    echo "❌ server_pkg requires --version YYYYMMDD" >&2
    return 1
  fi
  local repos=(
    argus-bind9 argus-master argus-elasticsearch argus-kibana \
    argus-metric-ftp argus-metric-prometheus argus-metric-grafana \
    argus-alertmanager argus-web-frontend argus-web-proxy
  )
  echo
  echo "🔖 Verifying server images with :$date_tag and collecting digests"
  for repo in "${repos[@]}"; do
    if ! docker image inspect "$repo:$date_tag" >/dev/null 2>&1; then
      echo "❌ required image missing: $repo:$date_tag (build phase should have produced it)" >&2
      return 1
    fi
  done
  # Optional: show digests
  for repo in "${repos[@]}"; do
    local digest
    digest=$(docker images --digests --format '{{.Repository}}:{{.Tag}} {{.Digest}}' | awk -v r="$repo:$date_tag" '$1==r{print $2}' | head -n1)
    printf '   • %s@%s\n' "$repo:$date_tag" "${digest:-<none>}"
  done
  echo
  echo "📦 Building server package via deployment_new/build/make_server_package.sh --version $date_tag"
  if ! "$root/deployment_new/build/make_server_package.sh" --version "$date_tag"; then
    echo "❌ make_server_package.sh failed" >&2
    return 1
  fi
  return 0
 }
 # Build client package: ensure gpu bundle image exists, then package client_gpu
 build_client_pkg_bundle() {
  local date_tag="$1"
  local semver="$2"
  local cuda="$3"
  if [[ -z "$date_tag" ]]; then
    echo "❌ client_pkg requires --version YYYYMMDD" >&2
    return 1
  fi
  local bundle_tag="argus-sys-metric-test-node-bundle-gpu:${date_tag}"
  if ! docker image inspect "$bundle_tag" >/dev/null 2>&1; then
    echo
    echo "🧩 GPU bundle image $bundle_tag missing; building it first..."
    ARGUS_PKG_BUILD=1
    export ARGUS_PKG_BUILD
    if ! build_gpu_bundle_image "$date_tag" "$cuda" "$semver"; then
      return 1
    fi
  else
    echo
    echo "✅ Using existing GPU bundle image: $bundle_tag"
  fi
  echo
  echo "📦 Building client GPU package via deployment_new/build/make_client_gpu_package.sh --version $date_tag --image $bundle_tag"
  if ! "$root/deployment_new/build/make_client_gpu_package.sh" --version "$date_tag" --image "$bundle_tag"; then
    echo "❌ make_client_gpu_package.sh failed" >&2
    return 1
  fi
  return 0
 }
 # Build CPU bundle image directly FROM ubuntu:22.04 (no intermediate base)
 build_cpu_bundle_image() {
  local date_tag="$1"         # e.g. 20251113
  local client_ver_in="$2"    # semver like 1.43.0 (optional)
  local want_tag_latest="$3"   # true/false
  if [[ -z "$date_tag" ]]; then
    echo "❌ cpu_bundle requires --version YYYYMMDD" >&2
    return 1
  fi
  echo
  echo "🔧 Preparing one-click CPU bundle build"
  echo "   Base: ubuntu:22.04"
  echo "   Bundle tag: ${date_tag}"
  # 1) Build latest argus-agent from source
  echo
  echo "🛠  Building argus-agent from src/agent"
  pushd "$root/src/agent" >/dev/null
  if ! bash scripts/build_binary.sh; then
    echo "❌ argus-agent build failed" >&2
    popd >/dev/null
    return 1
  fi
  if [[ ! -f "dist/argus-agent" ]]; then
    echo "❌ argus-agent binary missing after build" >&2
    popd >/dev/null
    return 1
  fi
  popd >/dev/null
  # 2) Inject agent into all-in-one-full plugin and package artifact
  local aio_root="$root/src/metric/client-plugins/all-in-one-full"
  local agent_bin_src="$root/src/agent/dist/argus-agent"
  local agent_bin_dst="$aio_root/plugins/argus-agent/bin/argus-agent"
  echo
  echo "📦 Updating all-in-one-full agent binary → $agent_bin_dst"
  cp -f "$agent_bin_src" "$agent_bin_dst"
  chmod +x "$agent_bin_dst" || true
  pushd "$aio_root" >/dev/null
  local prev_version use_version
  prev_version="$(cat config/VERSION 2>/dev/null || echo "1.0.0")"
  use_version="$prev_version"
  if [[ -n "$client_ver_in" ]]; then
    echo "$client_ver_in" > config/VERSION
    use_version="$client_ver_in"
  fi
  echo "   Packaging all-in-one-full artifact: version=$use_version"
  if ! bash scripts/package_artifact.sh --force; then
    echo "❌ package_artifact.sh failed" >&2
    [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
    popd >/dev/null
    return 1
  fi
  local artifact_dir="$aio_root/artifact/$use_version"
  local artifact_tar
  artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  if [[ -z "$artifact_tar" ]]; then
    echo "   No argus-metric_*.tar.gz found; invoking publish_artifact.sh ..."
    local owner="$(id -u):$(id -g)"
    if ! bash scripts/publish_artifact.sh "$use_version" --output-dir "$artifact_dir" --owner "$owner"; then
      echo "❌ publish_artifact.sh failed" >&2
      [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
      popd >/dev/null
      return 1
    fi
    artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  fi
  # Fail early if no artifact tar exists (mirrors the GPU bundle path)
  if [[ -z "$artifact_tar" ]]; then
    echo "❌ artifact tar not found under $artifact_dir" >&2
    [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
    popd >/dev/null
    return 1
  fi
  [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
  popd >/dev/null
  # 3) Stage docker build context
  local bundle_ctx="$root/src/bundle/cpu-node-bundle/.build-$date_tag"
  echo
  echo "🧰 Staging docker build context: $bundle_ctx"
  rm -rf "$bundle_ctx"
  mkdir -p "$bundle_ctx/bundle" "$bundle_ctx/private"
  cp "$root/src/bundle/cpu-node-bundle/Dockerfile" "$bundle_ctx/"
  cp "$root/src/bundle/cpu-node-bundle/node-bootstrap.sh" "$bundle_ctx/"
  cp "$root/src/bundle/cpu-node-bundle/health-watcher.sh" "$bundle_ctx/"
  # bundle tar
  cp "$artifact_tar" "$bundle_ctx/bundle/"
  # offline fluent-bit assets
  if [[ -d "$root/src/log/fluent-bit/build/etc" ]]; then
    cp -r "$root/src/log/fluent-bit/build/etc" "$bundle_ctx/private/"
  fi
  if [[ -d "$root/src/log/fluent-bit/build/packages" ]]; then
    cp -r "$root/src/log/fluent-bit/build/packages" "$bundle_ctx/private/"
  fi
  if [[ -f "$root/src/log/fluent-bit/build/start-fluent-bit.sh" ]]; then
    cp "$root/src/log/fluent-bit/build/start-fluent-bit.sh" "$bundle_ctx/private/"
  fi
  # 4) Build final bundle image
  local image_tag="argus-sys-metric-test-node-bundle:${date_tag}"
  echo
  echo "🔄 Building CPU Bundle image"
  if build_image "CPU Bundle" "$bundle_ctx/Dockerfile" "$image_tag" "$bundle_ctx"; then
    images_built+=("$image_tag")
    if [[ "$want_tag_latest" == "true" ]]; then
      docker tag "$image_tag" argus-sys-metric-test-node-bundle:latest >/dev/null 2>&1 || true
    fi
    return 0
  else
    return 1
  fi
 }
 if [[ "$build_core" == true ]]; then
  if build_image "Elasticsearch" "src/log/elasticsearch/build/Dockerfile" "argus-elasticsearch:${DEFAULT_IMAGE_TAG}"; then
    images_built+=("argus-elasticsearch:${DEFAULT_IMAGE_TAG}")
  else
    build_failed=true
  fi
  echo ""
  if build_image "Kibana" "src/log/kibana/build/Dockerfile" "argus-kibana:${DEFAULT_IMAGE_TAG}"; then
    images_built+=("argus-kibana:${DEFAULT_IMAGE_TAG}")
  else
    build_failed=true
  fi
  echo ""
  if build_image "BIND9" "src/bind/build/Dockerfile" "argus-bind9:${DEFAULT_IMAGE_TAG}"; then
    images_built+=("argus-bind9:${DEFAULT_IMAGE_TAG}")
  else
    build_failed=true
  fi
 fi
 echo ""
 if [[ "$build_master" == true ]]; then
  echo ""
  echo "🔄 Building Master image..."
  pushd "$master_root" >/dev/null
  master_args=("--tag" "argus-master:${DEFAULT_IMAGE_TAG}")
  if [[ "$use_intranet" == true ]]; then
    master_args+=("--intranet")
  fi
  if [[ "$build_master_offline" == true ]]; then
    master_args+=("--offline")
  fi
  if [[ "$no_cache" == true ]]; then
    master_args+=("--no-cache")
  fi
  if ./scripts/build_images.sh "${master_args[@]}"; then
    if [[ "$build_master_offline" == true ]]; then
      images_built+=("argus-master:offline")
    else
      images_built+=("argus-master:${DEFAULT_IMAGE_TAG}")
    fi
  else
    build_failed=true
  fi
  popd >/dev/null
 fi
 if [[ "$build_metric" == true ]]; then
  echo ""
  echo "Building Metric module images..."
  metric_base_images=(
    "ubuntu:22.04"
    "ubuntu/prometheus:3-24.04_stable"
    "grafana/grafana:11.1.0"
  )
  for base_image in "${metric_base_images[@]}"; do
    if ! pull_base_image "$base_image"; then
      build_failed=true
    fi
  done
  metric_builds=(
    "Metric FTP|src/metric/ftp/build/Dockerfile|argus-metric-ftp:${DEFAULT_IMAGE_TAG}|src/metric/ftp/build"
    "Metric Prometheus|src/metric/prometheus/build/Dockerfile|argus-metric-prometheus:${DEFAULT_IMAGE_TAG}|src/metric/prometheus/build"
    "Metric Grafana|src/metric/grafana/build/Dockerfile|argus-metric-grafana:${DEFAULT_IMAGE_TAG}|src/metric/grafana/build"
  )
  for build_spec in "${metric_builds[@]}"; do
    IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
    if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
      images_built+=("$image_tag")
    else
      build_failed=true
    fi
    echo ""
  done
 fi
 # =======================================
 # Sys (system tests) node images
 # =======================================
 if [[ "$build_sys" == true ]]; then
  echo ""
  echo "Building Sys node images..."
  sys_base_images=(
    "ubuntu:22.04"
    "nvidia/cuda:12.2.2-runtime-ubuntu22.04"
  )
  for base_image in "${sys_base_images[@]}"; do
    if ! pull_base_image "$base_image"; then
      build_failed=true
    fi
  done
  sys_builds=(
    "Sys Node|src/sys/build/node/Dockerfile|argus-sys-node:latest|."
    "Sys Metric Test Node|src/sys/build/test-node/Dockerfile|argus-sys-metric-test-node:latest|."
    "Sys Metric Test GPU Node|src/sys/build/test-gpu-node/Dockerfile|argus-sys-metric-test-gpu-node:latest|."
  )
  for build_spec in "${sys_builds[@]}"; do
    IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
    if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
      images_built+=("$image_tag")
    else
      build_failed=true
    fi
    echo ""
  done
 fi
 # =======================================
 # Web & Alert module images
 # =======================================
 if [[ "$build_web" == true || "$build_alert" == true ]]; then
  echo ""
  echo "Building Web and Alert module images..."
  # Pre-pull commonly used base images for stability
  web_alert_base_images=(
    "node:20"
    "ubuntu:24.04"
  )
  for base_image in "${web_alert_base_images[@]}"; do
    if ! pull_base_image "$base_image"; then
      build_failed=true
    fi
  done
  if [[ "$build_web" == true ]]; then
    web_builds=(
      "Web Frontend|src/web/build_tools/frontend/Dockerfile|argus-web-frontend:${DEFAULT_IMAGE_TAG}|."
      "Web Proxy|src/web/build_tools/proxy/Dockerfile|argus-web-proxy:${DEFAULT_IMAGE_TAG}|."
    )
    for build_spec in "${web_builds[@]}"; do
      IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
      if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
        images_built+=("$image_tag")
      else
        build_failed=true
      fi
      echo ""
    done
  fi
  if [[ "$build_alert" == true ]]; then
    alert_builds=(
      "Alertmanager|src/alert/alertmanager/build/Dockerfile|argus-alertmanager:${DEFAULT_IMAGE_TAG}|."
    )
    for build_spec in "${alert_builds[@]}"; do
      IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
      if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
        images_built+=("$image_tag")
      else
        build_failed=true
      fi
      echo ""
    done
  fi
 fi
 # =======================================
 # One-click GPU bundle (direct NVIDIA base)
 # =======================================
 if [[ "$build_gpu_bundle" == true ]]; then
  echo ""
  echo "Building one-click GPU bundle image..."
  if ! build_gpu_bundle_image "$bundle_date" "$cuda_ver" "$client_semver"; then
    build_failed=true
  fi
 fi
 # =======================================
 # One-click CPU bundle (from ubuntu:22.04)
 # =======================================
 if [[ "$build_cpu_bundle" == true ]]; then
  echo ""
  echo "Building one-click CPU bundle image..."
  if ! build_cpu_bundle_image "${bundle_date}" "${client_semver}" "${tag_latest}"; then
    build_failed=true
  fi
 fi
 # =======================================
 # One-click Server/Client packaging
 # =======================================
 if [[ "$build_server_pkg" == true ]]; then
  echo ""
  echo "🧳 Building one-click Server package..."
  if ! build_server_pkg_bundle "${bundle_date}"; then
    build_failed=true
  fi
 fi
 if [[ "$build_client_pkg" == true ]]; then
  echo ""
  echo "🧳 Building one-click Client-GPU package..."
  if ! build_client_pkg_bundle "${bundle_date}" "${client_semver}" "${cuda_ver}"; then
    build_failed=true
  fi
 fi
 echo "======================================="
 echo "📦 Build Summary"
 echo "======================================="
 if [[ ${#images_built[@]} -gt 0 ]]; then
  echo "✅ Successfully built images:"
  for image in "${images_built[@]}"; do
    echo "   • $image"
  done
 fi
 if [[ "$build_failed" == true ]]; then
  echo ""
  echo "❌ Some images failed to build. Please check the errors above."
  exit 1
 fi
 if [[ "$use_intranet" == true ]]; then
  echo ""
  echo "🌐 Built with intranet mirror configuration"
 fi
 if [[ "$build_master_offline" == true ]]; then
  echo ""
  echo "🧳 Master offline wheels extracted to $master_offline_dir"
 fi
 echo ""
 echo "🚀 Next steps:"
 echo "   ./build/save_images.sh --compress          # export images"
 echo "   cd src/master/tests && MASTER_IMAGE_TAG=argus-master:offline ./scripts/00_e2e_test.sh"
 echo ""
--- a/build/build_images_for_arm.sh
+++ b/build/build_images_for_arm.sh
@ -0,0 +1,935 @@
 #!/usr/bin/env bash
 set -euo pipefail
 export ARGUS_TARGET_ARCH="arm64"
 ARGUS_BUILDX_BUILDER="${ARGUS_BUILDX_BUILDER:-mybuilder}"
 # Auto-load the HTTP/HTTPS proxy configuration (only when the variables are not already set)
 if [[ -z "${HTTP_PROXY:-}" && -z "${http_proxy:-}" ]]; then
  if [[ -f /home/yuyr/.source_http_proxy.sh ]]; then
    # shellcheck disable=SC1090
    source /home/yuyr/.source_http_proxy.sh || true
  fi
 fi
 # Prepare and switch to the specified buildx builder (used to build ARM images on x86_64)
 if command -v docker >/dev/null 2>&1; then
  if docker buildx ls >/dev/null 2>&1; then
    # Create the builder if it does not exist yet (passing through proxy environment variables)
    if ! docker buildx ls | awk '{print $1}' | grep -qx "${ARGUS_BUILDX_BUILDER}"; then
      echo "🔧 Creating buildx builder '${ARGUS_BUILDX_BUILDER}' for ARM builds..."
      create_args=(create --name "${ARGUS_BUILDX_BUILDER}" --driver docker-container)
      if [[ -n "${HTTP_PROXY:-}" ]]; then
        create_args+=(--driver-opt "env.HTTP_PROXY=${HTTP_PROXY}" --driver-opt "env.http_proxy=${HTTP_PROXY}")
      fi
      if [[ -n "${HTTPS_PROXY:-}" ]]; then
        create_args+=(--driver-opt "env.HTTPS_PROXY=${HTTPS_PROXY}" --driver-opt "env.https_proxy=${HTTPS_PROXY}")
      fi
      if [[ -n "${NO_PROXY:-}" ]]; then
        create_args+=(--driver-opt "env.NO_PROXY=${NO_PROXY}" --driver-opt "env.no_proxy=${NO_PROXY}")
      fi
      docker buildx "${create_args[@]}" --bootstrap >/dev/null 2>&1 || true
    fi
    docker buildx use "${ARGUS_BUILDX_BUILDER}" >/dev/null 2>&1 || true
  fi
 fi
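 # Manual equivalent of the builder setup above, as a hedged sketch (not part of the original script):
 #   docker buildx create --name mybuilder --driver docker-container --bootstrap
 #   docker buildx use mybuilder
 # Assumption: cross-building linux/arm64 on an x86_64 host also needs QEMU binfmt handlers,
 # which this script does not install; one common way to register them is
 #   docker run --privileged --rm tonistiigi/binfmt --install arm64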
 show_help() {
  cat <<EOF
 ARGUS Unified Build System - Image Build Tool
 Usage: $0 [OPTIONS]
 Options:
  --intranet        Use intranet mirror for log/bind builds
  --master-offline  Build master offline image (requires src/master/offline_wheels.tar.gz)
  --metric          Build metric module images (ftp, prometheus, grafana, test nodes)
  --no-cache        Build all images without using Docker layer cache
  --only LIST       Comma-separated targets to build: core,master,metric,web,alert,sys,gpu_bundle,cpu_bundle,server_pkg,client_pkg,all
  --version DATE    Date tag used by gpu_bundle/server_pkg/client_pkg (e.g. 20251112)
  --client-semver X.Y.Z  Override client semver used in all-in-one-full artifact (optional)
  --cuda VER        CUDA runtime version for NVIDIA base (default: 12.2.2)
  --tag-latest      Also tag bundle image as :latest (for cpu_bundle only; default off)
  -h, --help        Show this help message
 Examples:
  $0                             # Build with default sources
  $0 --intranet                  # Build with intranet mirror
  $0 --master-offline            # Additionally build argus-master:offline
  $0 --metric                    # Additionally build metric module images
  $0 --intranet --master-offline --metric
 EOF
 }
 use_intranet=false
 build_core=true
 build_master=true
 build_master_offline=false
 build_metric=true
 build_web=true
 build_alert=true
 build_sys=true
 build_gpu_bundle=false
 build_cpu_bundle=false
 build_server_pkg=false
 build_client_pkg=false
 no_cache=false
 bundle_date=""
 client_semver=""
 cuda_ver="12.2.2"
 DEFAULT_IMAGE_TAG="latest"
 tag_latest=false
 while [[ $# -gt 0 ]]; do
  case $1 in
    --intranet)
      use_intranet=true
      shift
      ;;
    --master)
      build_master=true
      shift
      ;;
    --master-offline)
      build_master=true
      build_master_offline=true
      shift
      ;;
    --metric)
      build_metric=true
      shift
      ;;
    --no-cache)
      no_cache=true
      shift
      ;;
    --only)
      if [[ -z ${2:-} ]]; then
        echo "--only requires a target list" >&2; exit 1
      fi
      sel="$2"; shift 2
      # reset all, then enable selected
      build_core=false; build_master=false; build_metric=false; build_web=false; build_alert=false; build_sys=false; build_gpu_bundle=false; build_cpu_bundle=false; build_server_pkg=false; build_client_pkg=false
      IFS=',' read -ra parts <<< "$sel"
      for p in "${parts[@]}"; do
        case "$p" in
          core) build_core=true ;;
          master) build_master=true ;;
          metric) build_metric=true ;;
          web) build_web=true ;;
          alert) build_alert=true ;;
          sys) build_sys=true ;;
          gpu_bundle) build_gpu_bundle=true ;;
          cpu_bundle) build_cpu_bundle=true ;;
          server_pkg) build_server_pkg=true; build_core=true; build_master=true; build_metric=true; build_web=true; build_alert=true ;;
          client_pkg) build_client_pkg=true ;;
          all) build_core=true; build_master=true; build_metric=true; build_web=true; build_alert=true; build_sys=true ;;
          *) echo "Unknown --only target: $p" >&2; exit 1 ;;
        esac
      done
      ;;
    --version)
      if [[ -z ${2:-} ]]; then echo "--version requires a value like 20251112" >&2; exit 1; fi
      bundle_date="$2"; shift 2
      ;;
    --client-semver)
      if [[ -z ${2:-} ]]; then echo "--client-semver requires a value like 1.43.0" >&2; exit 1; fi
      client_semver="$2"; shift 2
      ;;
    --cuda)
      if [[ -z ${2:-} ]]; then echo "--cuda requires a value like 12.2.2" >&2; exit 1; fi
      cuda_ver="$2"; shift 2
      ;;
    --tag-latest)
      tag_latest=true
      shift
      ;;
    -h|--help)
      show_help
      exit 0
      ;;
    *)
      echo "Unknown option: $1" >&2
      show_help
      exit 1
      ;;
  esac
 done
 root="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 . "$root/scripts/common/build_user.sh"
 declare -a build_args=()
 if [[ "$use_intranet" == true ]]; then
  build_args+=("--build-arg" "USE_INTRANET=true")
 fi
 cd "$root"
 # Set default image tag policy before building
 if [[ "$build_server_pkg" == true ]]; then
  DEFAULT_IMAGE_TAG="${bundle_date:-latest}"
 fi
 # Select build user profile for pkg vs default
 if [[ "$build_server_pkg" == true || "$build_client_pkg" == true ]]; then
  export ARGUS_BUILD_PROFILE=pkg
 fi
 load_build_user
 build_args+=("--build-arg" "ARGUS_BUILD_UID=${ARGUS_BUILD_UID}" "--build-arg" "ARGUS_BUILD_GID=${ARGUS_BUILD_GID}")
 if [[ "$no_cache" == true ]]; then
  build_args+=("--no-cache")
 fi
 master_root="$root/src/master"
 master_offline_tar="$master_root/offline_wheels.tar.gz"
 master_offline_dir="$master_root/offline_wheels"
 if [[ "$build_master_offline" == true ]]; then
  if [[ ! -f "$master_offline_tar" ]]; then
    echo "❌ offline wheels tar not found: $master_offline_tar" >&2
    echo "   Prepare offline_wheels.tar.gz before running with --master-offline" >&2
    exit 1
  fi
  echo "📦 Preparing offline wheels for master (extracting $master_offline_tar)"
  rm -rf "$master_offline_dir"
  mkdir -p "$master_offline_dir"
  tar -xzf "$master_offline_tar" -C "$master_root"
  has_wheel=$(find "$master_offline_dir" -maxdepth 1 -type f -name '*.whl' -print -quit)
  if [[ -z "$has_wheel" ]]; then
    echo "❌ offline_wheels extraction failed or no wheel files found: $master_offline_dir" >&2
    exit 1
  fi
  # For ARM builds, offline mode is still controlled through the USE_OFFLINE/USE_INTRANET build args in the Dockerfile
  build_args+=("--build-arg" "USE_OFFLINE=1" "--build-arg" "USE_INTRANET=true")
 fi
 echo "======================================="
 echo "ARGUS Unified Build System"
 echo "======================================="
 if [[ "$use_intranet" == true ]]; then
  echo "🌐 Mode: Intranet (Using internal mirror: 10.68.64.1)"
 else
  echo "🌐 Mode: Public (Using default package sources)"
 fi
 echo "👤 Build user UID:GID -> ${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}"
 echo "📁 Build context: $root"
 echo ""
 build_image() {
  local image_name=$1
  local dockerfile_path=$2
  local tag=$3
  local context="."
  shift 3
  if [[ $# -gt 0 ]]; then
    context=$1
    shift
  fi
  local extra_args=("$@")
  # ARM-specific: prefer a Dockerfile with the .arm64 suffix when one exists
  local dockerfile_for_arch="$dockerfile_path"
  if [[ -n "${ARGUS_TARGET_ARCH:-}" && "$ARGUS_TARGET_ARCH" == "arm64" ]]; then
    if [[ -f "${dockerfile_path}.arm64" ]]; then
      dockerfile_for_arch="${dockerfile_path}.arm64"
    fi
  fi
  echo "🔄 Building $image_name image..."
  echo "   Dockerfile: $dockerfile_for_arch"
  echo "   Tag: $tag"
  echo "   Context: $context"
  local tries=${ARGUS_BUILD_RETRIES:-3}
  local delay=${ARGUS_BUILD_RETRY_DELAY:-5}
  local attempt=1
  # When building ARM images on a non-ARM host, use buildx with --platform=linux/arm64
  local use_buildx=false
  if [[ "${ARGUS_TARGET_ARCH:-}" == "arm64" && "$(uname -m)" != "aarch64" ]]; then
    use_buildx=true
  fi
  while (( attempt <= tries )); do
    echo "   Attempt ${attempt}/${tries}"
    if [[ "$use_buildx" == true ]]; then
      # Build the ARM64 image via buildx on a non-ARM host such as x86_64
      if docker buildx build \
           --builder "${ARGUS_BUILDX_BUILDER}" \
           --platform=linux/arm64 \
           "${build_args[@]}" "${extra_args[@]}" \
           -f "$dockerfile_for_arch" \
           -t "$tag" \
           "$context" \
           --load; then
        echo "✅ $image_name image built successfully (via buildx, platform=linux/arm64)"
        return 0
      fi
    else
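      # On an ARM host, call docker build directly (keeping the original DOCKER_BUILDKIT=0 fallback on the final attempt)
      # Use an env prefix array instead of eval so that arguments containing spaces are not re-split
      local -a build_env=(env)
      if (( attempt == tries )); then
        build_env+=(DOCKER_BUILDKIT=0)
        echo "   (final attempt with DOCKER_BUILDKIT=0)"
      fi
      if "${build_env[@]}" docker build "${build_args[@]}" "${extra_args[@]}" -f "$dockerfile_for_arch" -t "$tag" "$context"; then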
        echo "✅ $image_name image built successfully"
        return 0
      fi
    fi
    echo "⚠️  Build failed for $image_name (attempt ${attempt}/${tries})."
    if (( attempt < tries )); then
      echo "   Retrying in ${delay}s..."
      sleep "$delay"
    fi
    attempt=$((attempt+1))
  done
  echo "❌ Failed to build $image_name image after ${tries} attempts"
  return 1
 }
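 # Usage sketch (not part of the original script): the retry loop in build_image reads
 # ARGUS_BUILD_RETRIES and ARGUS_BUILD_RETRY_DELAY, and the builder name comes from
 # ARGUS_BUILDX_BUILDER, so a run on an unreliable network might look like:
 #   ARGUS_BUILD_RETRIES=5 ARGUS_BUILD_RETRY_DELAY=10 ./build/build_images_for_arm.sh --only core,master
 # The values 5 and 10 are illustrative assumptions, not recommended defaults.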
 pull_base_image() {
  local image_ref=$1
  local attempts=${2:-3}
  local delay=${3:-5}
  # If the image already exists locally, skip pulling.
  if docker image inspect "$image_ref" >/dev/null 2>&1; then
    echo "   Local image present; skip pull: $image_ref"
    return 0
  fi
  for ((i=1; i<=attempts; i++)); do
    echo "   Pulling base image ($i/$attempts): $image_ref"
    if docker pull "$image_ref" >/dev/null; then
      echo "   Base image ready: $image_ref"
      return 0
    fi
    echo "   Pull failed: $image_ref"
    if (( i < attempts )); then
      echo "   Retrying in ${delay}s..."
      sleep "$delay"
    fi
  done
  echo "❌ Unable to pull base image after ${attempts} attempts: $image_ref"
  return 1
 }
 images_built=()
 build_failed=false
 build_gpu_bundle_image() {
  local date_tag="$1"      # e.g. 20251112
  local cuda_ver_local="$2" # e.g. 12.2.2
  local client_ver="$3"     # semver like 1.43.0
  if [[ -z "$date_tag" ]]; then
    echo "❌ gpu_bundle requires --version YYYYMMDD (e.g. 20251112)" >&2
    return 1
  fi
  # sanitize cuda version (trim trailing dots like '12.2.')
  while [[ "$cuda_ver_local" == *"." ]]; do cuda_ver_local="${cuda_ver_local%.}"; done
  # Resolve effective CUDA base tag
  local resolve_cuda_base_tag
  resolve_cuda_base_tag() {
    local want="$1" # can be 12, 12.2 or 12.2.2
    local major minor patch
    if [[ "$want" =~ ^([0-9]+)\.([0-9]+)\.([0-9]+)$ ]]; then
      major="${BASH_REMATCH[1]}"; minor="${BASH_REMATCH[2]}"; patch="${BASH_REMATCH[3]}"
      echo "nvidia/cuda:${major}.${minor}.${patch}-runtime-ubuntu22.04"; return 0
    elif [[ "$want" =~ ^([0-9]+)\.([0-9]+)$ ]]; then
      major="${BASH_REMATCH[1]}"; minor="${BASH_REMATCH[2]}"
      # try to find best local patch for major.minor
      local best
      best=$(docker images --format '{{.Repository}}:{{.Tag}}' nvidia/cuda 2>/dev/null | \
             grep -E "^nvidia/cuda:${major}\.${minor}\\.[0-9]+-runtime-ubuntu22\.04$" | \
             sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.)([0-9]+)-runtime-ubuntu22\.04$#\1\2#g' | \
             sort -V | tail -n1 || true)
      if [[ -n "$best" ]]; then
        echo "nvidia/cuda:${best}-runtime-ubuntu22.04"; return 0
      fi
      # fallback patch if none local
      echo "nvidia/cuda:${major}.${minor}.2-runtime-ubuntu22.04"; return 0
    elif [[ "$want" =~ ^([0-9]+)$ ]]; then
      major="${BASH_REMATCH[1]}"
      # try to find best local for this major
      local best
      best=$(docker images --format '{{.Repository}}:{{.Tag}}' nvidia/cuda 2>/dev/null | \
             grep -E "^nvidia/cuda:${major}\\.[0-9]+\\.[0-9]+-runtime-ubuntu22\.04$" | \
             sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.[0-9]+)-runtime-ubuntu22\.04$#\1#g' | \
             sort -V | tail -n1 || true)
      if [[ -n "$best" ]]; then
        echo "nvidia/cuda:${best}-runtime-ubuntu22.04"; return 0
      fi
      echo "nvidia/cuda:${major}.2.2-runtime-ubuntu22.04"; return 0
    else
      # invalid format, fallback to default
      echo "nvidia/cuda:12.2.2-runtime-ubuntu22.04"; return 0
    fi
  }
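  # Illustrative mappings for resolve_cuda_base_tag, assuming no matching nvidia/cuda
  # image is cached locally (a locally cached image with a higher patch would win instead):
  #   12.2.2 -> nvidia/cuda:12.2.2-runtime-ubuntu22.04
  #   12.4   -> nvidia/cuda:12.4.2-runtime-ubuntu22.04   (falls back to patch .2)
  #   12     -> nvidia/cuda:12.2.2-runtime-ubuntu22.04   (falls back to .2.2)
  #   abc    -> nvidia/cuda:12.2.2-runtime-ubuntu22.04   (invalid format, default tag)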
  local base_image
  base_image=$(resolve_cuda_base_tag "$cuda_ver_local")
  echo
  echo "🔧 Preparing one-click GPU bundle build"
  echo "   CUDA runtime base: ${base_image}"
  echo "   Bundle tag       : ${date_tag}"
  # 1) Ensure NVIDIA base image (skip pull if local)
  if ! pull_base_image "$base_image"; then
    # try once more with default if resolution failed
    if ! pull_base_image "nvidia/cuda:12.2.2-runtime-ubuntu22.04"; then
      return 1
    else
      base_image="nvidia/cuda:12.2.2-runtime-ubuntu22.04"
    fi
  fi
  # 2) Build latest argus-agent from source
  printf '\n%s\n' "🛠  Building argus-agent from src/agent"
  pushd "$root/src/agent" >/dev/null
  if ! bash scripts/build_binary.sh; then
    echo "❌ argus-agent build failed" >&2
    popd >/dev/null
    return 1
  fi
  if [[ ! -f "dist/argus-agent" ]]; then
    echo "❌ argus-agent binary missing after build" >&2
    popd >/dev/null
    return 1
  fi
  popd >/dev/null
  # 3) Inject agent into all-in-one-full plugin and package artifact
  local aio_root="$root/src/metric/client-plugins/all-in-one-full"
  local agent_bin_src="$root/src/agent/dist/argus-agent"
  local agent_bin_dst="$aio_root/plugins/argus-agent/bin/argus-agent"
  printf '\n%s\n' "📦 Updating all-in-one-full agent binary → $agent_bin_dst"
  cp -f "$agent_bin_src" "$agent_bin_dst"
  chmod +x "$agent_bin_dst" || true
  pushd "$aio_root" >/dev/null
  local prev_version
  prev_version="$(cat config/VERSION 2>/dev/null || echo "1.0.0")"
  local use_version="$prev_version"
  if [[ -n "$client_semver" ]]; then
    echo "${client_semver}" > config/VERSION
    use_version="$client_semver"
  fi
  echo "   Packaging all-in-one-full artifact version: $use_version"
  if ! bash scripts/package_artifact.sh --force; then
    echo "❌ package_artifact.sh failed" >&2
    # restore VERSION if changed
    if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
    popd >/dev/null
    return 1
  fi
  local artifact_dir="$aio_root/artifact/$use_version"
  local artifact_tar
  artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  if [[ -z "$artifact_tar" ]]; then
    echo "   No argus-metric_*.tar.gz found; invoking publish_artifact.sh to assemble..."
    local owner="$(id -u):$(id -g)"
    if ! bash scripts/publish_artifact.sh "$use_version" --output-dir "$artifact_dir" --owner "$owner"; then
      echo "❌ publish_artifact.sh failed" >&2
      if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
      popd >/dev/null
      return 1
    fi
    artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  fi
  if [[ -z "$artifact_tar" ]]; then
    echo "❌ artifact tar not found under $artifact_dir" >&2
    if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
    popd >/dev/null
    return 1
  fi
  # restore VERSION if changed (keep filesystem clean)
  if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
  popd >/dev/null
  # 4) Stage docker build context
  local bundle_ctx="$root/src/bundle/gpu-node-bundle/.build-$date_tag"
  printf '\n%s\n' "🧰 Staging docker build context: $bundle_ctx"
  rm -rf "$bundle_ctx"
  mkdir -p "$bundle_ctx/bundle" "$bundle_ctx/private"
  cp "$root/src/bundle/gpu-node-bundle/Dockerfile" "$bundle_ctx/"
  cp "$root/src/bundle/gpu-node-bundle/node-bootstrap.sh" "$bundle_ctx/"
  cp "$root/src/bundle/gpu-node-bundle/health-watcher.sh" "$bundle_ctx/"
  # bundle tar
  cp "$artifact_tar" "$bundle_ctx/bundle/"
  # offline fluent-bit assets (optional but useful)
  if [[ -d "$root/src/log/fluent-bit/build/etc" ]]; then
    cp -r "$root/src/log/fluent-bit/build/etc" "$bundle_ctx/private/"
  fi
  if [[ -d "$root/src/log/fluent-bit/build/packages" ]]; then
    cp -r "$root/src/log/fluent-bit/build/packages" "$bundle_ctx/private/"
  fi
  if [[ -f "$root/src/log/fluent-bit/build/start-fluent-bit.sh" ]]; then
    cp "$root/src/log/fluent-bit/build/start-fluent-bit.sh" "$bundle_ctx/private/"
  fi
  # 5) Build the final bundle image (directly from NVIDIA base)
  local image_tag="argus-sys-metric-test-node-bundle-gpu-arm64:${date_tag}"
  printf '\n%s\n' "🔄 Building GPU Bundle image"
  if build_image "GPU Bundle" "$bundle_ctx/Dockerfile" "$image_tag" "$bundle_ctx" \
      --build-arg CUDA_VER="$(echo "$base_image" | sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.[0-9]+)-runtime-ubuntu22\.04$#\1#')" \
      --build-arg CLIENT_VER="$use_version" \
      --build-arg BUNDLE_DATE="$date_tag"; then
    images_built+=("$image_tag")
    # In non-pkg mode, also tag latest for convenience
    if [[ "${ARGUS_PKG_BUILD:-0}" != "1" ]]; then
      docker tag "$image_tag" argus-sys-metric-test-node-bundle-gpu-arm64:latest >/dev/null 2>&1 || true
    fi
    return 0
  else
    return 1
  fi
 }
 # Tag helper: ensure :<date_tag> exists for a list of repos
 ensure_version_tags() {
  local date_tag="$1"; shift
  local repos=("$@")
  for repo in "${repos[@]}"; do
    if docker image inspect "$repo:$date_tag" >/dev/null 2>&1; then
      :
    elif docker image inspect "$repo:latest" >/dev/null 2>&1; then
      docker tag "$repo:latest" "$repo:$date_tag" || true
    else
      echo "❌ missing image for tagging: $repo (need :latest or :$date_tag)" >&2
      return 1
    fi
  done
  return 0
 }
 # Build server package after images are built
 build_server_pkg_bundle() {
  local date_tag="$1"
  if [[ -z "$date_tag" ]]; then
    echo "❌ server_pkg requires --version YYYYMMDD" >&2
    return 1
  fi
  local repos=(
    argus-bind9-arm64 argus-master-arm64 argus-elasticsearch-arm64 argus-kibana-arm64 \
    argus-metric-ftp-arm64 argus-metric-prometheus-arm64 argus-metric-grafana-arm64 \
    argus-alertmanager-arm64 argus-web-frontend-arm64 argus-web-proxy-arm64
  )
  printf '\n%s\n' "🔖 Verifying server images with :$date_tag and collecting digests"
  for repo in "${repos[@]}"; do
    if ! docker image inspect "$repo:$date_tag" >/dev/null 2>&1; then
      echo "❌ required image missing: $repo:$date_tag (build phase should have produced it)" >&2
      return 1
    fi
  done
  # Optional: show digests
  for repo in "${repos[@]}"; do
    local digest
    digest=$(docker images --digests --format '{{.Repository}}:{{.Tag}} {{.Digest}}' | awk -v r="$repo:$date_tag" '$1==r{print $2}' | head -n1)
    printf '   • %s@%s\n' "$repo:$date_tag" "${digest:-<none>}"
  done
  printf '\n%s\n' "📦 Building server package via deployment_new/build/make_server_package.sh --version $date_tag"
  if ! "$root/deployment_new/build/make_server_package.sh" --version "$date_tag"; then
    echo "❌ make_server_package.sh failed" >&2
    return 1
  fi
  return 0
 }
 # Build client package: ensure gpu bundle image exists, then package client_gpu
 build_client_pkg_bundle() {
  local date_tag="$1"
  local semver="$2"
  local cuda="$3"
  if [[ -z "$date_tag" ]]; then
    echo "❌ client_pkg requires --version YYYYMMDD" >&2
    return 1
  fi
  local bundle_tag="argus-sys-metric-test-node-bundle-gpu-arm64:${date_tag}"
  if ! docker image inspect "$bundle_tag" >/dev/null 2>&1; then
    printf '\n%s\n' "🧩 GPU bundle image $bundle_tag missing; building it first..."
    ARGUS_PKG_BUILD=1
    export ARGUS_PKG_BUILD
    if ! build_gpu_bundle_image "$date_tag" "$cuda" "$semver"; then
      return 1
    fi
  else
    printf '\n%s\n' "✅ Using existing GPU bundle image: $bundle_tag"
  fi
  printf '\n%s\n' "📦 Building client GPU package via deployment_new/build/make_client_gpu_package.sh --version $date_tag --image $bundle_tag"
  if ! "$root/deployment_new/build/make_client_gpu_package.sh" --version "$date_tag" --image "$bundle_tag"; then
    echo "❌ make_client_gpu_package.sh failed" >&2
    return 1
  fi
  return 0
 }
 # Build CPU bundle image directly FROM ubuntu:22.04 (no intermediate base)
 build_cpu_bundle_image() {
  local date_tag="$1"         # e.g. 20251113
  local client_ver_in="$2"    # semver like 1.43.0 (optional)
  local want_tag_latest="$3"   # true/false
  if [[ -z "$date_tag" ]]; then
    echo "❌ cpu_bundle requires --version YYYYMMDD" >&2
    return 1
  fi
  printf '\n%s\n' "🔧 Preparing one-click CPU bundle build"
  echo "   Base: ubuntu:22.04"
  echo "   Bundle tag: ${date_tag}"
  # 1) Build latest argus-agent from source
  printf '\n%s\n' "🛠  Building argus-agent from src/agent"
  pushd "$root/src/agent" >/dev/null
  if ! bash scripts/build_binary.sh; then
    echo "❌ argus-agent build failed" >&2
    popd >/dev/null
    return 1
  fi
  if [[ ! -f "dist/argus-agent" ]]; then
    echo "❌ argus-agent binary missing after build" >&2
    popd >/dev/null
    return 1
  fi
  popd >/dev/null
  # 2) Inject agent into all-in-one-full plugin and package artifact
  local aio_root="$root/src/metric/client-plugins/all-in-one-full"
  local agent_bin_src="$root/src/agent/dist/argus-agent"
  local agent_bin_dst="$aio_root/plugins/argus-agent/bin/argus-agent"
  printf '\n%s\n' "📦 Updating all-in-one-full agent binary → $agent_bin_dst"
  cp -f "$agent_bin_src" "$agent_bin_dst"
  chmod +x "$agent_bin_dst" || true
  pushd "$aio_root" >/dev/null
  local prev_version use_version
  prev_version="$(cat config/VERSION 2>/dev/null || echo "1.0.0")"
  use_version="$prev_version"
  if [[ -n "$client_ver_in" ]]; then
    echo "$client_ver_in" > config/VERSION
    use_version="$client_ver_in"
  fi
  echo "   Packaging all-in-one-full artifact: version=$use_version"
  if ! bash scripts/package_artifact.sh --force; then
    echo "❌ package_artifact.sh failed" >&2
    [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
    popd >/dev/null
    return 1
  fi
  local artifact_dir="$aio_root/artifact/$use_version"
  local artifact_tar
  artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  if [[ -z "$artifact_tar" ]]; then
    echo "   No argus-metric_*.tar.gz found; invoking publish_artifact.sh ..."
    local owner="$(id -u):$(id -g)"
    if ! bash scripts/publish_artifact.sh "$use_version" --output-dir "$artifact_dir" --owner "$owner"; then
      echo "❌ publish_artifact.sh failed" >&2
      [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
      popd >/dev/null
      return 1
    fi
    artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  fi
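  # Fail early if no artifact was produced (mirrors the check in build_gpu_bundle_image)
  if [[ -z "$artifact_tar" ]]; then
    echo "❌ artifact tar not found under $artifact_dir" >&2
    [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
    popd >/dev/null
    return 1
  fi
  [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION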
  popd >/dev/null
  # 3) Stage docker build context
  local bundle_ctx="$root/src/bundle/cpu-node-bundle/.build-$date_tag"
  printf '\n%s\n' "🧰 Staging docker build context: $bundle_ctx"
  rm -rf "$bundle_ctx"
  mkdir -p "$bundle_ctx/bundle" "$bundle_ctx/private"
  cp "$root/src/bundle/cpu-node-bundle/Dockerfile" "$bundle_ctx/"
  cp "$root/src/bundle/cpu-node-bundle/node-bootstrap.sh" "$bundle_ctx/"
  cp "$root/src/bundle/cpu-node-bundle/health-watcher.sh" "$bundle_ctx/"
  # bundle tar
  cp "$artifact_tar" "$bundle_ctx/bundle/"
  # offline fluent-bit assets
  if [[ -d "$root/src/log/fluent-bit/build/etc" ]]; then
    cp -r "$root/src/log/fluent-bit/build/etc" "$bundle_ctx/private/"
  fi
  if [[ -d "$root/src/log/fluent-bit/build/packages" ]]; then
    cp -r "$root/src/log/fluent-bit/build/packages" "$bundle_ctx/private/"
  fi
  if [[ -f "$root/src/log/fluent-bit/build/start-fluent-bit.sh" ]]; then
    cp "$root/src/log/fluent-bit/build/start-fluent-bit.sh" "$bundle_ctx/private/"
  fi
  # 4) Build final bundle image
  local image_tag="argus-sys-metric-test-node-bundle-arm64:${date_tag}"
  printf '\n%s\n' "🔄 Building CPU Bundle image"
  if build_image "CPU Bundle" "$bundle_ctx/Dockerfile" "$image_tag" "$bundle_ctx"; then
    images_built+=("$image_tag")
    # Also tag an alias without the -arm64 suffix for compatibility with existing compose files and deployments
    docker tag "$image_tag" "argus-sys-metric-test-node-bundle:${date_tag}" >/dev/null 2>&1 || true
    if [[ "$want_tag_latest" == "true" ]]; then
      docker tag "$image_tag" argus-sys-metric-test-node-bundle-arm64:latest >/dev/null 2>&1 || true
      docker tag "$image_tag" argus-sys-metric-test-node-bundle:latest >/dev/null 2>&1 || true
    fi
    return 0
  else
    return 1
  fi
 }
 if [[ "$build_core" == true ]]; then
  if build_image "Elasticsearch" "src/log/elasticsearch/build/Dockerfile" "argus-elasticsearch-arm64:${DEFAULT_IMAGE_TAG}"; then
    images_built+=("argus-elasticsearch-arm64:${DEFAULT_IMAGE_TAG}")
  else
    build_failed=true
  fi
  echo ""
  if build_image "Kibana" "src/log/kibana/build/Dockerfile" "argus-kibana-arm64:${DEFAULT_IMAGE_TAG}"; then
    images_built+=("argus-kibana-arm64:${DEFAULT_IMAGE_TAG}")
  else
    build_failed=true
  fi
  echo ""
  if build_image "BIND9" "src/bind/build/Dockerfile" "argus-bind9-arm64:${DEFAULT_IMAGE_TAG}"; then
    images_built+=("argus-bind9-arm64:${DEFAULT_IMAGE_TAG}")
  else
    build_failed=true
  fi
 fi
 echo ""
 if [[ "$build_master" == true ]]; then
  echo ""
  echo "🔄 Building Master image..."
  # Reuse the generic build_image function to build the ARM64 master image via buildx
  if build_image "Master" "src/master/Dockerfile" "argus-master-arm64:${DEFAULT_IMAGE_TAG}" "."; then
    images_built+=("argus-master-arm64:${DEFAULT_IMAGE_TAG}")
  else
    build_failed=true
  fi
 fi
 if [[ "$build_metric" == true ]]; then
  echo ""
  echo "Building Metric module images..."
  metric_base_images=(
    "ubuntu:22.04"
    "prom/prometheus:v3.5.0"
    "grafana/grafana:11.1.0"
  )
  for base_image in "${metric_base_images[@]}"; do
    if ! pull_base_image "$base_image"; then
      build_failed=true
    fi
  done
  metric_builds=(
    "Metric FTP|src/metric/ftp/build/Dockerfile|argus-metric-ftp-arm64:${DEFAULT_IMAGE_TAG}|src/metric/ftp/build"
    "Metric Prometheus|src/metric/prometheus/build/Dockerfile|argus-metric-prometheus-arm64:${DEFAULT_IMAGE_TAG}|src/metric/prometheus/build"
    "Metric Grafana|src/metric/grafana/build/Dockerfile|argus-metric-grafana-arm64:${DEFAULT_IMAGE_TAG}|src/metric/grafana/build"
    "Metric Prometheus Targets Updater|src/metric/prometheus/build/Dockerfile.targets-updater|argus-metric-prometheus-targets-updater-arm64:${DEFAULT_IMAGE_TAG}|src/metric/prometheus/build"
  )
  for build_spec in "${metric_builds[@]}"; do
    IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
    if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
      images_built+=("$image_tag")
    else
      build_failed=true
    fi
    echo ""
  done
 fi
 # =======================================
 # Sys (system tests) node images
 # =======================================
 if [[ "$build_sys" == true ]]; then
  echo ""
  echo "Building Sys node images..."
  sys_base_images=(
    "ubuntu:22.04"
  )
  # GPU images are currently supported on x86_64 only; skip pulling the nvidia/cuda base image on ARM
  if [[ "${ARGUS_TARGET_ARCH:-}" != "arm64" ]]; then
    sys_base_images+=("nvidia/cuda:12.2.2-runtime-ubuntu22.04")
  fi
  for base_image in "${sys_base_images[@]}"; do
    if ! pull_base_image "$base_image"; then
      build_failed=true
    fi
  done
  sys_builds=(
    "Sys Node|src/sys/build/node/Dockerfile|argus-sys-node-arm64:latest|."
    "Sys Metric Test Node|src/sys/build/arm-cpu-node/Dockerfile|argus-sys-metric-test-node-arm64:latest|."
  )
  # The GPU test node image is built on the x86_64 path only; the ARM build does not support DCGM/GPU yet
  if [[ "${ARGUS_TARGET_ARCH:-}" != "arm64" ]]; then
    sys_builds+=("Sys Metric Test GPU Node|src/sys/build/test-gpu-node/Dockerfile|argus-sys-metric-test-gpu-node:latest|.")
  fi
  for build_spec in "${sys_builds[@]}"; do
    IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
    if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
      images_built+=("$image_tag")
      # Keep compatibility with the legacy NODE_BUNDLE_IMAGE_TAG: tag bundle aliases for the ARM CPU node image
      if [[ "$image_tag" == "argus-sys-metric-test-node-arm64:latest" ]]; then
        docker tag "$image_tag" argus-sys-metric-test-node-bundle-arm64:latest >/dev/null 2>&1 || true
        docker tag "$image_tag" argus-sys-metric-test-node-bundle:latest >/dev/null 2>&1 || true
      fi
    else
      build_failed=true
    fi
    echo ""
  done
 fi
 # =======================================
 # Web & Alert module images
 # =======================================
 if [[ "$build_web" == true || "$build_alert" == true ]]; then
  echo ""
  echo "Building Web and Alert module images..."
  # Pre-pull commonly used base images for stability
  web_alert_base_images=(
    "node:20"
    "ubuntu:24.04"
  )
  for base_image in "${web_alert_base_images[@]}"; do
    if ! pull_base_image "$base_image"; then
      build_failed=true
    fi
  done
  if [[ "$build_web" == true ]]; then
    web_builds=(
      "Web Frontend|src/web/build_tools/frontend/Dockerfile|argus-web-frontend-arm64:${DEFAULT_IMAGE_TAG}|."
      "Web Proxy|src/web/build_tools/proxy/Dockerfile|argus-web-proxy-arm64:${DEFAULT_IMAGE_TAG}|."
    )
    for build_spec in "${web_builds[@]}"; do
      IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
      if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
        images_built+=("$image_tag")
      else
        build_failed=true
      fi
      echo ""
    done
  fi
  if [[ "$build_alert" == true ]]; then
    alert_builds=(
      "Alertmanager|src/alert/alertmanager/build/Dockerfile|argus-alertmanager-arm64:${DEFAULT_IMAGE_TAG}|."
    )
    for build_spec in "${alert_builds[@]}"; do
      IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
      if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
        images_built+=("$image_tag")
      else
        build_failed=true
      fi
      echo ""
    done
  fi
 fi
 # =======================================
 # One-click GPU bundle (direct NVIDIA base)
 # =======================================
 if [[ "$build_gpu_bundle" == true ]]; then
  echo ""
  echo "Building one-click GPU bundle image..."
  if ! build_gpu_bundle_image "$bundle_date" "$cuda_ver" "$client_semver"; then
    build_failed=true
  fi
 fi
 # =======================================
 # One-click CPU bundle (from ubuntu:22.04)
 # =======================================
 if [[ "$build_cpu_bundle" == true ]]; then
  echo ""
  echo "Building one-click CPU bundle image..."
  if ! build_cpu_bundle_image "${bundle_date}" "${client_semver}" "${tag_latest}"; then
    build_failed=true
  fi
 fi
 # =======================================
 # One-click Server/Client packaging
 # =======================================
 if [[ "$build_server_pkg" == true ]]; then
  echo ""
  echo "🧳 Building one-click Server package..."
  if ! build_server_pkg_bundle "${bundle_date}"; then
    build_failed=true
  fi
 fi
 if [[ "$build_client_pkg" == true ]]; then
  echo ""
  echo "🧳 Building one-click Client-GPU package..."
  if ! build_client_pkg_bundle "${bundle_date}" "${client_semver}" "${cuda_ver}"; then
    build_failed=true
  fi
 fi
 echo "======================================="
 echo "📦 Build Summary"
 echo "======================================="
 if [[ ${#images_built[@]} -gt 0 ]]; then
  echo "✅ Successfully built images:"
  for image in "${images_built[@]}"; do
    echo "   • $image"
  done
 fi
 if [[ "$build_failed" == true ]]; then
  echo ""
  echo "❌ Some images failed to build. Please check the errors above."
  exit 1
 fi
 if [[ "$use_intranet" == true ]]; then
  echo ""
  echo "🌐 Built with intranet mirror configuration"
 fi
 if [[ "$build_master_offline" == true ]]; then
  echo ""
  echo "🧳 Master offline wheels extracted to $master_offline_dir"
 fi
 echo ""
 echo "🚀 Next steps:"
 echo "   ./build/save_images.sh --compress          # export images"
 echo "   cd src/master/tests && MASTER_IMAGE_TAG=argus-master:offline ./scripts/00_e2e_test.sh"
 echo ""
--- a/build/build_images_for_x64.sh
+++ b/build/build_images_for_x64.sh
@ -0,0 +1,875 @@
 #!/usr/bin/env bash
 set -euo pipefail
 # ARGUS x86_64 Image Build Entry
 # This script builds the Argus images on x86_64 hosts;
 # its logic matches the historical build/build_images.sh.
 show_help() {
  cat <<EOF
 ARGUS Unified Build System - Image Build Tool
 Usage: $0 [OPTIONS]
 Options:
  --intranet        Use intranet mirror for log/bind builds
  --master-offline  Build master offline image (requires src/master/offline_wheels.tar.gz)
  --metric          Build metric module images (ftp, prometheus, grafana, test nodes)
  --no-cache        Build all images without using Docker layer cache
  --only LIST       Comma-separated targets to build: core,master,metric,web,alert,sys,gpu_bundle,cpu_bundle,server_pkg,client_pkg,all
  --version DATE    Date tag used by gpu_bundle/cpu_bundle/server_pkg/client_pkg (e.g. 20251112)
  --client-semver X.Y.Z  Override client semver used in all-in-one-full artifact (optional)
  --cuda VER        CUDA runtime version for NVIDIA base (default: 12.2.2)
  --tag-latest      Also tag bundle image as :latest (for cpu_bundle only; default off)
  -h, --help        Show this help message
 Examples:
  $0                             # Build with default sources
  $0 --intranet                  # Build with intranet mirror
  $0 --master-offline            # Additionally build argus-master:offline
  $0 --metric                    # Additionally build metric module images
  $0 --intranet --master-offline --metric
 EOF
 }
 use_intranet=false
 build_core=true
 build_master=true
 build_master_offline=false
 build_metric=true
 build_web=true
 build_alert=true
 build_sys=true
 build_gpu_bundle=false
 build_cpu_bundle=false
 build_server_pkg=false
 build_client_pkg=false
 no_cache=false
 bundle_date=""
 client_semver=""
 cuda_ver="12.2.2"
 DEFAULT_IMAGE_TAG="latest"
 tag_latest=false
 while [[ $# -gt 0 ]]; do
  case $1 in
    --intranet)
      use_intranet=true
      shift
      ;;
    --master)
      build_master=true
      shift
      ;;
    --master-offline)
      build_master=true
      build_master_offline=true
      shift
      ;;
    --metric)
      build_metric=true
      shift
      ;;
    --no-cache)
      no_cache=true
      shift
      ;;
    --only)
      if [[ -z ${2:-} ]]; then
        echo "--only requires a target list" >&2; exit 1
      fi
      sel="$2"; shift 2
      # reset all, then enable selected
      build_core=false; build_master=false; build_metric=false; build_web=false; build_alert=false; build_sys=false; build_gpu_bundle=false; build_cpu_bundle=false; build_server_pkg=false; build_client_pkg=false
      IFS=',' read -ra parts <<< "$sel"
      for p in "${parts[@]}"; do
        case "$p" in
          core) build_core=true ;;
          master) build_master=true ;;
          metric) build_metric=true ;;
          web) build_web=true ;;
          alert) build_alert=true ;;
          sys) build_sys=true ;;
          gpu_bundle) build_gpu_bundle=true ;;
          cpu_bundle) build_cpu_bundle=true ;;
          server_pkg) build_server_pkg=true; build_core=true; build_master=true; build_metric=true; build_web=true; build_alert=true ;;
          client_pkg) build_client_pkg=true ;;
          all) build_core=true; build_master=true; build_metric=true; build_web=true; build_alert=true; build_sys=true ;;
          *) echo "Unknown --only target: $p" >&2; exit 1 ;;
        esac
      done
      ;;
    --version)
      if [[ -z ${2:-} ]]; then echo "--version requires a value like 20251112" >&2; exit 1; fi
      bundle_date="$2"; shift 2
      ;;
    --client-semver)
      if [[ -z ${2:-} ]]; then echo "--client-semver requires a value like 1.43.0" >&2; exit 1; fi
      client_semver="$2"; shift 2
      ;;
    --cuda)
      if [[ -z ${2:-} ]]; then echo "--cuda requires a value like 12.2.2" >&2; exit 1; fi
      cuda_ver="$2"; shift 2
      ;;
    --tag-latest)
      tag_latest=true
      shift
      ;;
    -h|--help)
      show_help
      exit 0
      ;;
    *)
      echo "Unknown option: $1" >&2
      show_help
      exit 1
      ;;
  esac
 done
 root="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 . "$root/scripts/common/build_user.sh"
 declare -a build_args=()
 if [[ "$use_intranet" == true ]]; then
  build_args+=("--build-arg" "USE_INTRANET=true")
 fi
 cd "$root"
 # Set default image tag policy before building
 if [[ "$build_server_pkg" == true ]]; then
  DEFAULT_IMAGE_TAG="${bundle_date:-latest}"
 fi
 # Select build user profile for pkg vs default
 if [[ "$build_server_pkg" == true || "$build_client_pkg" == true ]]; then
  export ARGUS_BUILD_PROFILE=pkg
 fi
 load_build_user
 build_args+=("--build-arg" "ARGUS_BUILD_UID=${ARGUS_BUILD_UID}" "--build-arg" "ARGUS_BUILD_GID=${ARGUS_BUILD_GID}")
 if [[ "$no_cache" == true ]]; then
  build_args+=("--no-cache")
 fi
 master_root="$root/src/master"
 master_offline_tar="$master_root/offline_wheels.tar.gz"
 master_offline_dir="$master_root/offline_wheels"
 if [[ "$build_master_offline" == true ]]; then
  if [[ ! -f "$master_offline_tar" ]]; then
    echo "❌ offline wheels tar not found: $master_offline_tar" >&2
    echo "   Prepare offline_wheels.tar.gz before running --master-offline" >&2
    exit 1
  fi
  echo "📦 Preparing offline wheels for master (extracting $master_offline_tar)"
  rm -rf "$master_offline_dir"
  mkdir -p "$master_offline_dir"
  tar -xzf "$master_offline_tar" -C "$master_root"
  has_wheel=$(find "$master_offline_dir" -maxdepth 1 -type f -name '*.whl' -print -quit)
  if [[ -z "$has_wheel" ]]; then
    echo "❌ offline_wheels extraction failed or produced no wheel files: $master_offline_dir" >&2
    exit 1
  fi
 fi
 echo "======================================="
 echo "ARGUS Unified Build System"
 echo "======================================="
 if [[ "$use_intranet" == true ]]; then
  echo "🌐 Mode: Intranet (Using internal mirror: 10.68.64.1)"
 else
  echo "🌐 Mode: Public (Using default package sources)"
 fi
 echo "👤 Build user UID:GID -> ${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}"
 echo "📁 Build context: $root"
 echo ""
 build_image() {
  local image_name=$1
  local dockerfile_path=$2
  local tag=$3
  local context="."
  shift 3
  if [[ $# -gt 0 ]]; then
    context=$1
    shift
  fi
  local extra_args=("$@")
  echo "🔄 Building $image_name image..."
  echo "   Dockerfile: $dockerfile_path"
  echo "   Tag: $tag"
  echo "   Context: $context"
  local tries=${ARGUS_BUILD_RETRIES:-3}
  local delay=${ARGUS_BUILD_RETRY_DELAY:-5}
  local attempt=1
  while (( attempt <= tries )); do
    local prefix=""
    if (( attempt == tries )); then
      # final attempt: disable BuildKit to avoid docker/dockerfile front-end pulls
      prefix="DOCKER_BUILDKIT=0"
      echo "   Attempt ${attempt}/${tries} (fallback: DOCKER_BUILDKIT=0)"
    else
      echo "   Attempt ${attempt}/${tries}"
    fi
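    # Use env instead of eval so the arguments are not re-parsed by the shell
    # (eval would split paths or build args containing spaces); $prefix is left unquoted on purpose.
    if env $prefix docker build "${build_args[@]}" "${extra_args[@]}" -f "$dockerfile_path" -t "$tag" "$context"; then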
      echo "✅ $image_name image built successfully"
      return 0
    fi
    echo "⚠️  Build failed for $image_name (attempt ${attempt}/${tries})."
    if (( attempt < tries )); then
      echo "   Retrying in ${delay}s..."
      sleep "$delay"
    fi
    attempt=$((attempt+1))
  done
  echo "❌ Failed to build $image_name image after ${tries} attempts"
  return 1
 }
 pull_base_image() {
  local image_ref=$1
  local attempts=${2:-3}
  local delay=${3:-5}
  # If the image already exists locally, skip pulling.
  if docker image inspect "$image_ref" >/dev/null 2>&1; then
    echo "   Local image present; skip pull: $image_ref"
    return 0
  fi
  for ((i=1; i<=attempts; i++)); do
    echo "   Pulling base image ($i/$attempts): $image_ref"
    if docker pull "$image_ref" >/dev/null; then
      echo "   Base image ready: $image_ref"
      return 0
    fi
    echo "   Pull failed: $image_ref"
    if (( i < attempts )); then
      echo "   Retrying in ${delay}s..."
      sleep "$delay"
    fi
  done
  echo "❌ Unable to pull base image after ${attempts} attempts: $image_ref"
  return 1
 }
 images_built=()
 build_failed=false
 build_gpu_bundle_image() {
  local date_tag="$1"      # e.g. 20251112
  local cuda_ver_local="$2" # e.g. 12.2.2
  local client_ver="$3"     # semver like 1.43.0
  if [[ -z "$date_tag" ]]; then
    echo "❌ gpu_bundle requires --version YYYYMMDD (e.g. 20251112)" >&2
    return 1
  fi
  # sanitize cuda version (trim trailing dots like '12.2.')
  while [[ "$cuda_ver_local" == *"." ]]; do cuda_ver_local="${cuda_ver_local%.}"; done
  # Resolve effective CUDA base tag
  # Helper below is a regular (global) bash function; no "local" declaration applies to functions
  resolve_cuda_base_tag() {
    local want="$1" # can be 12, 12.2 or 12.2.2
    local major minor patch
    if [[ "$want" =~ ^([0-9]+)\.([0-9]+)\.([0-9]+)$ ]]; then
      major="${BASH_REMATCH[1]}"; minor="${BASH_REMATCH[2]}"; patch="${BASH_REMATCH[3]}"
      echo "nvidia/cuda:${major}.${minor}.${patch}-runtime-ubuntu22.04"; return 0
    elif [[ "$want" =~ ^([0-9]+)\.([0-9]+)$ ]]; then
      major="${BASH_REMATCH[1]}"; minor="${BASH_REMATCH[2]}"
      # try to find best local patch for major.minor
      local best
      best=$(docker images --format '{{.Repository}}:{{.Tag}}' nvidia/cuda 2>/dev/null | \
             grep -E "^nvidia/cuda:${major}\.${minor}\\.[0-9]+-runtime-ubuntu22\.04$" | \
             sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.)([0-9]+)-runtime-ubuntu22\.04$#\1\2#g' | \
             sort -V | tail -n1 || true)
      if [[ -n "$best" ]]; then
        echo "nvidia/cuda:${best}-runtime-ubuntu22.04"; return 0
      fi
      # fallback patch if none local
      echo "nvidia/cuda:${major}.${minor}.2-runtime-ubuntu22.04"; return 0
    elif [[ "$want" =~ ^([0-9]+)$ ]]; then
      major="${BASH_REMATCH[1]}"
      # try to find best local for this major
      local best
      best=$(docker images --format '{{.Repository}}:{{.Tag}}' nvidia/cuda 2>/dev/null | \
             grep -E "^nvidia/cuda:${major}\\.[0-9]+\\.[0-9]+-runtime-ubuntu22\.04$" | \
             sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.[0-9]+)-runtime-ubuntu22\.04$#\1#g' | \
             sort -V | tail -n1 || true)
      if [[ -n "$best" ]]; then
        echo "nvidia/cuda:${best}-runtime-ubuntu22.04"; return 0
      fi
      echo "nvidia/cuda:${major}.2.2-runtime-ubuntu22.04"; return 0
    else
      # invalid format, fallback to default
      echo "nvidia/cuda:12.2.2-runtime-ubuntu22.04"; return 0
    fi
  }
  local base_image
  base_image=$(resolve_cuda_base_tag "$cuda_ver_local")
  echo
  echo "🔧 Preparing one-click GPU bundle build"
  echo "   CUDA runtime base: ${base_image}"
  echo "   Bundle tag       : ${date_tag}"
  # 1) Ensure NVIDIA base image (skip pull if local)
  if ! pull_base_image "$base_image"; then
    # try once more with default if resolution failed
    if ! pull_base_image "nvidia/cuda:12.2.2-runtime-ubuntu22.04"; then
      return 1
    else
      base_image="nvidia/cuda:12.2.2-runtime-ubuntu22.04"
    fi
  fi
  # 2) Build latest argus-agent from source
  echo ""
  echo "🛠  Building argus-agent from src/agent"
  pushd "$root/src/agent" >/dev/null
  if ! bash scripts/build_binary.sh; then
    echo "❌ argus-agent build failed" >&2
    popd >/dev/null
    return 1
  fi
  if [[ ! -f "dist/argus-agent" ]]; then
    echo "❌ argus-agent binary missing after build" >&2
    popd >/dev/null
    return 1
  fi
  popd >/dev/null
  # 3) Inject agent into all-in-one-full plugin and package artifact
  local aio_root="$root/src/metric/client-plugins/all-in-one-full"
  local agent_bin_src="$root/src/agent/dist/argus-agent"
  local agent_bin_dst="$aio_root/plugins/argus-agent/bin/argus-agent"
  echo ""
  echo "📦 Updating all-in-one-full agent binary → $agent_bin_dst"
  cp -f "$agent_bin_src" "$agent_bin_dst"
  chmod +x "$agent_bin_dst" || true
  pushd "$aio_root" >/dev/null
  local prev_version
  prev_version="$(cat config/VERSION 2>/dev/null || echo "1.0.0")"
  local use_version="$prev_version"
  if [[ -n "$client_semver" ]]; then
    echo "${client_semver}" > config/VERSION
    use_version="$client_semver"
  fi
  echo "   Packaging all-in-one-full artifact version: $use_version"
  if ! bash scripts/package_artifact.sh --force; then
    echo "❌ package_artifact.sh failed" >&2
    # restore VERSION if changed
    if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
    popd >/dev/null
    return 1
  fi
  local artifact_dir="$aio_root/artifact/$use_version"
  local artifact_tar
  artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  if [[ -z "$artifact_tar" ]]; then
    echo "   No argus-metric_*.tar.gz found; invoking publish_artifact.sh to assemble..."
    local owner="$(id -u):$(id -g)"
    if ! bash scripts/publish_artifact.sh "$use_version" --output-dir "$artifact_dir" --owner "$owner"; then
      echo "❌ publish_artifact.sh failed" >&2
      if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
      popd >/dev/null
      return 1
    fi
    artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  fi
  if [[ -z "$artifact_tar" ]]; then
    echo "❌ artifact tar not found under $artifact_dir" >&2
    if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
    popd >/dev/null
    return 1
  fi
  # restore VERSION if changed (keep filesystem clean)
  if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
  popd >/dev/null
  # 4) Stage docker build context
  local bundle_ctx="$root/src/bundle/gpu-node-bundle/.build-$date_tag"
  echo ""
  echo "🧰 Staging docker build context: $bundle_ctx"
  rm -rf "$bundle_ctx"
  mkdir -p "$bundle_ctx/bundle" "$bundle_ctx/private"
  cp "$root/src/bundle/gpu-node-bundle/Dockerfile" "$bundle_ctx/"
  cp "$root/src/bundle/gpu-node-bundle/node-bootstrap.sh" "$bundle_ctx/"
  cp "$root/src/bundle/gpu-node-bundle/health-watcher.sh" "$bundle_ctx/"
  # bundle tar
  cp "$artifact_tar" "$bundle_ctx/bundle/"
  # offline fluent-bit assets (optional but useful)
  if [[ -d "$root/src/log/fluent-bit/build/etc" ]]; then
    cp -r "$root/src/log/fluent-bit/build/etc" "$bundle_ctx/private/"
  fi
  if [[ -d "$root/src/log/fluent-bit/build/packages" ]]; then
    cp -r "$root/src/log/fluent-bit/build/packages" "$bundle_ctx/private/"
  fi
  if [[ -f "$root/src/log/fluent-bit/build/start-fluent-bit.sh" ]]; then
    cp "$root/src/log/fluent-bit/build/start-fluent-bit.sh" "$bundle_ctx/private/"
  fi
  # 5) Build the final bundle image (directly from NVIDIA base)
  local image_tag="argus-sys-metric-test-node-bundle-gpu:${date_tag}"
  echo ""
  echo "🔄 Building GPU Bundle image"
  if build_image "GPU Bundle" "$bundle_ctx/Dockerfile" "$image_tag" "$bundle_ctx" \
      --build-arg CUDA_VER="$(echo "$base_image" | sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.[0-9]+)-runtime-ubuntu22\.04$#\1#')" \
      --build-arg CLIENT_VER="$use_version" \
      --build-arg BUNDLE_DATE="$date_tag"; then
    images_built+=("$image_tag")
    # In non-pkg mode, also tag latest for convenience
    if [[ "${ARGUS_PKG_BUILD:-0}" != "1" ]]; then
      docker tag "$image_tag" argus-sys-metric-test-node-bundle-gpu:latest >/dev/null 2>&1 || true
    fi
    return 0
  else
    return 1
  fi
 }
 # Tag helper: ensure :<date_tag> exists for a list of repos
 ensure_version_tags() {
  local date_tag="$1"; shift
  local repos=("$@")
  for repo in "${repos[@]}"; do
    if docker image inspect "$repo:$date_tag" >/dev/null 2>&1; then
      :
    elif docker image inspect "$repo:latest" >/dev/null 2>&1; then
      docker tag "$repo:latest" "$repo:$date_tag" || true
    else
      echo "❌ missing image for tagging: $repo (need :latest or :$date_tag)" >&2
      return 1
    fi
  done
  return 0
 }
 # Build server package after images are built
 build_server_pkg_bundle() {
  local date_tag="$1"
  if [[ -z "$date_tag" ]]; then
    echo "❌ server_pkg requires --version YYYYMMDD" >&2
    return 1
  fi
  local repos=(
    argus-bind9 argus-master argus-elasticsearch argus-kibana \
    argus-metric-ftp argus-metric-prometheus argus-metric-grafana \
    argus-alertmanager argus-web-frontend argus-web-proxy
  )
  echo ""
  echo "🔖 Verifying server images with :$date_tag and collecting digests"
  for repo in "${repos[@]}"; do
    if ! docker image inspect "$repo:$date_tag" >/dev/null 2>&1; then
      echo "❌ required image missing: $repo:$date_tag (build phase should have produced it)" >&2
      return 1
    fi
  done
  # Optional: show digests
  for repo in "${repos[@]}"; do
    local digest
    digest=$(docker images --digests --format '{{.Repository}}:{{.Tag}} {{.Digest}}' | awk -v r="$repo:$date_tag" '$1==r{print $2}' | head -n1)
    printf '   • %s@%s\n' "$repo:$date_tag" "${digest:-<none>}"
  done
  echo ""
  echo "📦 Building server package via deployment_new/build/make_server_package.sh --version $date_tag"
  if ! "$root/deployment_new/build/make_server_package.sh" --version "$date_tag"; then
    echo "❌ make_server_package.sh failed" >&2
    return 1
  fi
  return 0
 }
 # Build client package: ensure gpu bundle image exists, then package client_gpu
 build_client_pkg_bundle() {
  local date_tag="$1"
  local semver="$2"
  local cuda="$3"
  if [[ -z "$date_tag" ]]; then
    echo "❌ client_pkg requires --version YYYYMMDD" >&2
    return 1
  fi
  local bundle_tag="argus-sys-metric-test-node-bundle-gpu:${date_tag}"
  if ! docker image inspect "$bundle_tag" >/dev/null 2>&1; then
    echo ""
    echo "🧩 GPU bundle image $bundle_tag missing; building it first..."
    ARGUS_PKG_BUILD=1
    export ARGUS_PKG_BUILD
    if ! build_gpu_bundle_image "$date_tag" "$cuda" "$semver"; then
      return 1
    fi
  else
    echo ""
    echo "✅ Using existing GPU bundle image: $bundle_tag"
  fi
  echo ""
  echo "📦 Building client GPU package via deployment_new/build/make_client_gpu_package.sh --version $date_tag --image $bundle_tag"
  if ! "$root/deployment_new/build/make_client_gpu_package.sh" --version "$date_tag" --image "$bundle_tag"; then
    echo "❌ make_client_gpu_package.sh failed" >&2
    return 1
  fi
  return 0
 }
 # Build CPU bundle image directly FROM ubuntu:22.04 (no intermediate base)
 build_cpu_bundle_image() {
  local date_tag="$1"         # e.g. 20251113
  local client_ver_in="$2"    # semver like 1.43.0 (optional)
  local want_tag_latest="$3"   # true/false
  if [[ -z "$date_tag" ]]; then
    echo "❌ cpu_bundle requires --version YYYYMMDD" >&2
    return 1
  fi
  echo ""
  echo "🔧 Preparing one-click CPU bundle build"
  echo "   Base: ubuntu:22.04"
  echo "   Bundle tag: ${date_tag}"
  # 1) Build latest argus-agent from source
  echo ""
  echo "🛠  Building argus-agent from src/agent"
  pushd "$root/src/agent" >/dev/null
  if ! bash scripts/build_binary.sh; then
    echo "❌ argus-agent build failed" >&2
    popd >/dev/null
    return 1
  fi
  if [[ ! -f "dist/argus-agent" ]]; then
    echo "❌ argus-agent binary missing after build" >&2
    popd >/dev/null
    return 1
  fi
  popd >/dev/null
  # 2) Inject agent into all-in-one-full plugin and package artifact
  local aio_root="$root/src/metric/client-plugins/all-in-one-full"
  local agent_bin_src="$root/src/agent/dist/argus-agent"
  local agent_bin_dst="$aio_root/plugins/argus-agent/bin/argus-agent"
  echo ""
  echo "📦 Updating all-in-one-full agent binary → $agent_bin_dst"
  cp -f "$agent_bin_src" "$agent_bin_dst"
  chmod +x "$agent_bin_dst" || true
  pushd "$aio_root" >/dev/null
  local prev_version use_version
  prev_version="$(cat config/VERSION 2>/dev/null || echo "1.0.0")"
  use_version="$prev_version"
  if [[ -n "$client_ver_in" ]]; then
    echo "$client_ver_in" > config/VERSION
    use_version="$client_ver_in"
  fi
  echo "   Packaging all-in-one-full artifact: version=$use_version"
  if ! bash scripts/package_artifact.sh --force; then
    echo "❌ package_artifact.sh failed" >&2
    [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
    popd >/dev/null
    return 1
  fi
  local artifact_dir="$aio_root/artifact/$use_version"
  local artifact_tar
  artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  if [[ -z "$artifact_tar" ]]; then
    echo "   No argus-metric_*.tar.gz found; invoking publish_artifact.sh ..."
    local owner="$(id -u):$(id -g)"
    if ! bash scripts/publish_artifact.sh "$use_version" --output-dir "$artifact_dir" --owner "$owner"; then
      echo "❌ publish_artifact.sh failed" >&2
      [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
      popd >/dev/null
      return 1
    fi
    artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
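  fi
  # Fail early if no artifact tarball exists (mirrors the check in build_gpu_bundle_image)
  if [[ -z "$artifact_tar" ]]; then
    echo "❌ artifact tar not found under $artifact_dir" >&2
    [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
    popd >/dev/null
    return 1
  fi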
  [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
  popd >/dev/null
  # 3) Stage docker build context
  local bundle_ctx="$root/src/bundle/cpu-node-bundle/.build-$date_tag"
  echo ""
  echo "🧰 Staging docker build context: $bundle_ctx"
  rm -rf "$bundle_ctx"
  mkdir -p "$bundle_ctx/bundle" "$bundle_ctx/private"
  cp "$root/src/bundle/cpu-node-bundle/Dockerfile" "$bundle_ctx/"
  cp "$root/src/bundle/cpu-node-bundle/node-bootstrap.sh" "$bundle_ctx/"
  cp "$root/src/bundle/cpu-node-bundle/health-watcher.sh" "$bundle_ctx/"
  # bundle tar
  cp "$artifact_tar" "$bundle_ctx/bundle/"
  # offline fluent-bit assets
  if [[ -d "$root/src/log/fluent-bit/build/etc" ]]; then
    cp -r "$root/src/log/fluent-bit/build/etc" "$bundle_ctx/private/"
  fi
  if [[ -d "$root/src/log/fluent-bit/build/packages" ]]; then
    cp -r "$root/src/log/fluent-bit/build/packages" "$bundle_ctx/private/"
  fi
  if [[ -f "$root/src/log/fluent-bit/build/start-fluent-bit.sh" ]]; then
    cp "$root/src/log/fluent-bit/build/start-fluent-bit.sh" "$bundle_ctx/private/"
  fi
  # 4) Build final bundle image
  local image_tag="argus-sys-metric-test-node-bundle:${date_tag}"
  echo ""
  echo "🔄 Building CPU Bundle image"
  if build_image "CPU Bundle" "$bundle_ctx/Dockerfile" "$image_tag" "$bundle_ctx"; then
    images_built+=("$image_tag")
    if [[ "$want_tag_latest" == "true" ]]; then
      docker tag "$image_tag" argus-sys-metric-test-node-bundle:latest >/dev/null 2>&1 || true
    fi
    return 0
  else
    return 1
  fi
 }
 if [[ "$build_core" == true ]]; then
  if build_image "Elasticsearch" "src/log/elasticsearch/build/Dockerfile" "argus-elasticsearch:${DEFAULT_IMAGE_TAG}"; then
    images_built+=("argus-elasticsearch:${DEFAULT_IMAGE_TAG}")
  else
    build_failed=true
  fi
  echo ""
  if build_image "Kibana" "src/log/kibana/build/Dockerfile" "argus-kibana:${DEFAULT_IMAGE_TAG}"; then
    images_built+=("argus-kibana:${DEFAULT_IMAGE_TAG}")
  else
    build_failed=true
  fi
  echo ""
  if build_image "BIND9" "src/bind/build/Dockerfile" "argus-bind9:${DEFAULT_IMAGE_TAG}"; then
    images_built+=("argus-bind9:${DEFAULT_IMAGE_TAG}")
  else
    build_failed=true
  fi
 fi
 echo ""
 if [[ "$build_master" == true ]]; then
  echo ""
  echo "🔄 Building Master image..."
  pushd "$master_root" >/dev/null
  master_args=("--tag" "argus-master:${DEFAULT_IMAGE_TAG}")
  if [[ "$use_intranet" == true ]]; then
    master_args+=("--intranet")
  fi
  if [[ "$build_master_offline" == true ]]; then
    master_args+=("--offline")
  fi
  if [[ "$no_cache" == true ]]; then
    master_args+=("--no-cache")
  fi
  if ./scripts/build_images.sh "${master_args[@]}"; then
    if [[ "$build_master_offline" == true ]]; then
      images_built+=("argus-master:offline")
    else
      images_built+=("argus-master:${DEFAULT_IMAGE_TAG}")
    fi
  else
    build_failed=true
  fi
  popd >/dev/null
 fi
 if [[ "$build_metric" == true ]]; then
  echo ""
  echo "Building Metric module images..."
  metric_base_images=(
    "ubuntu:22.04"
    "ubuntu/prometheus:3-24.04_stable"
    "grafana/grafana:11.1.0"
  )
  for base_image in "${metric_base_images[@]}"; do
    if ! pull_base_image "$base_image"; then
      build_failed=true
    fi
  done
  metric_builds=(
    "Metric FTP|src/metric/ftp/build/Dockerfile|argus-metric-ftp:${DEFAULT_IMAGE_TAG}|src/metric/ftp/build"
    "Metric Prometheus|src/metric/prometheus/build/Dockerfile|argus-metric-prometheus:${DEFAULT_IMAGE_TAG}|src/metric/prometheus/build"
    "Metric Grafana|src/metric/grafana/build/Dockerfile|argus-metric-grafana:${DEFAULT_IMAGE_TAG}|src/metric/grafana/build"
    "Metric Prometheus Targets Updater|src/metric/prometheus/build/Dockerfile.targets-updater|argus-metric-prometheus-targets-updater:${DEFAULT_IMAGE_TAG}|src/metric/prometheus/build"
  )
  for build_spec in "${metric_builds[@]}"; do
    IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
    if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
      images_built+=("$image_tag")
    else
      build_failed=true
    fi
    echo ""
  done
 fi
 # =======================================
 # Sys (system tests) node images
 # =======================================
 if [[ "$build_sys" == true ]]; then
  echo ""
  echo "Building Sys node images..."
  sys_base_images=(
    "ubuntu:22.04"
    "nvidia/cuda:12.2.2-runtime-ubuntu22.04"
  )
  for base_image in "${sys_base_images[@]}"; do
    if ! pull_base_image "$base_image"; then
      build_failed=true
    fi
  done
  sys_builds=(
    "Sys Node|src/sys/build/node/Dockerfile|argus-sys-node:latest|."
    "Sys Metric Test Node|src/sys/build/test-node/Dockerfile|argus-sys-metric-test-node:latest|."
    "Sys Metric Test GPU Node|src/sys/build/test-gpu-node/Dockerfile|argus-sys-metric-test-gpu-node:latest|."
  )
  for build_spec in "${sys_builds[@]}"; do
    IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
    if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
      images_built+=("$image_tag")
    else
      build_failed=true
    fi
    echo ""
  done
 fi
 # =======================================
 # Web & Alert module images
 # =======================================
 if [[ "$build_web" == true || "$build_alert" == true ]]; then
  echo ""
  echo "Building Web and Alert module images..."
  # Pre-pull commonly used base images for stability
  web_alert_base_images=(
    "node:20"
    "ubuntu:24.04"
  )
  for base_image in "${web_alert_base_images[@]}"; do
    if ! pull_base_image "$base_image"; then
      build_failed=true
    fi
  done
  if [[ "$build_web" == true ]]; then
    web_builds=(
      "Web Frontend|src/web/build_tools/frontend/Dockerfile|argus-web-frontend:${DEFAULT_IMAGE_TAG}|."
      "Web Proxy|src/web/build_tools/proxy/Dockerfile|argus-web-proxy:${DEFAULT_IMAGE_TAG}|."
    )
    for build_spec in "${web_builds[@]}"; do
      IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
      if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
        images_built+=("$image_tag")
      else
        build_failed=true
      fi
      echo ""
    done
  fi
  if [[ "$build_alert" == true ]]; then
    alert_builds=(
      "Alertmanager|src/alert/alertmanager/build/Dockerfile|argus-alertmanager:${DEFAULT_IMAGE_TAG}|."
    )
    for build_spec in "${alert_builds[@]}"; do
      IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
      if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
        images_built+=("$image_tag")
      else
        build_failed=true
      fi
      echo ""
    done
  fi
 fi
 # =======================================
 # One-click GPU bundle (direct NVIDIA base)
 # =======================================
 if [[ "$build_gpu_bundle" == true ]]; then
  echo ""
  echo "Building one-click GPU bundle image..."
  if ! build_gpu_bundle_image "$bundle_date" "$cuda_ver" "$client_semver"; then
    build_failed=true
  fi
 fi
 # =======================================
 # One-click CPU bundle (from ubuntu:22.04)
 # =======================================
 if [[ "$build_cpu_bundle" == true ]]; then
  echo ""
  echo "Building one-click CPU bundle image..."
  if ! build_cpu_bundle_image "${bundle_date}" "${client_semver}" "${tag_latest}"; then
    build_failed=true
  fi
 fi
 # =======================================
 # One-click Server/Client packaging
 # =======================================
 if [[ "$build_server_pkg" == true ]]; then
  echo ""
  echo "🧳 Building one-click Server package..."
  if ! build_server_pkg_bundle "${bundle_date}"; then
    build_failed=true
  fi
 fi
 if [[ "$build_client_pkg" == true ]]; then
  echo ""
  echo "🧳 Building one-click Client-GPU package..."
  if ! build_client_pkg_bundle "${bundle_date}" "${client_semver}" "${cuda_ver}"; then
    build_failed=true
  fi
 fi
 echo "======================================="
 echo "📦 Build Summary"
 echo "======================================="
 if [[ ${#images_built[@]} -gt 0 ]]; then
  echo "✅ Successfully built images:"
  for image in "${images_built[@]}"; do
    echo "   • $image"
  done
 fi
 if [[ "$build_failed" == true ]]; then
  echo ""
  echo "❌ Some images failed to build. Please check the errors above."
  exit 1
 fi
 if [[ "$use_intranet" == true ]]; then
  echo ""
  echo "🌐 Built with intranet mirror configuration"
 fi
 if [[ "$build_master_offline" == true ]]; then
  echo ""
  echo "🧳 Master offline wheels extracted to $master_offline_dir"
 fi
 echo ""
 echo "🚀 Next steps:"
 echo "   ./build/save_images.sh --compress          # export images"
 echo "   cd src/master/tests && MASTER_IMAGE_TAG=argus-master:offline ./scripts/00_e2e_test.sh"
 echo ""
--- a/build/save_images.sh
+++ b/build/save_images.sh
@ -0,0 +1,229 @@
 #!/usr/bin/env bash
 set -euo pipefail
 # Help text
 show_help() {
    cat << EOF
 ARGUS Unified Build System - Image Export Tool
 Usage: $0 [OPTIONS]
 Options:
  --compress    Compress exported images with gzip
  -h, --help    Show this help message
 Examples:
  $0                # Export all images without compression
  $0 --compress     # Export all images with gzip compression
 EOF
 }
 # Parse command-line arguments
 use_compression=false
 while [[ $# -gt 0 ]]; do
    case $1 in
        --compress)
            use_compression=true
            shift
            ;;
        -h|--help)
            show_help
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            show_help
            exit 1
            ;;
    esac
 done
 # Resolve the project root directory
 root="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 cd "$root"
 # Create the image output directory
 images_dir="$root/images"
 mkdir -p "$images_dir"
 echo "======================================="
 echo "ARGUS Unified Build System - Image Export"
 echo "======================================="
 echo ""
 if [[ "$use_compression" == true ]]; then
    echo "🗜️  Mode: With gzip compression"
 else
    echo "📦 Mode: No compression"
 fi
 echo "📁 Output directory: $images_dir"
 echo ""
 # Images to export (image reference -> output file name)
 declare -A images=(
    ["argus-elasticsearch:latest"]="argus-elasticsearch-latest.tar"
    ["argus-kibana:latest"]="argus-kibana-latest.tar"
    ["argus-bind9:latest"]="argus-bind9-latest.tar"
    ["argus-master:offline"]="argus-master-offline.tar"
    ["argus-metric-ftp:latest"]="argus-metric-ftp-latest.tar"
    ["argus-metric-prometheus:latest"]="argus-metric-prometheus-latest.tar"
    ["argus-metric-grafana:latest"]="argus-metric-grafana-latest.tar"
    ["argus-web-frontend:latest"]="argus-web-frontend-latest.tar"
    ["argus-web-proxy:latest"]="argus-web-proxy-latest.tar"
    ["argus-alertmanager:latest"]="argus-alertmanager-latest.tar"
 )
 # Check whether an image exists locally
 check_image() {
    local image_name="$1"
    # inspect avoids regex matching on "." in the name and a pipeline that pipefail could trip
    if docker image inspect "$image_name" >/dev/null 2>&1; then
        echo "✅ Image found: $image_name"
        return 0
    else
        echo "❌ Image not found: $image_name"
        return 1
    fi
 }
 # Show size, age, and ID for an image
 show_image_info() {
    local image_name="$1"
    echo "📋 Image info for $image_name:"
    docker images "$image_name" --format "   Size: {{.Size}}, Created: {{.CreatedSince}}, ID: {{.ID}}"
 }
 # Save one image to the output directory
 save_image() {
    local image_name="$1"
    local output_file="$2"
    local output_path="$images_dir/$output_file"
    echo "🔄 Saving $image_name to $output_file..."
    # Remove a stale export file if one exists
    if [[ -f "$output_path" ]]; then
        echo "   Removing existing file: $output_file"
        rm "$output_path"
    fi
    if [[ "$use_compression" == true && -f "$output_path.gz" ]]; then
        echo "   Removing existing compressed file: $output_file.gz"
        rm "$output_path.gz"
    fi
    # Save the image
    docker save "$image_name" -o "$output_path"
    if [[ "$use_compression" == true ]]; then
        echo "   Compressing with gzip..."
        gzip "$output_path"
        output_path="$output_path.gz"
        output_file="$output_file.gz"
    fi
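    # Report the size of the exported file (declared separately so a du failure is not masked by "local")
    local file_size
    file_size=$(du -h "$output_path" | cut -f1)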
    echo "✅ Saved successfully: $output_file ($file_size)"
 }
 echo "🔍 Checking for ARGUS images..."
 echo ""
 # Check every image in the list
 available_images=()
 missing_images=()
 for image_name in "${!images[@]}"; do
    if check_image "$image_name"; then
        show_image_info "$image_name"
        available_images+=("$image_name")
    else
        missing_images+=("$image_name")
    fi
    echo ""
 done
 # If no image is available, ask the user to build first
 if [[ ${#available_images[@]} -eq 0 ]]; then
    echo "❌ No ARGUS images found to export."
    echo ""
    echo "🔧 Please build the images first with:"
    echo "   ./build/build_images.sh"
    exit 1
 fi
 # List the missing images
 if [[ ${#missing_images[@]} -gt 0 ]]; then
    echo "⚠️  Missing images (will be skipped):"
    for image_name in "${missing_images[@]}"; do
        echo "   • $image_name"
    done
    echo ""
 fi
 echo "💾 Starting image export process..."
 echo ""
 # Save every available image
 exported_files=()
 for image_name in "${available_images[@]}"; do
    output_file="${images[$image_name]}"
    save_image "$image_name" "$output_file"
    if [[ "$use_compression" == true ]]; then
        exported_files+=("$output_file.gz")
    else
        exported_files+=("$output_file")
    fi
    echo ""
 done
 echo "======================================="
 echo "📦 Export Summary"
 echo "======================================="
 # List the exported files
 echo "📁 Exported files in $images_dir:"
 total_size=0
 for file in "${exported_files[@]}"; do
    full_path="$images_dir/$file"
    if [[ -f "$full_path" ]]; then
        size=$(du -h "$full_path" | cut -f1)
        size_bytes=$(du -b "$full_path" | cut -f1)
        total_size=$((total_size + size_bytes))
        echo "   ✅ $file ($size)"
    fi
 done
 # Show the total size
 if [[ $total_size -gt 0 ]]; then
    total_size_human=$(numfmt --to=iec --suffix=B $total_size)
    echo ""
    echo "📊 Total size: $total_size_human"
 fi
 echo ""
 echo "🚀 Usage instructions:"
 echo "   To load these images on another system:"
 if [[ "$use_compression" == true ]]; then
    for file in "${exported_files[@]}"; do
        if [[ -f "$images_dir/$file" ]]; then
            base_name="${file%.gz}"
            echo "     gunzip $file && docker load -i $base_name"
        fi
    done
 else
    for file in "${exported_files[@]}"; do
        if [[ -f "$images_dir/$file" ]]; then
            echo "     docker load -i $file"
        fi
    done
 fi
 echo ""
 echo "✅ Image export completed successfully!"
 echo ""
--- a/configs/.gitignore
+++ b/configs/.gitignore
@ -0,0 +1,2 @@
 # Local overrides for build user/group settings
 build_user.local.conf
--- a/configs/build_user.conf
+++ b/configs/build_user.conf
@ -0,0 +1,6 @@
 # Default build-time UID/GID for Argus images
 # Override by creating configs/build_user.local.conf with the same format.
 # Syntax: KEY=VALUE, supports UID/GID only. Whitespace and lines starting with # are ignored.
 UID=2133
 GID=2015
--- a/configs/build_user.pkg.conf
+++ b/configs/build_user.pkg.conf
@ -0,0 +1,6 @@
 # Default build-time UID/GID for Argus images
 # Override by creating configs/build_user.local.conf with the same format.
 # Syntax: KEY=VALUE, supports UID/GID only. Whitespace and lines starting with # are ignored.
 UID=2133
 GID=2015
--- a/deployment_new/.gitignore
+++ b/deployment_new/.gitignore
@ -0,0 +1 @@
 artifact/
--- a/deployment_new/README.md
+++ b/deployment_new/README.md
@ -0,0 +1,14 @@
 # deployment_new
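 This directory holds the new deployment packaging and delivery implementation (it does not affect the existing `deployment/`).
 Milestone M1 (current implementation)
 - `build/make_server_package.sh`: builds the Server package (per-service image tar.gz files, compose, .env.example, docs, private skeleton, manifest/checksums, and the final tar.gz).
 - `build/make_client_gpu_package.sh`: builds the Client‑GPU package (GPU bundle image tar.gz, busybox.tar, compose, .env.example, docs, private skeleton, manifest/checksums, and the final tar.gz).
 Templates
 - `templates/server/compose/docker-compose.yml`: deployment-only; images default to the `:${PKG_VERSION}` version tag and can be overridden through `.env`.
 - `templates/client_gpu/compose/docker-compose.yml`: for GPU nodes; uses the `:${PKG_VERSION}` version tag.
 Note: M1 only produces the installation packages and does not deliver finished install scripts; install/operations scripts are planned for M2 and will be included in the packages.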
--- a/deployment_new/build/common.sh
+++ b/deployment_new/build/common.sh
@ -0,0 +1,33 @@
 #!/usr/bin/env bash
 set -euo pipefail
 log() { echo -e "\033[0;34m[INFO]\033[0m $*"; }
 warn() { echo -e "\033[1;33m[WARN]\033[0m $*"; }
 err()  { echo -e "\033[0;31m[ERR ]\033[0m $*" >&2; }
 require_cmd() {
  local miss=0
  for c in "$@"; do
    if ! command -v "$c" >/dev/null 2>&1; then err "missing command: $c"; miss=1; fi
  done
  [[ $miss -eq 0 ]]
 }
 today_version() { date +%Y%m%d; }
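 checksum_dir() {
  local dir="$1"; local out="$2"; : > "$out";
  # Skip the checksum file itself when it sits at the top of "$dir": it is hashed while still being written
  (cd "$dir" && find . -type f ! -path "./$(basename "$out")" -print0 | sort -z | xargs -0 sha256sum) >> "$out"
 }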
 make_dir() { mkdir -p "$1"; }
 copy_tree() {
  local src="$1" dst="$2"; rsync -a --delete "$src/" "$dst/" 2>/dev/null || cp -r "$src/." "$dst/";
 }
 gen_manifest() {
  local root="$1"; local out="$2"; : > "$out";
  (cd "$root" && find . -maxdepth 4 -type f -printf "%p\n" | sort) >> "$out"
 }
--- a/deployment_new/build/make_client_gpu_package.sh
+++ b/deployment_new/build/make_client_gpu_package.sh
@ -0,0 +1,134 @@
 #!/usr/bin/env bash
 set -euo pipefail
 # Make client GPU package (versioned gpu bundle image, compose, env, docs, busybox)
 ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
 TEMPL_DIR="$ROOT_DIR/deployment_new/templates/client_gpu"
 ART_ROOT="$ROOT_DIR/deployment_new/artifact/client_gpu"
 # Use deployment_new local common helpers
 COMMON_SH="$ROOT_DIR/deployment_new/build/common.sh"
 . "$COMMON_SH"
 usage(){ cat <<EOF
 Build Client-GPU Package (deployment_new)
 Usage: $(basename "$0") --version YYYYMMDD [--image IMAGE[:TAG]]
 Defaults:
  image = argus-sys-metric-test-node-bundle-gpu:latest
 Outputs: deployment_new/artifact/client_gpu/<YYYYMMDD>/ and client_gpu_YYYYMMDD.tar.gz
 EOF
 }
 VERSION=""
 IMAGE="argus-sys-metric-test-node-bundle-gpu:latest"
 while [[ $# -gt 0 ]]; do
  case "$1" in
    --version) VERSION="$2"; shift 2;;
    --image) IMAGE="$2"; shift 2;;
    -h|--help) usage; exit 0;;
    *) err "unknown arg: $1"; usage; exit 1;;
  esac
 done
 if [[ -z "$VERSION" ]]; then VERSION="$(today_version)"; fi
 require_cmd docker tar gzip
 STAGE="$(mktemp -d)"; trap 'rm -rf "$STAGE"' EXIT
 PKG_DIR="$ART_ROOT/$VERSION"
 mkdir -p "$PKG_DIR" "$STAGE/images" "$STAGE/compose" "$STAGE/docs" "$STAGE/scripts" "$STAGE/private/argus"
 # 1) Save GPU bundle image with version tag
 if ! docker image inspect "$IMAGE" >/dev/null 2>&1; then
  err "missing image: $IMAGE"; exit 1; fi
 REPO="${IMAGE%%:*}"; TAG_VER="$REPO:$VERSION"
 docker tag "$IMAGE" "$TAG_VER"
 out_tar="$STAGE/images/${REPO//\//-}-$VERSION.tar"
 docker save -o "$out_tar" "$TAG_VER"
 gzip -f "$out_tar"
 # 2) Busybox tar for connectivity/overlay warmup (prefer local template; fallback to docker save)
 BB_SRC="$TEMPL_DIR/images/busybox.tar"
 if [[ -f "$BB_SRC" ]]; then
  cp "$BB_SRC" "$STAGE/images/busybox.tar"
 else
  if docker image inspect busybox:latest >/dev/null 2>&1 || docker pull busybox:latest >/dev/null 2>&1; then
    docker save -o "$STAGE/images/busybox.tar" busybox:latest
    log "Included busybox from local docker daemon"
  else
    warn "busybox image not found and cannot pull; skipping busybox.tar"
  fi
 fi
 # 3) Compose + env template and docs/scripts from templates
 cp "$TEMPL_DIR/compose/docker-compose.yml" "$STAGE/compose/docker-compose.yml"
 ENV_EX="$STAGE/compose/.env.example"
 cat >"$ENV_EX" <<EOF
 # Generated by make_client_gpu_package.sh
 PKG_VERSION=$VERSION
 NODE_GPU_BUNDLE_IMAGE_TAG=${REPO}:${VERSION}
 # Compose project name (isolation from server stack)
 COMPOSE_PROJECT_NAME=argus-client
 # Required (no defaults). Must be filled before install.
 AGENT_ENV=
 AGENT_USER=
 AGENT_INSTANCE=
 GPU_NODE_HOSTNAME=
 # From cluster-info.env (server package output)
 BINDIP=
 FTPIP=
 SWARM_MANAGER_ADDR=
 SWARM_JOIN_TOKEN_WORKER=
 SWARM_JOIN_TOKEN_MANAGER=
 # FTP defaults
 FTP_USER=ftpuser
 FTP_PASSWORD=NASPlab1234!
 EOF
 # 4) Docs from deployment_new templates
 CLIENT_DOC_SRC="$TEMPL_DIR/docs"
 if [[ -d "$CLIENT_DOC_SRC" ]]; then
  rsync -a "$CLIENT_DOC_SRC/" "$STAGE/docs/" >/dev/null 2>&1 || cp -r "$CLIENT_DOC_SRC/." "$STAGE/docs/"
 fi
 # Placeholder scripts (will be implemented in M2)
 cat >"$STAGE/scripts/README.md" <<'EOF'
 # Client-GPU Scripts (Placeholder)
 This directory will gain the following in M2:
 - config.sh / install.sh
 It is currently a placeholder so the package layout can be reviewed.
 EOF
 # 5) Scripts (from deployment_new templates) and Private skeleton
 SCRIPTS_SRC="$TEMPL_DIR/scripts"
 if [[ -d "$SCRIPTS_SRC" ]]; then
  rsync -a "$SCRIPTS_SRC/" "$STAGE/scripts/" >/dev/null 2>&1 || cp -r "$SCRIPTS_SRC/." "$STAGE/scripts/"
  find "$STAGE/scripts" -type f -name '*.sh' -exec chmod +x {} + 2>/dev/null || true
 fi
 mkdir -p "$STAGE/private/argus/agent"
 # 6) Manifest & checksums
 gen_manifest "$STAGE" "$STAGE/manifest.txt"
 checksum_dir "$STAGE" "$STAGE/checksums.txt"
 # 7) Move to artifact dir and pack
 mkdir -p "$PKG_DIR"
 rsync -a "$STAGE/" "$PKG_DIR/" >/dev/null 2>&1 || cp -r "$STAGE/." "$PKG_DIR/"
 OUT_TAR_DIR="$(dirname "$PKG_DIR")"
 OUT_TAR="$OUT_TAR_DIR/client_gpu_${VERSION}.tar.gz"
 log "Creating tarball: $OUT_TAR"
 (cd "$PKG_DIR/.." && tar -czf "$OUT_TAR" "$(basename "$PKG_DIR")")
 log "Client-GPU package ready: $PKG_DIR"
 echo "$OUT_TAR"
--- a/deployment_new/build/make_server_package.sh
+++ b/deployment_new/build/make_server_package.sh
@ -0,0 +1,169 @@
 #!/usr/bin/env bash
 set -euo pipefail
 # Make server deployment package (versioned, per-image tars, full compose, docs, skeleton)
 ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
 TEMPL_DIR="$ROOT_DIR/deployment_new/templates/server"
 ART_ROOT="$ROOT_DIR/deployment_new/artifact/server"
 # Use deployment_new local common helpers
 COMMON_SH="$ROOT_DIR/deployment_new/build/common.sh"
 . "$COMMON_SH"
 usage(){ cat <<EOF
 Build Server Deployment Package (deployment_new)
 Usage: $(basename "$0") --version YYYYMMDD
 Outputs: deployment_new/artifact/server/<YYYYMMDD>/ and server_YYYYMMDD.tar.gz
 EOF
 }
 VERSION=""
 while [[ $# -gt 0 ]]; do
  case "$1" in
    --version) VERSION="$2"; shift 2;;
    -h|--help) usage; exit 0;;
    *) err "unknown arg: $1"; usage; exit 1;;
  esac
 done
 if [[ -z "$VERSION" ]]; then VERSION="$(today_version)"; fi
 require_cmd docker tar gzip awk sed
 IMAGES=(
  argus-bind9
  argus-master
  argus-elasticsearch
  argus-kibana
  argus-metric-ftp
  argus-metric-prometheus
  argus-metric-grafana
  argus-alertmanager
  argus-web-frontend
  argus-web-proxy
 )
 STAGE="$(mktemp -d)"; trap 'rm -rf "$STAGE"' EXIT
 PKG_DIR="$ART_ROOT/$VERSION"
 mkdir -p "$PKG_DIR" "$STAGE/images" "$STAGE/compose" "$STAGE/docs" "$STAGE/scripts" "$STAGE/private/argus"
 # 1) Save per-image tars with version tag
 log "Tagging and saving images (version=$VERSION)"
 for repo in "${IMAGES[@]}"; do
  if ! docker image inspect "$repo:latest" >/dev/null 2>&1 && ! docker image inspect "$repo:$VERSION" >/dev/null 2>&1; then
    err "missing image: $repo (need :latest or :$VERSION)"; exit 1; fi
  if docker image inspect "$repo:$VERSION" >/dev/null 2>&1; then
    tag="$repo:$VERSION"
  else
    docker tag "$repo:latest" "$repo:$VERSION"
    tag="$repo:$VERSION"
  fi
  out_tar="$STAGE/images/${repo//\//-}-$VERSION.tar"
  docker save -o "$out_tar" "$tag"
  gzip -f "$out_tar"
 done
 # 2) Compose + env template
 cp "$TEMPL_DIR/compose/docker-compose.yml" "$STAGE/compose/docker-compose.yml"
 ENV_EX="$STAGE/compose/.env.example"
 cat >"$ENV_EX" <<EOF
 # Generated by make_server_package.sh
 PKG_VERSION=$VERSION
 # Image tags (can be overridden). Default to versioned tags
 BIND_IMAGE_TAG=argus-bind9:
 MASTER_IMAGE_TAG=argus-master:
 ES_IMAGE_TAG=argus-elasticsearch:
 KIBANA_IMAGE_TAG=argus-kibana:
 FTP_IMAGE_TAG=argus-metric-ftp:
 PROM_IMAGE_TAG=argus-metric-prometheus:
 GRAFANA_IMAGE_TAG=argus-metric-grafana:
 ALERT_IMAGE_TAG=argus-alertmanager:
 FRONT_IMAGE_TAG=argus-web-frontend:
 WEB_PROXY_IMAGE_TAG=argus-web-proxy:
 EOF
 sed -i "s#:\$#:${VERSION}#g" "$ENV_EX"
 # Ports and defaults (based on swarm_tests .env.example)
 cat >>"$ENV_EX" <<'EOF'
 # Host ports for server compose
 MASTER_PORT=32300
 ES_HTTP_PORT=9200
 KIBANA_PORT=5601
 PROMETHEUS_PORT=9090
 GRAFANA_PORT=3000
 ALERTMANAGER_PORT=9093
 WEB_PROXY_PORT_8080=8080
 WEB_PROXY_PORT_8081=8081
 WEB_PROXY_PORT_8082=8082
 WEB_PROXY_PORT_8083=8083
 WEB_PROXY_PORT_8084=8084
 WEB_PROXY_PORT_8085=8085
 # Overlay network name
 ARGUS_OVERLAY_NET=argus-sys-net
 # FTP defaults
 FTP_USER=ftpuser
 FTP_PASSWORD=NASPlab1234!
 # UID/GID for volume ownership
 ARGUS_BUILD_UID=2133
 ARGUS_BUILD_GID=2015
 # Compose project name (isolation from other stacks on same host)
 COMPOSE_PROJECT_NAME=argus-server
 EOF
 # 3) Docs (from deployment_new templates)
 DOCS_SRC="$TEMPL_DIR/docs"
 if [[ -d "$DOCS_SRC" ]]; then
  rsync -a "$DOCS_SRC/" "$STAGE/docs/" >/dev/null 2>&1 || cp -r "$DOCS_SRC/." "$STAGE/docs/"
 fi
 # 4) Scripts (from deployment_new templates)
 SCRIPTS_SRC="$TEMPL_DIR/scripts"
 if [[ -d "$SCRIPTS_SRC" ]]; then
  rsync -a "$SCRIPTS_SRC/" "$STAGE/scripts/" >/dev/null 2>&1 || cp -r "$SCRIPTS_SRC/." "$STAGE/scripts/"
  find "$STAGE/scripts" -type f -name '*.sh' -exec chmod +x {} + 2>/dev/null || true
 fi
 # 5) Private skeleton (minimum)
 mkdir -p \
  "$STAGE/private/argus/etc" \
  "$STAGE/private/argus/master" \
  "$STAGE/private/argus/metric/prometheus" \
  "$STAGE/private/argus/metric/prometheus/data" \
  "$STAGE/private/argus/metric/prometheus/rules" \
  "$STAGE/private/argus/metric/prometheus/targets" \
  "$STAGE/private/argus/metric/grafana" \
  "$STAGE/private/argus/metric/grafana/data" \
  "$STAGE/private/argus/metric/grafana/logs" \
  "$STAGE/private/argus/metric/grafana/plugins" \
  "$STAGE/private/argus/metric/grafana/provisioning/datasources" \
  "$STAGE/private/argus/metric/grafana/provisioning/dashboards" \
  "$STAGE/private/argus/metric/grafana/data/sessions" \
  "$STAGE/private/argus/metric/grafana/data/dashboards" \
  "$STAGE/private/argus/metric/grafana/config" \
  "$STAGE/private/argus/metric/ftp" \
  "$STAGE/private/argus/alert/alertmanager" \
  "$STAGE/private/argus/log/elasticsearch" \
  "$STAGE/private/argus/log/kibana"
 # 6) Manifest & checksums
 gen_manifest "$STAGE" "$STAGE/manifest.txt"
 checksum_dir "$STAGE" "$STAGE/checksums.txt"
 # 7) Move to artifact dir and pack
 mkdir -p "$PKG_DIR"
 rsync -a "$STAGE/" "$PKG_DIR/" >/dev/null 2>&1 || cp -r "$STAGE/." "$PKG_DIR/"
 OUT_TAR_DIR="$(dirname "$PKG_DIR")"
 OUT_TAR="$OUT_TAR_DIR/server_${VERSION}.tar.gz"
 log "Creating tarball: $OUT_TAR"
 (cd "$PKG_DIR/.." && tar -czf "$OUT_TAR" "$(basename "$PKG_DIR")")
 log "Server package ready: $PKG_DIR"
 echo "$OUT_TAR"
--- a/deployment_new/templates/client_gpu/compose/docker-compose.yml
+++ b/deployment_new/templates/client_gpu/compose/docker-compose.yml
@ -0,0 +1,38 @@
 version: "3.8"
 networks:
  argus-sys-net:
    external: true
 services:
  metric-gpu-node:
    image: ${NODE_GPU_BUNDLE_IMAGE_TAG:-argus-sys-metric-test-node-bundle-gpu:${PKG_VERSION}}
    container_name: argus-metric-gpu-node-swarm
    hostname: ${GPU_NODE_HOSTNAME}
    restart: unless-stopped
    privileged: true
    runtime: nvidia
    environment:
      - TZ=Asia/Shanghai
      - DEBIAN_FRONTEND=noninteractive
      - MASTER_ENDPOINT=${MASTER_ENDPOINT:-http://master.argus.com:3000}
      # Fluent Bit / log shipping target (fixed domain name)
      - ES_HOST=es.log.argus.com
      - ES_PORT=9200
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
      - AGENT_ENV=${AGENT_ENV}
      - AGENT_USER=${AGENT_USER}
      - AGENT_INSTANCE=${AGENT_INSTANCE}
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - GPU_MODE=gpu
    networks:
      argus-sys-net:
        aliases:
          - ${AGENT_INSTANCE}.node.argus.com
    volumes:
      - ../private/argus/agent:/private/argus/agent
      - ../logs/infer:/logs/infer
      - ../logs/train:/logs/train
    command: ["sleep", "infinity"]
--- a/deployment_new/templates/client_gpu/docs/INSTALL_CLIENT_zh.md
+++ b/deployment_new/templates/client_gpu/docs/INSTALL_CLIENT_zh.md
@ -0,0 +1,73 @@
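 # Argus Client‑GPU Installation Guide (deployment_new)
 ## 1. Prerequisites (confirm before you start)
 - The GPU node has the NVIDIA driver installed and `nvidia-smi` works.
 - Docker and Docker Compose v2 are installed.
 - Run the installation as the shared `argus` account (UID=2133, GID=2015), which must be in the `docker` group (skip the commands below if the account already exists):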
  ```bash
  sudo groupadd --gid 2015 argus || true
  sudo useradd --uid 2133 --gid 2015 --create-home --shell /bin/bash argus || true
  sudo passwd argus
  sudo usermod -aG docker argus
  su - argus -c 'id; docker ps >/dev/null && echo OK || echo NO_DOCKER_PERMISSION'
  ```
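  Unpack the package and run all later steps (config/install/uninstall) as the `argus` account.
 - Obtain `cluster-info.env` from whoever installed the Server (it contains `SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_*`; `BINDIP`/`FTPIP` are no longer used under the compose architecture).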
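Before running the config step, it can help to confirm that `cluster-info.env` carries the Swarm fields the client scripts read. The following is a minimal sketch, not part of the package: `check_cluster_info` is a hypothetical helper name, and it assumes the file uses plain `KEY=VALUE` lines without quoting or `export` prefixes.

```shell
# Hypothetical helper (not shipped in the package): verify that a
# cluster-info.env file defines the Swarm keys that config.sh requires.
# Assumes plain KEY=VALUE lines.
check_cluster_info() {
  local file="$1" key missing=0
  [ -f "$file" ] || { echo "not found: $file" >&2; return 2; }
  for key in SWARM_MANAGER_ADDR SWARM_JOIN_TOKEN_WORKER; do
    # A key counts as present only if it has a non-empty value.
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing or empty: $key" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A call such as `check_cluster_info ./cluster-info.env && ./scripts/config.sh` would stop early when a field is absent.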
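 ## 2. Unpack
 - `tar -xzf client_gpu_YYYYMMDD.tar.gz`
 - Enter the extracted directory: `cd YYYYMMDD/` (the archive built by `make_client_gpu_package.sh` unpacks to a directory named after the version)
 - You should see: `images/` (GPU bundle, busybox), `compose/`, `scripts/`, `docs/`.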
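The package also carries a `checksums.txt` produced with `sha256sum`, so the unpacked files can be verified before installing. The sketch below is an assumption-based helper (`verify_package` is a hypothetical name, not a shipped script). It assumes paths in `checksums.txt` are relative to the package root, and it skips any entry for `checksums.txt` itself, because a file listing its own hash cannot match once the list is complete.

```shell
# Hypothetical helper: verify an unpacked package against its checksums.txt.
# Assumes checksums.txt holds `sha256sum` output with paths relative to the
# package root.
verify_package() {
  local pkg_root="$1"
  [ -f "$pkg_root/checksums.txt" ] || { echo "no checksums.txt in $pkg_root" >&2; return 2; }
  # Drop the list's entry for itself (if any), then check everything else.
  (cd "$pkg_root" && grep -v ' \./checksums\.txt$' checksums.txt | sha256sum -c -)
}
```

Running `verify_package .` from the package root prints one line per file and returns non-zero if any file fails verification.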
 ## 3. Configure: config (warm up the overlay + generate .env)
 Commands:
 ```
 cp /path/to/cluster-info.env ./   # or: export CLUSTER_INFO=/abs/path/cluster-info.env
 ./scripts/config.sh
 ```
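 What the script does:
 - Reads `cluster-info.env` and runs `docker swarm join` (idempotent).
 - Warms up the external overlay `argus-sys-net` with busybox and waits up to 60 s for it to become visible on this host.
 - Creates or updates `compose/.env`: fills in `SWARM_*` and keeps the `AGENT_*` and `GPU_NODE_HOSTNAME` values you already entered (they are never overwritten).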
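The "wait up to 60 s" behavior is a plain polling loop. A minimal sketch of that pattern follows; `wait_until` is a hypothetical helper name used for illustration and is not how the shipped script is structured.

```shell
# Hypothetical helper: run a command once per second until it succeeds or
# the timeout (in seconds) elapses. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout="$1" elapsed=0
  shift
  while [ "$elapsed" -lt "$timeout" ]; do
    "$@" >/dev/null 2>&1 && return 0
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Example use, mirroring the overlay check described above
# (requires a Docker daemon that has joined the Swarm):
#   wait_until 60 docker network inspect argus-sys-net
```

Polling the command's exit status avoids parsing its output, at the cost of up to one second of extra latency per attempt.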
 How to tell it succeeded:
 - The terminal reports that the overlay was warmed up and `compose/.env` was generated, and that `scripts/install.sh` can now be run;
 - `compose/.env` contains at least:
  - `AGENT_ENV/AGENT_USER/AGENT_INSTANCE/GPU_NODE_HOSTNAME` (you must fill these in beforehand);
  - `SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_*`;
  - `NODE_GPU_BUNDLE_IMAGE_TAG=...:YYYYMMDD`.
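To check the list above without opening the file, a small report can be printed from `compose/.env`. This is a sketch under assumptions: `env_get` and `report_env` are hypothetical helper names, and the file is assumed to contain plain `KEY=VALUE` lines.

```shell
# Hypothetical helper: print the value of the first KEY=VALUE line for KEY.
# Prints nothing if the key is absent. Values may themselves contain "=".
env_get() {
  local file="$1" key="$2"
  awk -F= -v k="$key" '$1 == k { sub(/^[^=]*=/, ""); print; exit }' "$file"
}

# Print "set" or "EMPTY" for each key this guide expects in compose/.env.
# Returns non-zero if any key is empty or missing.
report_env() {
  local file="$1" key state rc=0
  for key in AGENT_ENV AGENT_USER AGENT_INSTANCE GPU_NODE_HOSTNAME \
             SWARM_MANAGER_ADDR SWARM_JOIN_TOKEN_WORKER NODE_GPU_BUNDLE_IMAGE_TAG; do
    if [ -n "$(env_get "$file" "$key")" ]; then state="set"; else state="EMPTY"; rc=1; fi
    printf '%-28s %s\n' "$key" "$state"
  done
  return "$rc"
}
```

For example, `report_env compose/.env` lists every key with its status, which makes an unfilled `AGENT_*` entry easy to spot.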
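 ### Log mapping (important)
 - The container paths `/logs/infer` and `/logs/train` are mapped to `./logs/infer` and `./logs/train` under the package root:
  - You can read inference/training logs directly on the host: `tail -f logs/infer/*.log`, `tail -f logs/train/*.log`;
  - The install script creates both directories automatically.
 If the script reports missing required values:
 - Open `compose/.env`, fill in `AGENT_*` and `GPU_NODE_HOSTNAME` as prompted, then run `./scripts/config.sh` again (the script does not overwrite values you have already entered).
 ## 4. Install: install (load image + start container + follow logs)
 Command: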
 ```
 ./scripts/install.sh
 ```
 What the script does:
 - Warms up the overlay first if needed;
 - Loads `argus-sys-metric-test-node-bundle-gpu-*.tar.gz` from `images/` into the local Docker daemon;
 - Starts the GPU node container with `docker compose up -d`, then runs `docker logs -f argus-metric-gpu-node-swarm` to follow the installation.
 How to tell it succeeded:
 - The log shows: `[BOOT] local bundle install OK: version=...` / `dcgm-exporter ... listening` / `node state present: /private/argus/agent/<hostname>/node.json`;
 - `docker exec argus-metric-gpu-node-swarm nvidia-smi -L` lists the GPUs;
 - On the Server side, Prometheus `/api/v1/targets` shows at least one of the GPU node's 9100 (node-exporter) and 9400 (dcgm-exporter) targets as up.
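The last check can be scripted by filtering the Prometheus targets response. The sketch below assumes `jq` is installed and that each target's `instance` label ends in the exporter port (the Prometheus default of `host:port`; relabeling in this deployment may differ). `up_exporters` and `PROM_URL` are hypothetical names.

```shell
# Hypothetical helper: read a Prometheus /api/v1/targets response on stdin
# and print the instance label of every active target whose health is "up"
# and whose instance ends with one of the exporter ports named above.
up_exporters() {
  jq -r '.data.activeTargets[]
         | select(.health == "up")
         | .labels.instance
         | select(test(":(9100|9400)$"))'
}

# Example use (PROM_URL is an assumption; point it at the server's Prometheus):
#   curl -fsS "$PROM_URL/api/v1/targets" | up_exporters
```

An empty result means neither exporter on any node is reported as up, which is the case this section asks you to rule out.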
 ## 5. Uninstall: uninstall
 Command:
 ```
 ./scripts/uninstall.sh
 ```
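 Behavior: runs Compose down (if `.env` exists), then removes the warmup container and the node container.
 ## 6. Troubleshooting
 - Overlay not visible on this host: config/install already warm it up automatically; if it still fails, check network connectivity to the manager and whether `argus-sys-net` has been created on the manager.
 - busybox missing: make sure `images/busybox.tar` exists under the package root, or that the host already has `busybox:latest`.
 - Swarm join failed: confirm that `SWARM_MANAGER_ADDR` and `SWARM_JOIN_TOKEN_WORKER` in `cluster-info.env` are correct, or run `docker swarm join-token -q worker` on the manager again and update the file.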
--- a/deployment_new/templates/client_gpu/images/busybox.tar
+++ b/deployment_new/templates/client_gpu/images/busybox.tar
--- a/deployment_new/templates/client_gpu/scripts/config.sh
+++ b/deployment_new/templates/client_gpu/scripts/config.sh
@ -0,0 +1,90 @@
 #!/usr/bin/env bash
 set -euo pipefail
 ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 PKG_ROOT="$ROOT_DIR"
 ENV_EX="$PKG_ROOT/compose/.env.example"
 ENV_OUT="$PKG_ROOT/compose/.env"
 info(){ echo -e "\033[34m[CONFIG-GPU]\033[0m $*"; }
 err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
 require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "missing dependency: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
 # Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
 require_compose(){
  if docker compose version >/dev/null 2>&1; then return 0; fi
  if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
  err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
 }
 require docker curl jq awk sed tar gzip
 require_compose
 # Disk space check (MB)
 check_disk(){ local p="$1"; local need=10240; local free
  free=$(df -Pm "$p" | awk 'NR==2{print $4+0}')
  if [[ -z "$free" || "$free" -lt "$need" ]]; then err "insufficient disk space: $p has ${free:-0}MB free (<${need}MB)"; return 1; fi
 }
 check_disk "$PKG_ROOT"; check_disk "/var/lib/docker" || true
 # Load cluster-info.env (defaults to the package root; CLUSTER_INFO can point elsewhere)
 CI_IN="${CLUSTER_INFO:-$PKG_ROOT/cluster-info.env}"
 info "Reading cluster-info.env: $CI_IN"
 [[ -f "$CI_IN" ]] || { err "cluster-info.env not found (defaults to the package root; set CLUSTER_INFO to an absolute path to override)"; exit 1; }
 set -a; source "$CI_IN"; set +a
 [[ -n "${SWARM_MANAGER_ADDR:-}" && -n "${SWARM_JOIN_TOKEN_WORKER:-}" ]] || { err "cluster-info.env is missing Swarm settings (SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_WORKER)"; exit 1; }
 # Join the Swarm (idempotent)
 info "Joining Swarm (idempotent): $SWARM_MANAGER_ADDR"
 docker swarm join --token "$SWARM_JOIN_TOKEN_WORKER" "$SWARM_MANAGER_ADDR":2377 >/dev/null 2>&1 || true
 # Load busybox, then warm up the overlay and check connectivity (always runs)
 NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
 # Prepare busybox
 if ! docker image inspect busybox:latest >/dev/null 2>&1; then
  if [[ -f "$PKG_ROOT/images/busybox.tar" ]]; then
    info "Loading busybox.tar to warm up the overlay"
    docker load -i "$PKG_ROOT/images/busybox.tar" >/dev/null
  else
    err "busybox image missing (neither images/busybox.tar in the package nor a local busybox:latest); cannot warm up overlay $NET_NAME"; exit 1
  fi
 fi
 # Warmup container (attaching it on the worker makes the overlay visible locally)
 docker rm -f argus-net-warmup >/dev/null 2>&1 || true
 info "Starting warmup container on overlay: $NET_NAME"
 docker run -d --rm --name argus-net-warmup --network "$NET_NAME" busybox:latest sleep 600 >/dev/null 2>&1 || true
 for i in {1..60}; do docker network inspect "$NET_NAME" >/dev/null 2>&1 && { info "overlay visible (t=${i}s)"; break; }; sleep 1; done
 docker network inspect "$NET_NAME" >/dev/null 2>&1 || { err "overlay $NET_NAME still not visible after warmup; confirm the manager has created it and is reachable"; exit 1; }
 # Test the real data path through the warmup container (alias -> master)
 if ! docker exec argus-net-warmup sh -lc "ping -c 1 -W 2 master.argus.com >/dev/null 2>&1"; then
  err "master.argus.com is unreachable by alias from the warmup container; confirm the server compose stack is up and attached to overlay $NET_NAME"
  exit 1
 fi
 info "master.argus.com is reachable from the warmup container (Docker DNS + alias OK)"
 # Create or update .env (manually entered values are kept)
 if [[ ! -f "$ENV_OUT" ]]; then
  cp "$ENV_EX" "$ENV_OUT"
 fi
 set_kv(){ local k="$1" v="$2"; if grep -q "^${k}=" "$ENV_OUT"; then sed -i -E "s#^${k}=.*#${k}=${v}#" "$ENV_OUT"; else echo "${k}=${v}" >> "$ENV_OUT"; fi }
 set_kv SWARM_MANAGER_ADDR "${SWARM_MANAGER_ADDR:-}"
 set_kv SWARM_JOIN_TOKEN_WORKER "${SWARM_JOIN_TOKEN_WORKER:-}"
 set_kv SWARM_JOIN_TOKEN_MANAGER "${SWARM_JOIN_TOKEN_MANAGER:-}"
 REQ_VARS=(AGENT_ENV AGENT_USER AGENT_INSTANCE GPU_NODE_HOSTNAME)
 missing=()
 for v in "${REQ_VARS[@]}"; do
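  # "|| true": under pipefail, a key absent from the file would otherwise abort the script silently here
  val=$(grep -E "^$v=" "$ENV_OUT" | head -1 | cut -d= -f2- || true)
  if [[ -z "$val" ]]; then missing+=("$v"); fi
 done
 if [[ ${#missing[@]} -gt 0 ]]; then
  err "These variables must be set in compose/.env: ${missing[*]} (your existing entries are kept and will not be overwritten)"; exit 1; fi
 info "compose/.env generated; you can now run scripts/install.sh"
 # Create the host log directories and set their permissions (idempotent; allows inspection before install)
 mkdir -p "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer"
 chmod 1777 "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer" || true
 info "Log directory permissions (expected 1777, sticky bit set):"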
 stat -c '%a %U:%G %n' "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer" 2>/dev/null || true
--- a/deployment_new/templates/client_gpu/scripts/install.sh
+++ b/deployment_new/templates/client_gpu/scripts/install.sh
@ -0,0 +1,72 @@
 #!/usr/bin/env bash
 set -euo pipefail
 ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 PKG_ROOT="$ROOT_DIR"
 ENV_FILE="$PKG_ROOT/compose/.env"
 COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
 info(){ echo -e "\033[34m[INSTALL-GPU]\033[0m $*"; }
 err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
 require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "missing dependency: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
 # Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
 require_compose(){
  if docker compose version >/dev/null 2>&1; then return 0; fi
  if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
  err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
 }
 require docker nvidia-smi
 require_compose
 [[ -f "$ENV_FILE" ]] || { err "compose/.env is missing; run scripts/config.sh first"; exit 1; }
 info "Using env file: $ENV_FILE"
 # Warm up the overlay (the warmup container may be gone if config ran long ago or containers were cleaned up)
 set -a; source "$ENV_FILE"; set +a
 NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
 info "Checking overlay network visibility: $NET_NAME"
 if ! docker network inspect "$NET_NAME" >/dev/null 2>&1; then
  # If the overlay is not visible, try a busybox warmup (only to make sure this worker has joined the overlay)
  if ! docker image inspect busybox:latest >/dev/null 2>&1; then
    if [[ -f "$PKG_ROOT/images/busybox.tar" ]]; then docker load -i "$PKG_ROOT/images/busybox.tar"; else err "busybox image missing (images/busybox.tar or a local busybox:latest)"; exit 1; fi
  fi
  docker rm -f argus-net-warmup >/dev/null 2>&1 || true
  docker run -d --rm --name argus-net-warmup --network "$NET_NAME" busybox:latest sleep 600 >/dev/null 2>&1 || true
  for i in {1..60}; do docker network inspect "$NET_NAME" >/dev/null 2>&1 && break; sleep 1; done
  docker network inspect "$NET_NAME" >/dev/null 2>&1 || { err "overlay $NET_NAME still not visible after warmup; confirm the manager has created it and is reachable"; exit 1; }
  info "overlay is now visible (warmup=argus-net-warmup)"
 fi
 # If a warmup container is running (for example, one recreated above), test the alias data path once more
 if docker ps --format '{{.Names}}' | grep -q '^argus-net-warmup$'; then
  if ! docker exec argus-net-warmup sh -lc "ping -c 1 -W 2 master.argus.com >/dev/null 2>&1"; then
    err "GPU install stage: master.argus.com is unreachable by alias from the warmup container; check overlay $NET_NAME and the server status"
    exit 1
  fi
  info "GPU install stage: master.argus.com is reachable from the warmup container"
 fi
 fi
 # Load the GPU bundle image
 IMG_TGZ=$(ls -1 "$PKG_ROOT"/images/argus-sys-metric-test-node-bundle-gpu-*.tar.gz 2>/dev/null | head -1 || true)
 [[ -n "$IMG_TGZ" ]] || { err "GPU bundle image tar.gz not found"; exit 1; }
 info "Loading GPU bundle image: $(basename "$IMG_TGZ")"
 tmp=$(mktemp); gunzip -c "$IMG_TGZ" > "$tmp"; docker load -i "$tmp" >/dev/null; rm -f "$tmp"
 # Make sure the host log directories exist (mapped to /logs/infer and /logs/train) and set mode 1777 (sticky bit)
 mkdir -p "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train"
 chmod 1777 "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" || true
 info "Log directories ready with mode 1777: logs/infer logs/train"
 stat -c '%a %U:%G %n' "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" 2>/dev/null || true
 # Start compose and follow the logs
 PROJECT="${COMPOSE_PROJECT_NAME:-argus-client}"
 info "Starting GPU node (docker compose -p $PROJECT up -d)"
 docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" up -d
 docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps
 # Re-apply host log directory permissions in case scripts inside the container changed the bind mounts
 chmod 1777 "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" || true
 stat -c '%a %U:%G %n' "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" 2>/dev/null || true
 info "Following node container logs (press Ctrl+C to stop)"
 docker logs -f argus-metric-gpu-node-swarm || true
--- a/deployment_new/templates/client_gpu/scripts/uninstall.sh
+++ b/deployment_new/templates/client_gpu/scripts/uninstall.sh
@ -0,0 +1,36 @@
 #!/usr/bin/env bash
 set -euo pipefail
 ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 PKG_ROOT="$ROOT_DIR"
 ENV_FILE="$PKG_ROOT/compose/.env"
 COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
 # load COMPOSE_PROJECT_NAME if provided in compose/.env
 if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
 PROJECT="${COMPOSE_PROJECT_NAME:-argus-client}"
 info(){ echo -e "\033[34m[UNINSTALL-GPU]\033[0m $*"; }
 err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
 # Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
 require_compose(){
  if docker compose version >/dev/null 2>&1; then return 0; fi
  if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
  err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
 }
 require_compose
 if [[ -f "$ENV_FILE" ]]; then
  info "stopping compose project (project=$PROJECT)"
  docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" down --remove-orphans || true
 else
  info "compose/.env not found; attempting to remove container by name"
 fi
 # remove warmup container if still running
 docker rm -f argus-net-warmup >/dev/null 2>&1 || true
 # remove node container if present
 docker rm -f argus-metric-gpu-node-swarm >/dev/null 2>&1 || true
 info "uninstall completed"
--- a/deployment_new/templates/server/compose/docker-compose.yml
+++ b/deployment_new/templates/server/compose/docker-compose.yml
@ -0,0 +1,169 @@
 version: "3.8"
 networks:
  argus-sys-net:
    external: true
 services:
  master:
    image: ${MASTER_IMAGE_TAG:-argus-master:${PKG_VERSION}}
    container_name: argus-master-sys
    environment:
      - OFFLINE_THRESHOLD_SECONDS=180
      - ONLINE_THRESHOLD_SECONDS=120
      - SCHEDULER_INTERVAL_SECONDS=30
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    ports:
      - "${MASTER_PORT:-32300}:3000"
    volumes:
      - ../private/argus/master:/private/argus/master
      - ../private/argus/metric/prometheus:/private/argus/metric/prometheus
      - ../private/argus/etc:/private/argus/etc
    networks:
      argus-sys-net:
        aliases:
          - master.argus.com
    restart: unless-stopped
  es:
    image: ${ES_IMAGE_TAG:-argus-elasticsearch:${PKG_VERSION}}
    container_name: argus-es-sys
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    volumes:
      - ../private/argus/log/elasticsearch:/private/argus/log/elasticsearch
      - ../private/argus/etc:/private/argus/etc
    ports:
      - "${ES_HTTP_PORT:-9200}:9200"
    restart: unless-stopped
    networks:
      argus-sys-net:
        aliases:
          - es.log.argus.com
  kibana:
    image: ${KIBANA_IMAGE_TAG:-argus-kibana:${PKG_VERSION}}
    container_name: argus-kibana-sys
    environment:
      - ELASTICSEARCH_HOSTS=http://es.log.argus.com:9200
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    volumes:
      - ../private/argus/log/kibana:/private/argus/log/kibana
      - ../private/argus/etc:/private/argus/etc
    depends_on: [es]
    ports:
      - "${KIBANA_PORT:-5601}:5601"
    restart: unless-stopped
    networks:
      argus-sys-net:
        aliases:
          - kibana.log.argus.com
  prometheus:
    image: ${PROM_IMAGE_TAG:-argus-metric-prometheus:${PKG_VERSION}}
    container_name: argus-prometheus
    restart: unless-stopped
    environment:
      - TZ=Asia/Shanghai
      - PROMETHEUS_BASE_PATH=/private/argus/metric/prometheus
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    ports:
      - "${PROMETHEUS_PORT:-9090}:9090"
    volumes:
      - ../private/argus/metric/prometheus:/private/argus/metric/prometheus
      - ../private/argus/etc:/private/argus/etc
    networks:
      argus-sys-net:
        aliases:
          - prom.metric.argus.com
  grafana:
    image: ${GRAFANA_IMAGE_TAG:-argus-metric-grafana:${PKG_VERSION}}
    container_name: argus-grafana
    restart: unless-stopped
    environment:
      - TZ=Asia/Shanghai
      - GRAFANA_BASE_PATH=/private/argus/metric/grafana
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
      - GF_SERVER_HTTP_PORT=3000
      - GF_LOG_LEVEL=warn
      - GF_LOG_MODE=console
      - GF_PATHS_PROVISIONING=/private/argus/metric/grafana/provisioning
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
    ports:
      - "${GRAFANA_PORT:-3000}:3000"
    volumes:
      - ../private/argus/metric/grafana:/private/argus/metric/grafana
      - ../private/argus/etc:/private/argus/etc
    depends_on: [prometheus]
    networks:
      argus-sys-net:
        aliases:
          - grafana.metric.argus.com
  alertmanager:
    image: ${ALERT_IMAGE_TAG:-argus-alertmanager:${PKG_VERSION}}
    container_name: argus-alertmanager
    environment:
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    volumes:
      - ../private/argus/etc:/private/argus/etc
      - ../private/argus/alert/alertmanager:/private/argus/alert/alertmanager
    networks:
      argus-sys-net:
        aliases:
          - alertmanager.alert.argus.com
    ports:
      - "${ALERTMANAGER_PORT:-9093}:9093"
    restart: unless-stopped
  web-frontend:
    image: ${FRONT_IMAGE_TAG:-argus-web-frontend:${PKG_VERSION}}
    container_name: argus-web-frontend
    environment:
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
      - EXTERNAL_MASTER_PORT=${WEB_PROXY_PORT_8085:-8085}
      - EXTERNAL_ALERTMANAGER_PORT=${WEB_PROXY_PORT_8084:-8084}
      - EXTERNAL_GRAFANA_PORT=${WEB_PROXY_PORT_8081:-8081}
      - EXTERNAL_PROMETHEUS_PORT=${WEB_PROXY_PORT_8082:-8082}
      - EXTERNAL_KIBANA_PORT=${WEB_PROXY_PORT_8083:-8083}
    volumes:
      - ../private/argus/etc:/private/argus/etc
    networks:
      argus-sys-net:
        aliases:
          - web.argus.com
    restart: unless-stopped
  web-proxy:
    image: ${WEB_PROXY_IMAGE_TAG:-argus-web-proxy:${PKG_VERSION}}
    container_name: argus-web-proxy
    depends_on: [master, grafana, prometheus, kibana, alertmanager]
    environment:
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    volumes:
      - ../private/argus/etc:/private/argus/etc
    networks:
      argus-sys-net:
        aliases:
          - proxy.argus.com
    ports:
      - "${WEB_PROXY_PORT_8080:-8080}:8080"
      - "${WEB_PROXY_PORT_8081:-8081}:8081"
      - "${WEB_PROXY_PORT_8082:-8082}:8082"
      - "${WEB_PROXY_PORT_8083:-8083}:8083"
      - "${WEB_PROXY_PORT_8084:-8084}:8084"
      - "${WEB_PROXY_PORT_8085:-8085}:8085"
    restart: unless-stopped
--- a/deployment_new/templates/server/docs/INSTALL_SERVER_zh.md
+++ b/deployment_new/templates/server/docs/INSTALL_SERVER_zh.md
@ -0,0 +1,102 @@
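 # Argus Server Installation Guide (deployment_new)
 Scope: deploying the Argus server-side components from the Server installation package, as a single stack on Docker Swarm with an external overlay network.
 This guide focuses on what to do, what to check, and what must hold before you move on.
 ## 1. Prerequisites (confirm before starting)
 - Docker and Docker Compose v2 are installed; `docker info` works; `docker compose version` runs.
 - You have root/sudo privileges and at least 10 GB of free disk space (for the package root and `/var/lib/docker`).
 - You know this host's management address (SWARM_MANAGER_ADDR); the IP belongs to one of the host's network interfaces and is reachable from the other nodes.
 - Important: run all subsequent installation and operations steps as the unified account `argus` (UID=2133, GID=2015), and add it to the `docker` group. Example commands follow (if you need a different UID/GID, substitute your organization's standard values):
  ```bash
  # 1) Create the primary group (GID=2015, group name argus; skip if it already exists)
  sudo groupadd --gid 2015 argus || true
  # 2) Create the user argus (UID=2133, primary group GID=2015, with a home directory and bash as the default shell; if the user already exists, adjust it with usermod)
  sudo useradd --uid 2133 --gid 2015 --create-home --shell /bin/bash argus || true
  sudo passwd argus
  # 3) Add argus to the docker group so it can talk to the Docker daemon (takes effect at the next login)
  sudo usermod -aG docker argus
  # 4) Verify (log in again or run newgrp docker for the group to take effect)
  su - argus -c 'id; docker ps >/dev/null && echo OK || echo NO_DOCKER_PERMISSION'
  ```
  All later unpacking and script runs (config/install/selfcheck, etc.) are done as this `argus` account.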
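 ## 2. Unpacking and Directory Layout
 - Unpack: `tar -xzf server_YYYYMMDD.tar.gz`.
 - Enter the directory: `cd server_YYYYMMDD/`
 - You should see:
  - `images/` (one image tar.gz per service, e.g. `argus-master-YYYYMMDD.tar.gz`)
  - `compose/` (`docker-compose.yml` and `.env.example`)
  - `scripts/` (installation and operations scripts)
  - `private/argus/` (data and configuration skeleton)
  - `docs/` (Chinese documentation)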
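 ## 3. Configuration: config (generates .env and SWARM_MANAGER_ADDR)
 Commands:
 ```
 export SWARM_MANAGER_ADDR=<management IP of this host>
 ./scripts/config.sh
 ```
 What the script does:
 - Checks dependencies and disk space;
 - Allocates every service port automatically, starting at port 20000, making sure each port is not in use on the system and that no two services share a port;
 - Writes `compose/.env` (ports, image tags, FTP account, overlay name, and so on);
 - Writes the UID/GID of the invoking account to `ARGUS_BUILD_UID/GID` (if the primary group is named docker, it uses the GID of the group named after the user instead, to avoid picking up the docker group's GID 999);
 - Updates or appends `SWARM_MANAGER_ADDR` in `cluster-info.env` (other keys are left untouched).
 What success looks like:
 - Terminal output: `已生成 compose/.env 并更新 cluster-info.env 的 SWARM_MANAGER_ADDR。`
 - Opening `compose/.env` should show:
  - all ports ≥ 20000 with no duplicates;
  - `ARGUS_BUILD_UID/GID` matching `id -u/-g`;
  - `SWARM_MANAGER_ADDR=<your IP>`.
 If something goes wrong:
 - A port is unexpectedly in use: delete `.env` and run `config.sh` again, or edit the ports by hand and then run `install.sh`.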
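
The port allocation described in this section can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code in `config.sh`: `allocate_ports` and `is_busy` are hypothetical names, and `BUSY_PORTS` is a hard-coded stand-in for a real probe of listening sockets.

```shell
# Sketch only: BUSY_PORTS replaces a real "is this port listening?" probe.
BUSY_PORTS="20001 20003"

# Return 0 if the given port appears in BUSY_PORTS.
is_busy() {
  case " $BUSY_PORTS " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# allocate_ports START KEY...
# Print KEY=PORT for each key, skipping busy ports and never reusing a port.
allocate_ports() {
  local cur="$1"
  shift
  local key
  for key in "$@"; do
    while is_busy "$cur"; do
      cur=$((cur + 1))
    done
    echo "${key}=${cur}"
    cur=$((cur + 1))
  done
}

allocate_ports 20000 MASTER_PORT ES_HTTP_PORT KIBANA_PORT
```

Because the cursor only moves forward, each key receives a distinct port, which is why the generated `.env` should contain no duplicates.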
 ## 4. Installation: install (one pass)
 Command:
 ```
 ./scripts/install.sh
 ```
 What the script does:
 - If Swarm is not active: runs `docker swarm init --advertise-addr $SWARM_MANAGER_ADDR`;
 - Ensures the external overlay network `argus-sys-net` exists;
 - Loads `images/*.tar.gz` into the local Docker;
 - Starts the services with `docker compose up -d`;
 - Waits for six readiness checks:
  - Master `/readyz`=200, ES `/_cluster/health`=200, Prometheus TCP reachable, Grafana `/api/health`=200, Alertmanager `/api/v2/status`=200, Kibana `/api/status` level=available;
 - Writes each service's overlay IP to `private/argus/etc/<domain>`, then reloads Bind9 and Nginx;
 - Writes `cluster-info.env` (containing `SWARM_JOIN_TOKEN_{WORKER,MANAGER}/SWARM_MANAGER_ADDR`; the compose architecture no longer depends on BINDIP/FTPIP);
 - Generates `安装报告_YYYYMMDD-HHMMSS.md` (ports, health-check summary, and hints).
 What success looks like:
 - `docker compose ps` shows every service as Up;
 - Every HTTP check in `安装报告_…md` reads 200/available;
 - `cluster-info.env` contains three key entries:
  - `SWARM_MANAGER_ADDR=...`
  - `SWARM_JOIN_TOKEN_WORKER=SWMTKN-...`
  - `SWARM_JOIN_TOKEN_MANAGER=SWMTKN-...`
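
A quick way to confirm that `cluster-info.env` carries the keys `install.sh` writes is sketched below. `missing_keys` is a hypothetical helper, not part of the package; it treats a key with an empty value as missing, since `install.sh` can write an empty token when the Swarm query fails.

```shell
# missing_keys FILE KEY...
# Print every key that is absent from FILE or has an empty value.
missing_keys() {
  local file="$1"
  shift
  local key
  for key in "$@"; do
    # A key counts as present only when something follows the "=".
    grep -Eq "^${key}=.+" "$file" || echo "$key"
  done
}

# Example call (the path is an assumption; run it from the package root):
# missing_keys ./cluster-info.env SWARM_MANAGER_ADDR SWARM_JOIN_TOKEN_WORKER SWARM_JOIN_TOKEN_MANAGER
```

An empty result means all listed keys are present and non-empty.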
 ## 5. Health Self-Check and Common Operations
 - Health self-check: `./scripts/selfcheck.sh`
  - Expected output: `selfcheck OK -> logs/selfcheck.json`
  - In `logs/selfcheck.json`, `overlay_net/es/kibana/master_readyz/ftp_share_writable/prometheus/grafana/alertmanager/web_proxy_cors` are all true.
 - Status: `./scripts/status.sh` (equivalent to `docker compose ps`).
 - Diagnostics: `./scripts/diagnose.sh` (collects container/HTTP/CORS/ES details and writes them to `logs/diagnose_*.log`).
 - Uninstall: `./scripts/uninstall.sh` (Compose down).
 - Temporarily relax or restore the ES disk watermarks: `./scripts/es-watermark-relax.sh` / `./scripts/es-watermark-restore.sh`.
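
When the self-check does not report OK, it helps to list which checks failed. The sketch below assumes `logs/selfcheck.json` is a flat JSON object whose values are plain booleans; that layout is an assumption, as is the helper name `failed_checks`.

```shell
# failed_checks FILE
# Print the name of every key whose value is the literal false.
# Assumes a flat JSON object of booleans, e.g. {"es": true, "kibana": false}.
failed_checks() {
  grep -Eo '"[A-Za-z0-9_]+"[[:space:]]*:[[:space:]]*false' "$1" \
    | sed -E 's/^"([^"]+)".*/\1/'
}

# Example call (path relative to the package root):
# failed_checks logs/selfcheck.json
```

If the real file nests its results, a JSON-aware tool such as `jq` would be the safer choice.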
 ## 6. Next Step: Distribute cluster-info.env to the Client
 - Copy `cluster-info.env` to whoever installs the Client;
 - They place the file in the package root on the Client machine (or set `CLUSTER_INFO=/absolute/path`).
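 ## 7. Troubleshooting at a Glance
 - Proxy returns 502 or connections to 8080 are reset: usually the Bind domain records have not been updated to the overlay IPs; rerun `install.sh` (it writes the private domain files and reloads) or inspect `logs/diagnose_error.log`.
 - Kibana is not available: wait 1–2 minutes and check the `argus-kibana-sys` logs;
 - SWARM_MANAGER_ADDR is empty in cluster-info.env: rerun `export SWARM_MANAGER_ADDR=<IP>; ./scripts/config.sh` or `./scripts/install.sh` (which reads the value back from `.env` and fills it in).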
--- a/deployment_new/templates/server/docs/SWARM_DEPLOY_zh.md
+++ b/deployment_new/templates/server/docs/SWARM_DEPLOY_zh.md
@ -0,0 +1,7 @@
 # Docker Swarm Deployment Notes
 - Initialize Swarm: `docker swarm init --advertise-addr <SWARM_MANAGER_ADDR>`
 - Create the overlay network: `docker network create --driver overlay --attachable argus-sys-net`
 - The Server package's `install.sh` performs both steps automatically; if you run them by hand, make sure `argus-sys-net` exists and is attachable.
 - Join a worker node: `docker swarm join --token <worker_token> <SWARM_MANAGER_ADDR>:2377`.
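
For a manual worker join, the command can be assembled from the values stored in `cluster-info.env`. The sketch below only prints the command so it can be reviewed first; `print_join_cmd` is a hypothetical helper and not one of the packaged scripts.

```shell
# print_join_cmd FILE
# Print (do not run) the worker join command built from a cluster-info.env file.
print_join_cmd() {
  local ci_file="$1"
  local addr token
  addr="$(sed -n 's/^SWARM_MANAGER_ADDR=//p' "$ci_file" | head -n1)"
  token="$(sed -n 's/^SWARM_JOIN_TOKEN_WORKER=//p' "$ci_file" | head -n1)"
  if [ -z "$addr" ] || [ -z "$token" ]; then
    echo "missing SWARM_MANAGER_ADDR or SWARM_JOIN_TOKEN_WORKER in $ci_file" >&2
    return 1
  fi
  echo "docker swarm join --token $token $addr:2377"
}

# Example call (the path is an assumption):
# print_join_cmd ./cluster-info.env
```

Printing instead of executing keeps the helper safe to run on any machine, including one that should not join the cluster.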
--- a/deployment_new/templates/server/docs/TROUBLESHOOTING_zh.md
+++ b/deployment_new/templates/server/docs/TROUBLESHOOTING_zh.md
@ -0,0 +1,11 @@
 # Troubleshooting (Server)
 - Port conflicts: see the port table in `安装报告_*.md`; to change a port, edit `compose/.env` and then run `docker compose ... up -d`.
 - A component is not ready:
  - Master: `curl http://127.0.0.1:${MASTER_PORT}/readyz -I`
  - ES: `curl http://127.0.0.1:${ES_HTTP_PORT}/_cluster/health`
  - Grafana: `curl http://127.0.0.1:${GRAFANA_PORT}/api/health`
  - Prometheus TCP: `exec 3<>/dev/tcp/127.0.0.1/${PROMETHEUS_PORT}`
 - Name resolution: inside the `argus-web-proxy` or `argus-master-sys` container, run `getent hosts master.argus.com`.
 - Swarm/Overlay: check `docker network ls | grep argus-sys-net`, or `docker node ls`.
--- a/deployment_new/templates/server/scripts/config.sh
+++ b/deployment_new/templates/server/scripts/config.sh
@ -0,0 +1,111 @@
 #!/usr/bin/env bash
 set -euo pipefail
 ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 PKG_ROOT="$ROOT_DIR"
 ENV_EX="$PKG_ROOT/compose/.env.example"
 ENV_OUT="$PKG_ROOT/compose/.env"
 info(){ echo -e "\033[34m[CONFIG]\033[0m $*"; }
 err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
 require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "缺少依赖: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
 # Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
 require_compose(){
  if docker compose version >/dev/null 2>&1; then return 0; fi
  if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
  err "未检测到 Docker Compose，请安装 docker compose v2 或 docker-compose v1"; exit 1
 }
 require docker curl jq awk sed tar gzip
 require_compose
 # Disk space check (MB)
 check_disk(){ local p="$1"; local need=10240; local free
  free=$(df -Pm "$p" | awk 'NR==2{print $4+0}')
  if [[ -z "$free" || "$free" -lt "$need" ]]; then err "磁盘空间不足: $p 剩余 ${free:-0}MB (<${need}MB)"; return 1; fi
 }
 check_disk "$PKG_ROOT"; check_disk "/var/lib/docker" || true
 # Read SWARM_MANAGER_ADDR from the environment, or prompt for it
 SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}
 if [[ -z "${SWARM_MANAGER_ADDR}" ]]; then
  read -rp "请输入本机管理地址 SWARM_MANAGER_ADDR: " SWARM_MANAGER_ADDR
 fi
 info "SWARM_MANAGER_ADDR=$SWARM_MANAGER_ADDR"
 # Verify that the IP belongs to a local network interface
 if ! ip -o addr | awk '{print $4}' | cut -d'/' -f1 | grep -qx "$SWARM_MANAGER_ADDR"; then
  err "SWARM_MANAGER_ADDR 非本机地址: $SWARM_MANAGER_ADDR"; exit 1; fi
 info "开始分配服务端口（起始=20000，避免系统占用与相互冲突）"
 is_port_used(){ local p="$1"; ss -tulnH 2>/dev/null | awk '{print $5}' | sed 's/.*://g' | grep -qx "$p"; }
 declare -A PRESENT=() CHOSEN=() USED=()
 START_PORT="${START_PORT:-20000}"; cur=$START_PORT
 ORDER=(MASTER_PORT ES_HTTP_PORT KIBANA_PORT PROMETHEUS_PORT GRAFANA_PORT ALERTMANAGER_PORT \
       WEB_PROXY_PORT_8080 WEB_PROXY_PORT_8081 WEB_PROXY_PORT_8082 WEB_PROXY_PORT_8083 WEB_PROXY_PORT_8084 WEB_PROXY_PORT_8085 \
       FTP_PORT FTP_DATA_PORT)
 # Mark the keys that actually exist in .env.example
 for key in "${ORDER[@]}"; do
  if grep -q "^${key}=" "$ENV_EX"; then PRESENT[$key]=1; fi
 done
 next_free(){ local p="$1"; while :; do if [[ -n "${USED[$p]:-}" ]] || is_port_used "$p"; then p=$((p+1)); else echo "$p"; return; fi; done; }
 for key in "${ORDER[@]}"; do
  [[ -z "${PRESENT[$key]:-}" ]] && continue
  p=$(next_free "$cur"); CHOSEN[$key]="$p"; USED[$p]=1; cur=$((p+1))
 done
 info "端口分配结果：MASTER=${CHOSEN[MASTER_PORT]:-} ES=${CHOSEN[ES_HTTP_PORT]:-} KIBANA=${CHOSEN[KIBANA_PORT]:-} PROM=${CHOSEN[PROMETHEUS_PORT]:-} GRAFANA=${CHOSEN[GRAFANA_PORT]:-} ALERT=${CHOSEN[ALERTMANAGER_PORT]:-} WEB_PROXY(8080..8085)=${CHOSEN[WEB_PROXY_PORT_8080]:-}/${CHOSEN[WEB_PROXY_PORT_8081]:-}/${CHOSEN[WEB_PROXY_PORT_8082]:-}/${CHOSEN[WEB_PROXY_PORT_8083]:-}/${CHOSEN[WEB_PROXY_PORT_8084]:-}/${CHOSEN[WEB_PROXY_PORT_8085]:-}"
 cp "$ENV_EX" "$ENV_OUT"
 # Overwrite the ports in .env with the de-duplicated allocation
 for key in "${ORDER[@]}"; do
  val="${CHOSEN[$key]:-}"
  [[ -z "$val" ]] && continue
  sed -i -E "s#^$key=.*#$key=${val}#" "$ENV_OUT"
 done
 info "已写入 compose/.env 的端口配置"
 # Add the overlay network name if it is missing
 grep -q '^ARGUS_OVERLAY_NET=' "$ENV_OUT" || echo 'ARGUS_OVERLAY_NET=argus-sys-net' >> "$ENV_OUT"
 # FTP defaults
 grep -q '^FTP_USER=' "$ENV_OUT" || echo 'FTP_USER=ftpuser' >> "$ENV_OUT"
 grep -q '^FTP_PASSWORD=' "$ENV_OUT" || echo 'FTP_PASSWORD=NASPlab1234!' >> "$ENV_OUT"
 # Record the invoking account's UID/GID (avoiding an accidental pick of the docker group)
 RUID=$(id -u)
 PRIMARY_GID=$(id -g)
 PRIMARY_GRP=$(id -gn)
 USER_NAME=$(id -un)
 # If the primary group resolves to docker, try the GID of the group named after the user; otherwise fall back to the primary GID
 if [[ "$PRIMARY_GRP" == "docker" ]]; then
  RGID=$(getent group "$USER_NAME" | awk -F: '{print $3}' 2>/dev/null || true)
  [[ -z "$RGID" ]] && RGID="$PRIMARY_GID"
 else
  RGID="$PRIMARY_GID"
 fi
 info "使用构建账户 UID:GID=${RUID}:${RGID} (user=$USER_NAME primary_group=$PRIMARY_GRP)"
 if grep -q '^ARGUS_BUILD_UID=' "$ENV_OUT"; then
  sed -i -E "s#^ARGUS_BUILD_UID=.*#ARGUS_BUILD_UID=${RUID}#" "$ENV_OUT"
 else
  echo "ARGUS_BUILD_UID=${RUID}" >> "$ENV_OUT"
 fi
 if grep -q '^ARGUS_BUILD_GID=' "$ENV_OUT"; then
  sed -i -E "s#^ARGUS_BUILD_GID=.*#ARGUS_BUILD_GID=${RGID}#" "$ENV_OUT"
 else
  echo "ARGUS_BUILD_GID=${RGID}" >> "$ENV_OUT"
 fi
 CI="$PKG_ROOT/cluster-info.env"
 if [[ -f "$CI" ]]; then
  if grep -q '^SWARM_MANAGER_ADDR=' "$CI"; then
    sed -i -E "s#^SWARM_MANAGER_ADDR=.*#SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}#" "$CI"
  else
    echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}" >> "$CI"
  fi
 else
  echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}" > "$CI"
 fi
 info "已生成 compose/.env 并更新 cluster-info.env 的 SWARM_MANAGER_ADDR。"
 info "下一步可执行: scripts/install.sh"
--- a/deployment_new/templates/server/scripts/diagnose.sh
+++ b/deployment_new/templates/server/scripts/diagnose.sh
@ -0,0 +1,114 @@
 #!/usr/bin/env bash
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 ENV_FILE="$ROOT/compose/.env"; [[ -f "$ENV_FILE" ]] && set -a && source "$ENV_FILE" && set +a
 ts="$(date -u +%Y%m%d-%H%M%SZ)"
 LOG_DIR="$ROOT/logs"; mkdir -p "$LOG_DIR" || true
 if ! ( : > "$LOG_DIR/.w" 2>/dev/null ); then LOG_DIR="/tmp/argus-logs"; mkdir -p "$LOG_DIR" || true; fi
 # load compose project for accurate ps output
 ENV_FILE="$ROOT/compose/.env"
 if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
 PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
 DETAILS="$LOG_DIR/diagnose_details_${ts}.log"; ERRORS="$LOG_DIR/diagnose_error_${ts}.log"; : > "$DETAILS"; : > "$ERRORS"
 logd() { echo "$(date '+%F %T') $*" >> "$DETAILS"; }
 append_err() { echo "$*" >> "$ERRORS"; }
 http_code() { curl -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
 http_body_head() { curl -s --max-time 3 "$1" 2>/dev/null | sed -n '1,5p' || true; }
 header_val() { curl -s -D - -o /dev/null "$@" | awk -F': ' 'BEGIN{IGNORECASE=1}$1=="Access-Control-Allow-Origin"{gsub("\r","",$2);print $2}'; }
 section() { local name="$1"; logd "===== [$name] ====="; }
 svc() {
  local svc_name="$1"; local cname="$2"; shift 2
  section "$svc_name ($cname)"
  logd "docker ps:"; docker ps -a --format '{{.Names}}\t{{.Status}}\t{{.Image}}' | awk -v n="$cname" '$1==n' >> "$DETAILS" || true
  logd "docker inspect:"; docker inspect -f '{{.State.Status}} rc={{.RestartCount}} started={{.State.StartedAt}}' "$cname" >> "$DETAILS" 2>&1 || true
  logd "last 200 container logs:"; docker logs --tail 200 "$cname" >> "$DETAILS" 2>&1 || true
  docker logs --tail 200 "$cname" 2>&1 | grep -Ei '\b(error|failed|fail|exception|panic|fatal|critical|unhealthy|permission denied|forbidden|refused|traceback|错误|失败)\b' | sed "s/^/[${svc_name}][container] /" >> "$ERRORS" || true
  if docker exec "$cname" sh -lc 'command -v supervisorctl >/dev/null 2>&1' >/dev/null 2>&1; then
    logd "supervisorctl status:"; docker exec "$cname" sh -lc 'supervisorctl status' >> "$DETAILS" 2>&1 || true
    local files; files=$(docker exec "$cname" sh -lc 'ls /var/log/supervisor/*.log 2>/dev/null' || true)
    for f in $files; do
      logd "tail -n 80 $f:"; docker exec "$cname" sh -lc "tail -n 80 $f 2>/dev/null || true" >> "$DETAILS" 2>&1 || true
      docker exec "$cname" sh -lc "tail -n 200 $f 2>/dev/null" 2>/dev/null | grep -Ei '\b(error|failed|fail|exception|panic|fatal|critical|unhealthy|permission denied|forbidden|refused|traceback|错误|失败)\b' | sed "s/^/[${svc_name}][supervisor:$(basename "$f")] /" >> "$ERRORS" || true
    done
  fi
 }
 svc bind argus-bind-sys
 svc master argus-master-sys
 svc es argus-es-sys
 svc kibana argus-kibana-sys
 svc ftp argus-ftp
 svc prometheus argus-prometheus
 svc grafana argus-grafana
 svc alertmanager argus-alertmanager
 svc web-frontend argus-web-frontend
 svc web-proxy argus-web-proxy
 section HTTP
 logd "ES: $(http_code "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health")"; http_body_head "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" >> "$DETAILS" 2>&1 || true
 logd "Kibana: $(http_code "http://localhost:${KIBANA_PORT:-5601}/api/status")"; http_body_head "http://localhost:${KIBANA_PORT:-5601}/api/status" >> "$DETAILS" 2>&1 || true
 logd "Master readyz: $(http_code "http://localhost:${MASTER_PORT:-32300}/readyz")"
 logd "Prometheus: $(http_code "http://localhost:${PROMETHEUS_PORT:-9090}/-/ready")"
 logd "Grafana: $(http_code "http://localhost:${GRAFANA_PORT:-3000}/api/health")"; http_body_head "http://localhost:${GRAFANA_PORT:-3000}/api/health" >> "$DETAILS" 2>&1 || true
 logd "Alertmanager: $(http_code "http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status")"
 cors8084=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8084:-8084}/api/v2/status" || true)
 cors8085=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8085:-8085}/api/v1/master/nodes" || true)
 logd "Web-Proxy 8080: $(http_code "http://localhost:${WEB_PROXY_PORT_8080:-8080}/")"
 logd "Web-Proxy 8083: $(http_code "http://localhost:${WEB_PROXY_PORT_8083:-8083}/")"
 logd "Web-Proxy 8084 CORS: ${cors8084}"
 logd "Web-Proxy 8085 CORS: ${cors8085}"
 section ES-CHECKS
 ch=$(curl -s --max-time 3 "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" || true)
 status=$(printf '%s' "$ch" | awk -F'\"' '/"status"/{print $4; exit}')
 if [[ -n "$status" ]]; then logd "cluster.status=$status"; fi
 if [[ "$status" != "green" ]]; then append_err "[es][cluster] status=$status"; fi
 if docker ps --format '{{.Names}}' | grep -q '^argus-es-sys$'; then
  duse=$(docker exec argus-es-sys sh -lc 'df -P /usr/share/elasticsearch/data | awk "NR==2{print \$5}"' 2>/dev/null || true)
  logd "es.data.df_use=$duse"; usep=${duse%%%}
  if [[ -n "$usep" ]] && (( usep >= 90 )); then append_err "[es][disk] data path usage=${usep}%"; fi
 fi
 section DNS-IN-PROXY
 for d in master.argus.com es.log.argus.com kibana.log.argus.com grafana.metric.argus.com prom.metric.argus.com alertmanager.alert.argus.com; do
  docker exec argus-web-proxy sh -lc "getent hosts $d || nslookup $d 2>/dev/null | tail -n+1" >> "$DETAILS" 2>&1 || true
 done
 logd "HTTP (web-proxy): master.readyz=$(docker exec argus-web-proxy sh -lc "curl -s -o /dev/null -w '%{http_code}' http://master.argus.com:3000/readyz" 2>/dev/null || echo 000)"
 logd "HTTP (web-proxy): es.health=$(docker exec argus-web-proxy sh -lc "curl -s -o /dev/null -w '%{http_code}' http://es.log.argus.com:9200/_cluster/health" 2>/dev/null || echo 000)"
 logd "HTTP (web-proxy): kibana.status=$(docker exec argus-web-proxy sh -lc "curl -s -o /dev/null -w '%{http_code}' http://kibana.log.argus.com:5601/api/status" 2>/dev/null || echo 000)"
 section FTP-SHARE
 docker exec argus-ftp sh -lc 'ls -ld /private/argus/ftp /private/argus/ftp/share; test -w /private/argus/ftp/share && echo "write:OK" || echo "write:FAIL"' >> "$DETAILS" 2>&1 || true
 section SYSTEM
 logd "uname -a:"; uname -a >> "$DETAILS"
 logd "docker version:"; docker version --format '{{.Server.Version}}' >> "$DETAILS" 2>&1 || true
 logd "compose ps (project=$PROJECT):"; (cd "$ROOT/compose" && docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f docker-compose.yml ps) >> "$DETAILS" 2>&1 || true
 section SUMMARY
 [[ $(http_code "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health") != 200 ]] && echo "[es][http] health not 200" >> "$ERRORS"
 kbcode=$(http_code "http://localhost:${KIBANA_PORT:-5601}/api/status"); [[ "$kbcode" != 200 ]] && echo "[kibana][http] /api/status=$kbcode" >> "$ERRORS"
 [[ $(http_code "http://localhost:${MASTER_PORT:-32300}/readyz") != 200 ]] && echo "[master][http] /readyz not 200" >> "$ERRORS"
 [[ $(http_code "http://localhost:${PROMETHEUS_PORT:-9090}/-/ready") != 200 ]] && echo "[prometheus][http] /-/ready not 200" >> "$ERRORS"
 gfcode=$(http_code "http://localhost:${GRAFANA_PORT:-3000}/api/health"); [[ "$gfcode" != 200 ]] && echo "[grafana][http] /api/health=$gfcode" >> "$ERRORS"
 [[ $(http_code "http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status") != 200 ]] && echo "[alertmanager][http] /api/v2/status not 200" >> "$ERRORS"
 [[ -z "$cors8084" ]] && echo "[web-proxy][cors] 8084 missing Access-Control-Allow-Origin" >> "$ERRORS"
 [[ -z "$cors8085" ]] && echo "[web-proxy][cors] 8085 missing Access-Control-Allow-Origin" >> "$ERRORS"
 sort -u -o "$ERRORS" "$ERRORS"
 echo "Diagnostic details -> $DETAILS"
 echo "Detected errors   -> $ERRORS"
 if [[ "$LOG_DIR" == "$ROOT/logs" ]]; then
  ln -sfn "$(basename "$DETAILS")" "$ROOT/logs/diagnose_details.log" 2>/dev/null || cp "$DETAILS" "$ROOT/logs/diagnose_details.log" 2>/dev/null || true
  ln -sfn "$(basename "$ERRORS")"  "$ROOT/logs/diagnose_error.log"   2>/dev/null || cp "$ERRORS"  "$ROOT/logs/diagnose_error.log"   2>/dev/null || true
 fi
 exit 0
--- a/deployment_new/templates/server/scripts/es-watermark-relax.sh
+++ b/deployment_new/templates/server/scripts/es-watermark-relax.sh
@ -0,0 +1,11 @@
 #!/usr/bin/env bash
 set -euo pipefail
 HOST="${1:-http://127.0.0.1:9200}"
 echo "设置 ES watermark 为 95%/96%/97%: $HOST"
 curl -fsS -XPUT "$HOST/_cluster/settings" -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "95%",
    "cluster.routing.allocation.disk.watermark.high": "96%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
 }' && printf '\nOK\n'
--- a/deployment_new/templates/server/scripts/es-watermark-restore.sh
+++ b/deployment_new/templates/server/scripts/es-watermark-restore.sh
@ -0,0 +1,11 @@
 #!/usr/bin/env bash
 set -euo pipefail
 HOST="${1:-http://127.0.0.1:9200}"
 echo "恢复 ES watermark 为默认值: $HOST"
 curl -fsS -XPUT "$HOST/_cluster/settings" -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null
  }
 }' && printf '\nOK\n'
--- a/deployment_new/templates/server/scripts/install.sh
+++ b/deployment_new/templates/server/scripts/install.sh
@ -0,0 +1,137 @@
 #!/usr/bin/env bash
 set -euo pipefail
 ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 PKG_ROOT="$ROOT_DIR"
 ENV_FILE="$PKG_ROOT/compose/.env"
 COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
 info(){ echo -e "\033[34m[INSTALL]\033[0m $*"; }
 err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
 require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "缺少依赖: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
 # Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
 require_compose(){
  if docker compose version >/dev/null 2>&1; then return 0; fi
  if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
  err "未检测到 Docker Compose，请安装 docker compose v2 或 docker-compose v1"; exit 1
 }
 require docker curl jq awk sed tar gzip
 require_compose
 [[ -f "$ENV_FILE" ]] || { err "缺少 compose/.env，请先运行 scripts/config.sh"; exit 1; }
 info "使用环境文件: $ENV_FILE"
 set -a; source "$ENV_FILE"; set +a
 # Compatibility: if .env does not define SWARM_MANAGER_ADDR, read it from an existing cluster-info.env so an empty value is not written
 SMADDR="${SWARM_MANAGER_ADDR:-}"
 CI_FILE="$PKG_ROOT/cluster-info.env"
 if [[ -z "$SMADDR" && -f "$CI_FILE" ]]; then
  SMADDR=$(sed -n 's/^SWARM_MANAGER_ADDR=\(.*\)$/\1/p' "$CI_FILE" | head -n1)
 fi
 SWARM_MANAGER_ADDR="$SMADDR"
 # Swarm init & overlay
 if ! docker info 2>/dev/null | grep -q "Swarm: active"; then
  [[ -n "${SWARM_MANAGER_ADDR:-}" ]] || { err "SWARM_MANAGER_ADDR 未设置，请在 scripts/config.sh 中配置"; exit 1; }
  info "初始化 Swarm (--advertise-addr $SWARM_MANAGER_ADDR)"
  docker swarm init --advertise-addr "$SWARM_MANAGER_ADDR" >/dev/null 2>&1 || true
 else
  info "Swarm 已激活"
 fi
 NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
 if ! docker network inspect "$NET_NAME" >/dev/null 2>&1; then
  info "创建 overlay 网络: $NET_NAME"
  docker network create -d overlay --attachable "$NET_NAME" >/dev/null
 else
  info "overlay 网络已存在: $NET_NAME"
 fi
 # Load images
 IMAGES_DIR="$PKG_ROOT/images"
 shopt -s nullglob
 tars=("$IMAGES_DIR"/*.tar.gz)
 if [[ ${#tars[@]} -eq 0 ]]; then err "images 目录为空，缺少镜像 tar.gz"; exit 1; fi
 total=${#tars[@]}; idx=0
 for tgz in "${tars[@]}"; do
  idx=$((idx+1))
  info "导入镜像 ($idx/$total): $(basename "$tgz")"
  tmp=$(mktemp); gunzip -c "$tgz" > "$tmp"; docker load -i "$tmp" >/dev/null; rm -f "$tmp"
 done
 shopt -u nullglob
 # Compose up
 PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
 info "启动服务栈 (docker compose -p $PROJECT up -d)"
 docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" up -d
 docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps
 # Wait readiness (best-effort)
 code(){ curl -4 -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
 prom_ok(){ (exec 3<>/dev/tcp/127.0.0.1/${PROMETHEUS_PORT:-9090}) >/dev/null 2>&1 && return 0 || return 1; }
 kb_ok(){ local body; body=$(curl -s "http://127.0.0.1:${KIBANA_PORT:-5601}/api/status" || true); echo "$body" | grep -q '"level"\s*:\s*"available"'; }
 RETRIES=${RETRIES:-60}; SLEEP=${SLEEP:-5}; ok=0
 info "等待基础服务就绪 (<= $((RETRIES*SLEEP))s)"
 for i in $(seq 1 "$RETRIES"); do
  e1=$(code "http://127.0.0.1:${MASTER_PORT:-32300}/readyz")
  e2=$(code "http://127.0.0.1:${ES_HTTP_PORT:-9200}/_cluster/health")
  e3=000; prom_ok && e3=200
  e4=$(code "http://127.0.0.1:${GRAFANA_PORT:-3000}/api/health")
  e5=$(code "http://127.0.0.1:${ALERTMANAGER_PORT:-9093}/api/v2/status")
  e6=$(kb_ok && echo 200 || echo 000)
  info "[ready] t=$((i*SLEEP))s master=$e1 es=$e2 prom=$e3 graf=$e4 alert=$e5 kibana=$e6"
  [[ "$e1" == 200 ]] && ok=$((ok+1))
  [[ "$e2" == 200 ]] && ok=$((ok+1))
  [[ "$e3" == 200 ]] && ok=$((ok+1))
  [[ "$e4" == 200 ]] && ok=$((ok+1))
  [[ "$e5" == 200 ]] && ok=$((ok+1))
  [[ "$e6" == 200 ]] && ok=$((ok+1))
  if [[ $ok -ge 6 ]]; then break; fi; ok=0; sleep "$SLEEP"
 done
 [[ $ok -ge 6 ]] || err "部分服务未就绪（可稍后重试 selfcheck）"
 # Swarm join tokens
 TOKEN_WORKER=$(docker swarm join-token -q worker 2>/dev/null || echo "")
 TOKEN_MANAGER=$(docker swarm join-token -q manager 2>/dev/null || echo "")
 # cluster-info.env (BINDIP/FTPIP are no longer needed in the compose setup)
 CI="$PKG_ROOT/cluster-info.env"
 info "写入 cluster-info.env (manager/token)"
 {
  echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}"
  echo "SWARM_JOIN_TOKEN_WORKER=${TOKEN_WORKER:-}"
  echo "SWARM_JOIN_TOKEN_MANAGER=${TOKEN_MANAGER:-}"
 } > "$CI"
 info "已输出 $CI"
 # Installation report
 ts=$(date +%Y%m%d-%H%M%S)
 RPT="$PKG_ROOT/安装报告_${ts}.md"
 {
  echo "# Argus Server 安装报告 (${ts})"
  echo
  echo "## 端口映射"
  echo "- MASTER_PORT=${MASTER_PORT}"
  echo "- ES_HTTP_PORT=${ES_HTTP_PORT}"
  echo "- KIBANA_PORT=${KIBANA_PORT}"
  echo "- PROMETHEUS_PORT=${PROMETHEUS_PORT}"
  echo "- GRAFANA_PORT=${GRAFANA_PORT}"
  echo "- ALERTMANAGER_PORT=${ALERTMANAGER_PORT}"
  echo "- WEB_PROXY_PORT_8080=${WEB_PROXY_PORT_8080} ... 8085=${WEB_PROXY_PORT_8085}"
  echo
  echo "## Swarm/Overlay"
  echo "- SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}" 
  echo "- NET=${NET_NAME}"
  echo "- JOIN_TOKEN_WORKER=${TOKEN_WORKER:-}"
  echo "- JOIN_TOKEN_MANAGER=${TOKEN_MANAGER:-}"
  echo
  echo "## 健康检查（简要）"
  echo "- master/readyz=$(code http://127.0.0.1:${MASTER_PORT:-32300}/readyz)"
  echo "- es/_cluster/health=$(code http://127.0.0.1:${ES_HTTP_PORT:-9200}/_cluster/health)"
  echo "- grafana/api/health=$(code http://127.0.0.1:${GRAFANA_PORT:-3000}/api/health)"
  echo "- prometheus/tcp=$([[ $(prom_ok; echo $?) == 0 ]] && echo 200 || echo 000)"
  echo "- alertmanager/api/v2/status=$(code http://127.0.0.1:${ALERTMANAGER_PORT:-9093}/api/v2/status)"
  echo "- kibana/api/status=$([[ $(kb_ok; echo $?) == 0 ]] && echo available || echo not-ready)"
 } > "$RPT"
 info "已生成报告: $RPT"
 info "安装完成。可将 cluster-info.env 分发给 Client-GPU 安装方。"
 docker exec argus-web-proxy nginx -t >/dev/null 2>&1 && docker exec argus-web-proxy nginx -s reload >/dev/null 2>&1 || true
--- a/deployment_new/templates/server/scripts/selfcheck.sh
+++ b/deployment_new/templates/server/scripts/selfcheck.sh
@ -0,0 +1,89 @@
 #!/usr/bin/env bash
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 log() { echo -e "\033[0;34m[CHECK]\033[0m $*"; }
 err() { echo -e "\033[0;31m[ERROR]\033[0m $*" >&2; }
 ENV_FILE="$ROOT/compose/.env"; [[ -f "$ENV_FILE" ]] && set -a && source "$ENV_FILE" && set +a
 wait_http() { local url="$1"; local attempts=${2:-120}; local i=1; while ((i<=attempts)); do curl -fsS "$url" >/dev/null 2>&1 && return 0; echo "[..] waiting $url ($i/$attempts)"; sleep 5; ((i++)); done; return 1; }
 code_for() { curl -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
 header_val() { curl -s -D - -o /dev/null "$@" | awk -F': ' 'BEGIN{IGNORECASE=1}$1=="Access-Control-Allow-Origin"{gsub("\r","",$2);print $2}'; }
 LOG_DIR="$ROOT/logs"; mkdir -p "$LOG_DIR" || true
 OUT_JSON="$LOG_DIR/selfcheck.json"; tmp=$(mktemp)
 ok=1
 log "checking overlay network"
 net_ok=false
 if docker network inspect "${ARGUS_OVERLAY_NET:-argus-sys-net}" >/dev/null 2>&1; then
  if docker network inspect "${ARGUS_OVERLAY_NET:-argus-sys-net}" | grep -q '"Driver": "overlay"'; then net_ok=true; fi
 fi
 [[ "$net_ok" == true ]] || ok=0
 log "checking Elasticsearch"
 es_ok=true; wait_http "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" 60 || { es_ok=false; ok=0; }
 log "checking Kibana"
 kb_code=$(code_for "http://localhost:${KIBANA_PORT:-5601}/api/status")
 kb_ok=false
 if [[ "$kb_code" == 200 ]]; then
  body=$(curl -sS "http://localhost:${KIBANA_PORT:-5601}/api/status" || true)
  echo "$body" | grep -q '"level"\s*:\s*"available"' && kb_ok=true
 fi
 [[ "$kb_ok" == true ]] || ok=0
 log "checking Master"
 master_ok=true; [[ $(code_for "http://localhost:${MASTER_PORT:-32300}/readyz") == 200 ]] || { master_ok=false; ok=0; }
 log "checking FTP"
 ftp_ok=true
 if docker ps --format '{{.Names}}' | grep -q '^argus-ftp$'; then
  docker exec argus-ftp sh -lc 'test -w /private/argus/ftp/share' >/dev/null 2>&1 || ftp_ok=false
 else ftp_ok=false; fi
 [[ "$ftp_ok" == true ]] || ok=0
 log "checking Prometheus"
 prom_ok=true; wait_http "http://localhost:${PROMETHEUS_PORT:-9090}/-/ready" 60 || { prom_ok=false; ok=0; }
 log "checking Grafana"
 gf_code=$(code_for "http://localhost:${GRAFANA_PORT:-3000}/api/health")
 gf_ok=false; if [[ "$gf_code" == 200 ]]; then body=$(curl -sS "http://localhost:${GRAFANA_PORT:-3000}/api/health" || true); echo "$body" | grep -q '"database"\s*:\s*"ok"' && gf_ok=true; fi
 [[ "$gf_ok" == true ]] || ok=0
 log "checking Alertmanager"
 am_ok=true; wait_http "http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status" 60 || { am_ok=false; ok=0; }
 log "checking Web-Proxy (CORS)"
 cors8084=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8084:-8084}/api/v2/status" || true)
 cors8085=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8085:-8085}/api/v1/master/nodes" || true)
 wp_ok=true
 [[ -n "$cors8084" && -n "$cors8085" ]] || wp_ok=false
 [[ "$wp_ok" == true ]] || ok=0
 # Report the real result of every check instead of hard-coded "true" values
 cat > "$tmp" <<JSON
 {
  "overlay_net": $net_ok,
  "es": $es_ok,
  "kibana": $kb_ok,
  "master_readyz": $master_ok,
  "ftp_share_writable": $ftp_ok,
  "prometheus": $prom_ok,
  "grafana": $gf_ok,
  "alertmanager": $am_ok,
  "web_proxy_cors": $wp_ok,
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
 }
 JSON
 mv "$tmp" "$OUT_JSON" 2>/dev/null || cp "$tmp" "$OUT_JSON"
 if [[ "$ok" == 1 ]]; then
  log "selfcheck OK -> $OUT_JSON"
  exit 0
 else
  err "selfcheck FAILED -> $OUT_JSON"
  exit 1
 fi
--- a/deployment_new/templates/server/scripts/status.sh
+++ b/deployment_new/templates/server/scripts/status.sh
@ -0,0 +1,9 @@
 #!/usr/bin/env bash
 set -euo pipefail
 ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 PKG_ROOT="$ROOT_DIR"
 ENV_FILE="$PKG_ROOT/compose/.env"
 COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
 if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
 PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
 docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps
--- a/deployment_new/templates/server/scripts/uninstall.sh
+++ b/deployment_new/templates/server/scripts/uninstall.sh
@ -0,0 +1,23 @@
 #!/usr/bin/env bash
 set -euo pipefail
 ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 PKG_ROOT="$ROOT_DIR"
 ENV_FILE="$PKG_ROOT/compose/.env"
 COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
 # load COMPOSE_PROJECT_NAME from env file if present
 if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
 PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
 err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
 # Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
 require_compose(){
  if docker compose version >/dev/null 2>&1; then COMPOSE=(docker compose); return 0; fi
  if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then COMPOSE=(docker-compose); return 0; fi
  err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
 }
 require_compose
 echo "[UNINSTALL] stopping compose (project=$PROJECT)"
 "${COMPOSE[@]}" -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" down --remove-orphans || true
 echo "[UNINSTALL] done"
--- a/doc/build-user-config.md
+++ b/doc/build-user-config.md
@ -0,0 +1,38 @@
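 # Argus Image Build UID/GID Configuration
 A single shared configuration file sets the runtime account for the Kibana, Elasticsearch, Bind, Master, and other containers, which avoids permission problems caused by mismatched UID/GID values when deploying across machines.
 ## Configuration entry points
 - The default configuration lives in `configs/build_user.conf`, for example: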
  ```bash
  UID=2133
  GID=2015
  ```
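 - For a local override, create `build_user.local.conf` under `configs/` with the same fields as the default file. When this file exists it is read instead of the default file (the two are not merged key by key). The file is listed in `.gitignore`, so it will not be committed by accident.
 - You can also force values by setting the `ARGUS_BUILD_UID` / `ARGUS_BUILD_GID` environment variables before running a script; these take the highest priority.
 ## Scope
 - `build/build_images.sh` reads the configuration when building the log/bind/master images and passes `--build-arg ARGUS_BUILD_UID/GID`; the console prints the UID/GID in use.
 - `src/master/scripts/build_images.sh` uses the same configuration, so building the master image on its own behaves consistently.
 - Each image's Dockerfile adjusts the in-container accounts (such as `elasticsearch`, `kibana`, `bind`, `argus`) to the UID/GID passed in, and exposes the values as environment variables visible at runtime.
 - The Master startup script runs its DNS logic first, then drops privileges to the configured account to run `gunicorn`, so files written under `/private/argus/**` get the correct owner.
 - The Log module test script `01_bootstrap.sh` fixes the ownership of mounted directories according to the configuration, so the end-to-end tests can run as any user.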
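 ## Usage recommendations
 1. After first cloning the repository no change is needed; the default UID/GID stay backward compatible.
 2. To use a new account in the target environment (for example `uid=4001,gid=4001`):
   - edit `configs/build_user.local.conf` and enter the new values;
   - log in as the new account and make sure it belongs to the host's `docker` group;
   - rerun `build/build_images.sh` or the relevant module's build script.
 3. After switching configuration, rerun the end-to-end scripts of the affected modules (such as `src/log/tests/scripts/01_bootstrap.sh`, `src/master/tests/scripts/00_e2e_test.sh`, `src/agent/tests/scripts/00_e2e_test.sh`) and verify that files under `/private/argus` are owned by the expected account.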
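 ## Troubleshooting
 - **Image build fails with `groupmod: GID already in use`**: the chosen GID already exists in the base image; pick an unused value, or remove the conflict first in a custom base image.
 - **Write permission denied at runtime inside a container**: check whether the host mount directory was created by the target UID/GID; if necessary, rerun the module's preparation script, such as `01_bootstrap.sh`.
 - **Old UID/GID still showing**: make sure the script did not pick up stale cached values; run `ARGUS_BUILD_UID=... ARGUS_BUILD_GID=... ./build/build_images.sh` to force an override.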
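The lookup order described in this document (environment variables, then the first configuration file that exists, then the built-in defaults) can be illustrated with a small Python sketch. It is only a model of the behaviour of `scripts/common/build_user.sh`, not the script itself: the function names and the idea of passing file contents as strings (with `None` standing for a missing file) are assumptions made so the example runs without touching the filesystem.

```python
import re

DEFAULT_UID, DEFAULT_GID = 2133, 2015  # built-in defaults taken from scripts/common/build_user.sh


def parse_conf(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def resolve_build_user(config_texts, env):
    """Hypothetical helper: config_texts holds file contents in lookup order, None = file missing."""
    uid, gid = str(DEFAULT_UID), str(DEFAULT_GID)
    for text in config_texts:
        if text is not None:
            conf = parse_conf(text)
            uid = conf.get("UID", uid)
            gid = conf.get("GID", gid)
            break  # the first file that exists wins; later files are not merged
    # Environment variables have the highest priority
    uid = env.get("ARGUS_BUILD_UID") or uid
    gid = env.get("ARGUS_BUILD_GID") or gid
    for name, value in (("UID", uid), ("GID", gid)):
        if not re.fullmatch(r"[0-9]+", value):
            raise ValueError(f"Invalid {name} {value!r}")
    return int(uid), int(gid)


local_conf = "UID=4001\nGID=4001\n"    # stands for configs/build_user.local.conf
default_conf = "UID=2133\nGID=2015\n"  # stands for configs/build_user.conf
print(resolve_build_user([local_conf, default_conf], {}))
print(resolve_build_user([None, default_conf], {"ARGUS_BUILD_UID": "5000"}))
```

The second call shows why an environment variable is the quickest way to test a different UID: it overrides the file value for that field only, while the GID still comes from the configuration file.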
--- a/doc/metric_lists.xlsx
+++ b/doc/metric_lists.xlsx
--- a/scripts/common/build_user.sh
+++ b/scripts/common/build_user.sh
@ -0,0 +1,120 @@
 #!/usr/bin/env bash
 set -euo pipefail
 # Shared helper to load Argus build user/group configuration.
 # Usage:
 #   source "${PROJECT_ROOT}/scripts/common/build_user.sh"
 #   load_build_user
 #   echo "$ARGUS_BUILD_UID:$ARGUS_BUILD_GID"
 ARGUS_BUILD_UID_DEFAULT=2133
 ARGUS_BUILD_GID_DEFAULT=2015
 shopt -s extglob
 _ARGUS_BUILD_USER_LOADED="${_ARGUS_BUILD_USER_LOADED:-0}"
 _argus_build_user_script_dir() {
  local dir
  dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
  echo "$dir"
 }
 argus_project_root() {
  local script_dir
  script_dir="$(_argus_build_user_script_dir)"
  (cd "$script_dir/../.." >/dev/null && pwd)
 }
 _argus_trim() {
  local value="$1"
  value="${value##+([[:space:]])}"
  value="${value%%+([[:space:]])}"
  printf '%s' "$value"
 }
 _argus_is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
 }
 _argus_read_user_from_files() {
  local uid_out_var="$1" gid_out_var="$2"; shift 2
  local uid_val="$ARGUS_BUILD_UID_DEFAULT" gid_val="$ARGUS_BUILD_GID_DEFAULT"
  local config
  for config in "$@"; do
    if [[ -f "$config" ]]; then
      while IFS= read -r raw_line || [[ -n "$raw_line" ]]; do
        local line key value
        line="${raw_line%%#*}"
        line="$(_argus_trim "${line}")"
        [[ -z "$line" ]] && continue
        if [[ "$line" != *=* ]]; then
          echo "[ARGUS build_user] Ignoring malformed line in $config: $raw_line" >&2
          continue
        fi
        key="${line%%=*}"
        value="${line#*=}"
        key="$(_argus_trim "$key")"
        value="$(_argus_trim "$value")"
        case "$key" in
          UID) uid_val="$value" ;;
          GID) gid_val="$value" ;;
          *) echo "[ARGUS build_user] Unknown key '$key' in $config" >&2 ;;
        esac
      done < "$config"
      break
    fi
  done
  printf -v "$uid_out_var" '%s' "$uid_val"
  printf -v "$gid_out_var" '%s' "$gid_val"
 }
 load_build_user_profile() {
  local profile="${1:-default}"
  if [[ "$_ARGUS_BUILD_USER_LOADED" == "1" ]]; then
    return 0
  fi
  local project_root uid gid
  project_root="$(argus_project_root)"
  case "$profile" in
    pkg)
      _argus_read_user_from_files uid gid \
        "$project_root/configs/build_user.pkg.conf" \
        "$project_root/configs/build_user.local.conf" \
        "$project_root/configs/build_user.conf"
      ;;
    default|*)
      _argus_read_user_from_files uid gid \
        "$project_root/configs/build_user.local.conf" \
        "$project_root/configs/build_user.conf"
      ;;
  esac
  if [[ -n "${ARGUS_BUILD_UID:-}" ]]; then uid="$ARGUS_BUILD_UID"; fi
  if [[ -n "${ARGUS_BUILD_GID:-}" ]]; then gid="$ARGUS_BUILD_GID"; fi
  if ! _argus_is_number "$uid"; then
    echo "[ARGUS build_user] Invalid UID '$uid'" >&2; return 1
  fi
  if ! _argus_is_number "$gid"; then
    echo "[ARGUS build_user] Invalid GID '$gid'" >&2; return 1
  fi
  export ARGUS_BUILD_UID="$uid"
  export ARGUS_BUILD_GID="$gid"
  _ARGUS_BUILD_USER_LOADED=1
 }
 load_build_user() {
  local profile="${ARGUS_BUILD_PROFILE:-default}"
  load_build_user_profile "$profile"
 }
 argus_build_user_args() {
  load_build_user
  printf '%s' "--build-arg ARGUS_BUILD_UID=${ARGUS_BUILD_UID} --build-arg ARGUS_BUILD_GID=${ARGUS_BUILD_GID}"
 }
 print_build_user() {
  load_build_user
  echo "ARGUS build user: UID=${ARGUS_BUILD_UID} GID=${ARGUS_BUILD_GID}"
 }
--- a/src/.gitignore
+++ b/src/.gitignore
@ -0,0 +1,2 @@
 __pycache__/
--- a/src/agent/.gitignore
+++ b/src/agent/.gitignore
@ -0,0 +1,6 @@
 build/
 *.egg-info/
 __pycache__/
 .env
 dist/
--- a/src/agent/README.md
+++ b/src/agent/README.md
@ -0,0 +1,78 @@
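 # Argus Agent Module
 Argus Agent is a lightweight Python process that registers the node with Argus Master, reports health data, and maintains local persistent state. The module is now packaged with PyInstaller as a standalone executable, so it can run directly in an ordinary container or virtual machine.
 ## Building the executable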
 ```bash
 cd src/agent
 ./scripts/build_binary.sh  # produces dist/argus-agent
 ```
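 By default the script runs PyInstaller inside a Docker container (`python:3.11-slim-bullseye`), so the resulting binary is compatible with glibc 2.31+ at runtime (which covers 2.35 environments). Notes on the build process:
 - Before each build, `build/` and `dist/` are cleaned and the virtual environment is recreated inside the container.
 - To use an intranet Python package mirror, pass environment variables such as `PIP_INDEX_URL`, `PIP_EXTRA_INDEX_URL`, and `PIP_TRUSTED_HOST`; the script forwards them to the container automatically.
 - If the host cannot run Docker, set `AGENT_BUILD_USE_DOCKER=0` to fall back to a local build; in that case the build must run on a machine with glibc ≤ 2.35.
 After the build, the script unpacks the key shared libraries under `build/compat_check/` and prints the highest `GLIBC_x.y` version they reference, for a quick compatibility check. If `libssl.so.3` / `libcrypto.so.3` are missing from the result, the binary will use the target host's own OpenSSL libraries and no further action is needed.
 For example: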
 ```bash
 strings build/compat_check/libpython*.so.1.0 | grep -Eo 'GLIBC_[0-9]+\.[0-9]+' | sort -Vu | tail -n1
 ```
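 If the build fails, the usual causes are that Docker is unavailable (use `AGENT_BUILD_USE_DOCKER=0` instead) or that the Python package mirror is unreachable (set the mirror environment variables above and retry).
 ## Runtime configuration
 The Agent no longer depends on a configuration file; every parameter is derived from environment variables and the hostname: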
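 | Variable | Required | Default | Description |
 | --- | --- | --- | --- |
 | `MASTER_ENDPOINT` | Yes | N/A | Master base URL; either `http://host:3000` or `host:3000` (`http://` is added automatically). |
 | `REPORT_INTERVAL_SECONDS` | No | `60` | Status report interval in seconds. Must be a positive integer. |
 | `AGENT_HOSTNAME` | No | `$(hostname)` | Overrides the hostname inside the container, useful for tests or special naming needs. |
 | `AGENT_ENV` | No | Derived from hostname | Runtime environment identifier (such as `dev`, `prod`). Must be set together with `AGENT_USER` and `AGENT_INSTANCE`. |
 | `AGENT_USER` | No | Derived from hostname | Owning user or team identifier. Must be set together with `AGENT_ENV` and `AGENT_INSTANCE`. |
 | `AGENT_INSTANCE` | No | Derived from hostname | Instance number or alias. Must be set together with `AGENT_ENV` and `AGENT_USER`. |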
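As a rough illustration of how the first two variables in the table are interpreted, the following sketch mirrors the validation rules stated there (scheme completion for `MASTER_ENDPOINT`, positive-integer check and default of 60 for `REPORT_INTERVAL_SECONDS`). The function names are chosen for this example, and `master.example` is a placeholder host.

```python
def normalise_master_endpoint(value):
    """Add http:// when no scheme is given and drop any trailing slash."""
    value = value.strip()
    if not value:
        raise ValueError("MASTER_ENDPOINT is required")
    if not value.startswith(("http://", "https://")):
        value = f"http://{value}"
    return value.rstrip("/")


def read_report_interval(raw, default=60):
    """Return the report interval in seconds; an unset or empty value falls back to the default."""
    if raw is None or raw.strip() == "":
        return default
    interval = int(raw)  # raises ValueError for non-integer input
    if interval <= 0:
        raise ValueError("REPORT_INTERVAL_SECONDS must be positive")
    return interval


print(normalise_master_endpoint("host:3000"))
print(normalise_master_endpoint(" https://master.example/ "))
print(read_report_interval(None))
```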
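 Resolution order for the hostname-derived metadata:
 1. If `AGENT_ENV` / `AGENT_USER` / `AGENT_INSTANCE` are all set, those values are used directly.
 2. Otherwise the existing `node.json` (the information returned by Master after a successful registration) is checked; if it contains `env` / `user` / `instance`, those values are reused.
 3. If neither source is available, the `env-user-instance` prefix is parsed from the hostname, following the legacy convention.
 4. If a complete result still cannot be obtained, the Agent fails to start and reports that the environment variables above must be provided.
 > Note: on first deployment, make sure the environment variables or the hostname provide complete information. After registration, the Agent writes the metadata returned by Master to `node.json`, so later restarts stay consistent without the environment variables being supplied again.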
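The resolution order above can be sketched in a few lines of Python. The regular expression is the one used in `src/agent/app/collector.py`; the function names, the dictionary-based inputs, and the sample hostname `dev-alice-node001-pod-0` are assumptions made for illustration. Note that the pattern expects at least one more `-`-separated segment after the instance part, so a bare `env-user-instance` hostname does not match.

```python
import re

# Same pattern as _HOSTNAME_PATTERN in src/agent/app/collector.py
HOSTNAME_PATTERN = re.compile(r"^([^-]+)-([^-]+)-([^-]+)-.*$")


def parse_hostname(hostname):
    """Return (env, user, instance) parsed from the hostname prefix, or None if it does not match."""
    match = HOSTNAME_PATTERN.match(hostname)
    if not match:
        return None
    return match.group(1), match.group(2), match.group(3)


def resolve_metadata(env, state, hostname):
    """Hypothetical helper: env is a dict of environment variables, state the parsed node.json or None."""
    # 1. All three environment variables set: use them directly
    fields = (env.get("AGENT_ENV"), env.get("AGENT_USER"), env.get("AGENT_INSTANCE"))
    if all(fields):
        return fields
    # 2. Metadata persisted in node.json
    state = state or {}
    meta = state.get("meta_data") or {}
    persisted = tuple(meta.get(key) or state.get(key) for key in ("env", "user", "instance"))
    if all(persisted):
        return persisted
    # 3. Hostname prefix; 4. otherwise fail
    parsed = parse_hostname(hostname)
    if parsed is None:
        raise ValueError("set AGENT_ENV/AGENT_USER/AGENT_INSTANCE or use a supported hostname")
    return parsed


print(resolve_metadata({}, None, "dev-alice-node001-pod-0"))
```

A partially set group of variables (for example only `AGENT_ENV`) is treated as absent, which is why the table requires all three to be set together.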
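 Derived paths:
 - Node information: `/private/argus/agent/<hostname>/node.json`
 - Submodule health directory: `/private/argus/agent/<hostname>/health/`
 Files in the health directory must be named `<module prefix>-*.json` (for example `log-fluentbit.json`, `metric-node-exporter.json`); each file's JSON content is placed unchanged under its file name (without the `.json` suffix) in the `health` field of the report.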
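To make the naming rule concrete, here is a minimal sketch of how a health directory turns into the `health` payload. It follows the behaviour of `src/agent/app/health_reader.py` (files without a `-` in the name are skipped, unreadable files are ignored), but the helper names and the file content `{"status": "healthy"}` are made up for the example; the real schema of each health file is defined by the submodule that writes it.

```python
import json
import tempfile
from pathlib import Path


def read_health(directory):
    """Map each <prefix>-*.json file stem to its parsed JSON content."""
    result = {}
    for path in sorted(Path(directory).glob("*.json")):
        if "-" not in path.stem:
            continue  # files without a module prefix are skipped
        try:
            result[path.stem] = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or invalid files are ignored in this sketch
    return result


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # A correctly named health file (content is hypothetical)
        Path(tmp, "log-fluentbit.json").write_text(json.dumps({"status": "healthy"}), encoding="utf-8")
        # No module prefix, so this file is not reported
        Path(tmp, "notes.json").write_text("{}", encoding="utf-8")
        return read_health(tmp)


print(demo())
```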
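 ## Logging and persistence
 - The Agent emits structured logs at key points such as successful registration, status reports, and retries after errors, which makes aggregation and analysis easier.
 - `node.json` stores the latest node object returned by Master, so the existing node ID is reused after a restart.
 ## End-to-end tests
 The repository provides a Docker Compose test stack (master + ubuntu containers):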
 ```bash
 cd src/agent/tests
 ./scripts/00_e2e_test.sh
 ```
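 The test script will:
 1. Build the master image and the agent executable.
 2. Start the agent container from `ubuntu:24.04`, injecting `MASTER_ENDPOINT` and `REPORT_INTERVAL_SECONDS` through environment variables.
 3. Verify registration, health reporting, `nodes.json` generation, the statistics API, and the re-registration flow for "container restart + IP change".
 4. Clean up `tests/private/` and the temporary container network.
 For a real deployment, make `dist/argus-agent` and the health directory available on the target host (for example by mounting them) and set the environment variables from the table above.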
--- a/src/agent/app/init.py
+++ b/src/agent/app/init.py
--- a/src/agent/app/client.py
+++ b/src/agent/app/client.py
@ -0,0 +1,60 @@
 from __future__ import annotations
 import json
 from typing import Any, Dict, Optional
 import requests
 from .log import get_logger
 LOGGER = get_logger("argus.agent.client")
 class MasterAPIError(Exception):
    def __init__(self, message: str, status_code: int, payload: Optional[Dict[str, Any]] = None) -> None:
        super().__init__(message)
        self.status_code = status_code
        self.payload = payload or {}
 class AgentClient:
    def __init__(self, base_url: str, *, timeout: int = 10) -> None:
        self._base_url = base_url.rstrip("/")
        self._timeout = timeout
        self._session = requests.Session()
    def register_node(self, body: Dict[str, Any]) -> Dict[str, Any]:
        """Call the master registration endpoint and return the node object."""
        url = f"{self._base_url}/api/v1/master/nodes"
        response = self._session.post(url, json=body, timeout=self._timeout)
        return self._parse_response(response, "Failed to register node")
    def update_status(self, node_id: str, body: Dict[str, Any]) -> Dict[str, Any]:
        """Report health information; master updates last_report."""
        url = f"{self._base_url}/api/v1/master/nodes/{node_id}/status"
        response = self._session.put(url, json=body, timeout=self._timeout)
        return self._parse_response(response, "Failed to update node status")
    def _parse_response(self, response: requests.Response, error_prefix: str) -> Dict[str, Any]:
        content_type = response.headers.get("Content-Type", "")
        payload: Dict[str, Any] | None = None
        if "application/json" in content_type:
            try:
                payload = response.json()
            except ValueError:  # covers both json and simplejson decode errors raised via requests
                LOGGER.warning("Response contained invalid JSON", extra={"status": response.status_code})
        if response.status_code >= 400:
            message = payload.get("error") if isinstance(payload, dict) else response.text
            raise MasterAPIError(
                f"{error_prefix}: {message}",
                status_code=response.status_code,
                payload=payload if isinstance(payload, dict) else None,
            )
        if payload is None:
            try:
                payload = response.json()
            except ValueError as exc:  # covers both json and simplejson decode errors raised via requests
                raise MasterAPIError("Master returned non-JSON payload", response.status_code) from exc
        return payload
--- a/src/agent/app/collector.py
+++ b/src/agent/app/collector.py
@ -0,0 +1,262 @@
 from __future__ import annotations
 import os
 import re
 import socket
 import subprocess
 import ipaddress
 from pathlib import Path
 from typing import Any, Dict
 from .config import AgentConfig
 from .log import get_logger
 LOGGER = get_logger("argus.agent.collector")
 _HOSTNAME_PATTERN = re.compile(r"^([^-]+)-([^-]+)-([^-]+)-.*$")
 def collect_metadata(config: AgentConfig) -> Dict[str, Any]:
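    """Collect the static information needed for node registration, with smarter IP selection.
    Rules (highest priority first):
    1) AGENT_PUBLISH_IP, if set;
    2) interface scan: skip AGENT_EXCLUDE_IFACES, prefer AGENT_PREFER_NET_CIDRS;
    3) hostname A record (only if it falls inside a preferred CIDR);
    4) default-route fallback (UDP socket trick).
    Also published: overlay_ip / gwbridge_ip / interfaces, for use by Master and for diagnostics.
    """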
    hostname = config.hostname
    prefer_cidrs = _read_cidrs_env(
        os.environ.get("AGENT_PREFER_NET_CIDRS", "10.0.0.0/8,172.31.0.0/16")
    )
    exclude_ifaces = _read_csv_env(
        os.environ.get("AGENT_EXCLUDE_IFACES", "docker_gwbridge,lo")
    )
    # interface inventory
    interfaces = _list_global_ipv4_addrs()
    if exclude_ifaces:
        interfaces = [it for it in interfaces if it[0] not in set(exclude_ifaces)]
    # resolve hostname candidates
    host_ips = _resolve_hostname_ips(hostname)
    selected_ip, overlay_ip, gwbridge_ip = _select_publish_ips(
        interfaces=interfaces,
        host_ips=host_ips,
        prefer_cidrs=prefer_cidrs,
    )
    meta: Dict[str, Any] = {
        "hostname": hostname,
        "ip": selected_ip,  # required field; AGENT_PUBLISH_IP is already applied in _select_publish_ips
        "overlay_ip": overlay_ip or selected_ip,
        "gwbridge_ip": gwbridge_ip,
        "interfaces": [
            {"iface": name, "ip": ip} for name, ip in interfaces
        ],
        "env": config.environment,
        "user": config.user,
        "instance": config.instance,
        "cpu_number": _detect_cpu_count(),
        "memory_in_bytes": _detect_memory_bytes(),
        "gpu_number": _detect_gpu_count(),
    }
    return meta
 def _parse_hostname(hostname: str) -> tuple[str, str, str]:
    """Split the hostname according to the agreed env-user-instance prefix."""
    match = _HOSTNAME_PATTERN.match(hostname)
    if not match:
        LOGGER.warning("Hostname does not match expected pattern", extra={"hostname": hostname})
        return "", "", ""
    return match.group(1), match.group(2), match.group(3)
 def _detect_cpu_count() -> int:
    count = os.cpu_count()
    return count if count is not None else 0
 def _detect_memory_bytes() -> int:
    """Prefer the cgroup memory limit (memory.max); fall back to /proc/meminfo on failure."""
    cgroup_path = Path("/sys/fs/cgroup/memory.max")
    try:
        raw = cgroup_path.read_text(encoding="utf-8").strip()
        if raw and raw != "max":
            return int(raw)
    except FileNotFoundError:
        LOGGER.debug("cgroup memory.max not found, falling back to /proc/meminfo")
    except ValueError:
        LOGGER.warning("Failed to parse memory.max, falling back", extra={"value": raw})
    try:
        with open("/proc/meminfo", "r", encoding="utf-8") as handle:
            for line in handle:
                if line.startswith("MemTotal:"):
                    parts = line.split()
                    if len(parts) >= 2:
                        return int(parts[1]) * 1024
    except FileNotFoundError:
        LOGGER.error("/proc/meminfo not found; defaulting memory to 0")
    return 0
 def _detect_gpu_count() -> int:
    """Detect the number of GPUs; default to 0 when detection is not possible."""
    try:
        proc = subprocess.run(
            ["nvidia-smi", "-L"],
            check=False,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            timeout=5,
        )
    except FileNotFoundError:
        LOGGER.debug("nvidia-smi not available; assuming 0 GPUs")
        return 0
    except subprocess.SubprocessError as exc:
        LOGGER.warning("nvidia-smi invocation failed", extra={"error": str(exc)})
        return 0
    if proc.returncode != 0:
        LOGGER.debug("nvidia-smi returned non-zero", extra={"stderr": proc.stderr.strip()})
        return 0
    count = sum(1 for line in proc.stdout.splitlines() if line.strip())
    return count
 def _detect_ip_address() -> str:
    """Legacy helper kept as the last-resort fallback: default-route source address → hostname resolution → 127.0.0.1."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.connect(("8.8.8.8", 80))
            return sock.getsockname()[0]
    except OSError:
        LOGGER.debug("UDP socket trick failed; falling back to hostname lookup")
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        LOGGER.warning("Unable to resolve hostname to IP; defaulting to 127.0.0.1")
        return "127.0.0.1"
 def _read_csv_env(raw: str | None) -> list[str]:
    if not raw:
        return []
    return [x.strip() for x in raw.split(",") if x.strip()]
 def _read_cidrs_env(raw: str | None) -> list[ipaddress.IPv4Network]:
    cidrs: list[ipaddress.IPv4Network] = []
    for item in _read_csv_env(raw):
        try:
            net = ipaddress.ip_network(item, strict=False)
            if isinstance(net, (ipaddress.IPv4Network,)):
                cidrs.append(net)
        except ValueError:
            LOGGER.warning("Ignoring invalid CIDR in AGENT_PREFER_NET_CIDRS", extra={"cidr": item})
    return cidrs
 def _list_global_ipv4_addrs() -> list[tuple[str, str]]:
    """List global-scope IPv4 addresses as (iface, ip) pairs.
    Requires iproute2: ip -4 -o addr show scope global
    """
    results: list[tuple[str, str]] = []
    try:
        proc = subprocess.run(
            ["sh", "-lc", "ip -4 -o addr show scope global | awk '{print $2, $4}'"],
            check=False,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            timeout=3,
        )
        if proc.returncode == 0:
            for line in proc.stdout.splitlines():
                line = line.strip()
                if not line:
                    continue
                parts = line.split()
                if len(parts) != 2:
                    continue
                iface, cidr = parts
                ip = cidr.split("/")[0]
                try:
                    ipaddress.IPv4Address(ip)
                except ValueError:
                    continue
                results.append((iface, ip))
    except Exception as exc:  # pragma: no cover - defensive
        LOGGER.debug("Failed to list interfaces", extra={"error": str(exc)})
    return results
 def _resolve_hostname_ips(name: str) -> list[str]:
    ips: list[str] = []
    try:
        infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
        for info in infos:
            ip = info[4][0]
            if ip not in ips:
                ips.append(ip)
    except OSError:
        pass
    return ips
 def _pick_by_cidrs(candidates: list[str], prefer_cidrs: list[ipaddress.IPv4Network]) -> str | None:
    for net in prefer_cidrs:
        for ip in candidates:
            try:
                if ipaddress.ip_address(ip) in net:
                    return ip
            except ValueError:
                continue
    return None
 def _select_publish_ips(
    *,
    interfaces: list[tuple[str, str]],
    host_ips: list[str],
    prefer_cidrs: list[ipaddress.IPv4Network],
 ) -> tuple[str, str | None, str | None]:
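    """Return (selected_ip, overlay_ip, gwbridge_ip).
    - overlay_ip: first interface address inside prefer_cidrs (10.0/8 before 172.31/16).
    - gwbridge_ip: recorded when an address in 172.22/16 is present.
    - selected_ip: AGENT_PUBLISH_IP first; otherwise overlay_ip; otherwise a hostname A record inside prefer_cidrs; otherwise the default-route fallback.
    """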
    # detect gwbridge (172.22/16)
    gwbridge_net = ipaddress.ip_network("172.22.0.0/16")
    gwbridge_ip = None
    for _, ip in interfaces:
        try:
            if ipaddress.ip_address(ip) in gwbridge_net:
                gwbridge_ip = ip
                break
        except ValueError:
            continue
    # overlay candidate from interfaces by prefer cidrs
    iface_ips = [ip for _, ip in interfaces]
    overlay_ip = _pick_by_cidrs(iface_ips, prefer_cidrs)
    # hostname A records filtered by prefer cidrs
    host_pref = _pick_by_cidrs(host_ips, prefer_cidrs)
    env_ip = os.environ.get("AGENT_PUBLISH_IP")
    if env_ip:
        selected = env_ip
    else:
        selected = overlay_ip or host_pref or _detect_ip_address()
    return selected, overlay_ip, gwbridge_ip
--- a/src/agent/app/config.py
+++ b/src/agent/app/config.py
@ -0,0 +1,141 @@
 from __future__ import annotations
 import os
 import socket
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Final
 from .state import load_node_state
 from .version import VERSION
 from .log import get_logger
 DEFAULT_REPORT_INTERVAL_SECONDS: Final[int] = 60
 LOGGER = get_logger("argus.agent.config")
 @dataclass(frozen=True)
 class AgentConfig:
    hostname: str
    environment: str
    user: str
    instance: str
    node_file: str
    version: str
    master_endpoint: str
    report_interval_seconds: int
    health_dir: str
    request_timeout_seconds: int = 10
 def _normalise_master_endpoint(value: str) -> str:
    value = value.strip()
    if not value:
        raise ValueError("MASTER_ENDPOINT environment variable is required")
    if not value.startswith("http://") and not value.startswith("https://"):
        value = f"http://{value}"
    return value.rstrip("/")
 def _read_report_interval(raw_value: str | None) -> int:
    if raw_value is None or raw_value.strip() == "":
        return DEFAULT_REPORT_INTERVAL_SECONDS
    try:
        interval = int(raw_value)
    except ValueError as exc:
        raise ValueError("REPORT_INTERVAL_SECONDS must be an integer") from exc
    if interval <= 0:
        raise ValueError("REPORT_INTERVAL_SECONDS must be positive")
    return interval
 def _resolve_hostname() -> str:
    return os.environ.get("AGENT_HOSTNAME") or socket.gethostname()
 def _load_metadata_from_state(node_file: str) -> tuple[str, str, str] | None:
    state = load_node_state(node_file)
    if not state:
        return None
    meta = state.get("meta_data") or {}
    env = meta.get("env") or state.get("env")
    user = meta.get("user") or state.get("user")
    instance = meta.get("instance") or state.get("instance")
    if env and user and instance:
        LOGGER.debug("Metadata resolved from node state", extra={"node_file": node_file})
        return env, user, instance
    LOGGER.warning(
        "node.json missing metadata fields; ignoring",
        extra={"node_file": node_file, "meta_data": meta},
    )
    return None
 def _resolve_metadata_fields(hostname: str, node_file: str) -> tuple[str, str, str]:
    env = os.environ.get("AGENT_ENV")
    user = os.environ.get("AGENT_USER")
    instance = os.environ.get("AGENT_INSTANCE")
    if env and user and instance:
        return env, user, instance
    if any([env, user, instance]):
        LOGGER.warning(
            "Incomplete metadata environment variables; falling back to persisted metadata",
            extra={
                "has_env": bool(env),
                "has_user": bool(user),
                "has_instance": bool(instance),
            },
        )
    state_metadata = _load_metadata_from_state(node_file)
    if state_metadata is not None:
        return state_metadata
    from .collector import _parse_hostname  # Local import to avoid circular dependency
    env, user, instance = _parse_hostname(hostname)
    if not all([env, user, instance]):
        raise ValueError(
            "Failed to determine metadata fields; set AGENT_ENV/USER/INSTANCE or use supported hostname pattern"
        )
    return env, user, instance
 def load_config() -> AgentConfig:
    """Derive the configuration from environment variables; the external configuration file is no longer needed."""
    hostname = _resolve_hostname()
    node_file = f"/private/argus/agent/{hostname}/node.json"
    environment, user, instance = _resolve_metadata_fields(hostname, node_file)
    health_dir = f"/private/argus/agent/{hostname}/health/"
    master_endpoint_env = os.environ.get("MASTER_ENDPOINT")
    if master_endpoint_env is None:
        raise ValueError("MASTER_ENDPOINT environment variable is not set")
    master_endpoint = _normalise_master_endpoint(master_endpoint_env)
    report_interval_seconds = _read_report_interval(os.environ.get("REPORT_INTERVAL_SECONDS"))
    Path(node_file).parent.mkdir(parents=True, exist_ok=True)
    Path(health_dir).mkdir(parents=True, exist_ok=True)
    return AgentConfig(
        hostname=hostname,
        environment=environment,
        user=user,
        instance=instance,
        node_file=node_file,
        version=VERSION,
        master_endpoint=master_endpoint,
        report_interval_seconds=report_interval_seconds,
        health_dir=health_dir,
    )
--- a/src/agent/app/health_reader.py
+++ b/src/agent/app/health_reader.py
@ -0,0 +1,32 @@
 from __future__ import annotations
 import json
 from pathlib import Path
 from typing import Any, Dict
 from .log import get_logger
 LOGGER = get_logger("argus.agent.health")
 def read_health_directory(path: str) -> Dict[str, Any]:
    """Read every <prefix>-*.json file in the directory and return a mapping of file stem to parsed JSON."""
    result: Dict[str, Any] = {}
    directory = Path(path)
    if not directory.exists():
        LOGGER.debug("Health directory does not exist", extra={"path": str(directory)})
        return result
    for health_file in sorted(directory.glob("*.json")):
        if "-" not in health_file.stem:
            LOGGER.debug("Skipping non-prefixed health file", extra={"file": health_file.name})
            continue
        try:
            with health_file.open("r", encoding="utf-8") as handle:
                content = json.load(handle)
            result[health_file.stem] = content
        except json.JSONDecodeError as exc:
            LOGGER.warning("Failed to parse health file", extra={"file": health_file.name, "error": str(exc)})
        except OSError as exc:
            LOGGER.warning("Failed to read health file", extra={"file": health_file.name, "error": str(exc)})
    return result
--- a/src/agent/app/log.py
+++ b/src/agent/app/log.py
@ -0,0 +1,18 @@
 from __future__ import annotations
 import logging
 import os
 _LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s - %(message)s"
 def setup_logging() -> None:
    level_name = os.environ.get("AGENT_LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, logging.INFO)
    logging.basicConfig(level=level, format=_LOG_FORMAT)
 def get_logger(name: str) -> logging.Logger:
    setup_logging()
    return logging.getLogger(name)
--- a/src/agent/app/main.py
+++ b/src/agent/app/main.py
@ -0,0 +1,163 @@
 from __future__ import annotations
 import signal
 import time
 from datetime import datetime, timezone
 from typing import Optional
 from .client import AgentClient, MasterAPIError
 from .collector import collect_metadata
 from .config import AgentConfig, load_config
 from .health_reader import read_health_directory
 from .log import get_logger, setup_logging
 from .state import clear_node_state, load_node_state, save_node_state
 LOGGER = get_logger("argus.agent")
 def _current_timestamp() -> str:
    return datetime.now(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")
 class StopSignal:
    def __init__(self) -> None:
        self._stop = False
    def set(self, *_args) -> None:  # type: ignore[override]
        self._stop = True
    def is_set(self) -> bool:
        return self._stop
 def main(argv: Optional[list[str]] = None) -> int:  # noqa: ARG001 - signature kept for entry-point compatibility
    setup_logging()
    stop_signal = StopSignal()
    signal.signal(signal.SIGTERM, stop_signal.set)
    signal.signal(signal.SIGINT, stop_signal.set)
    try:
        config = load_config()
    except Exception as exc:
        LOGGER.error("Failed to load configuration", extra={"error": str(exc)})
        return 1
    LOGGER.info(
        "Agent starting",
        extra={
            "hostname": config.hostname,
            "master_endpoint": config.master_endpoint,
            "node_file": config.node_file,
        },
    )
    client = AgentClient(config.master_endpoint, timeout=config.request_timeout_seconds)
    node_state = load_node_state(config.node_file) or {}
    node_id = node_state.get("id")
    # Register with master (re-registration supported); retry on failure
    register_response = _register_with_retry(client, config, node_id, stop_signal)
    if register_response is None:
        LOGGER.info("Registration aborted due to shutdown signal")
        return 0
    node_id = register_response.get("id")
    if not node_id:
        LOGGER.error("Master did not return node id; aborting")
        return 1
    save_node_state(config.node_file, register_response)
    LOGGER.info("Entering status report loop", extra={"node_id": node_id})
    _status_loop(client, config, node_id, stop_signal)
    return 0
 def _register_with_retry(
    client: AgentClient,
    config: AgentConfig,
    node_id: Optional[str],
    stop_signal: StopSignal,
 ):
    backoff = 5
    while not stop_signal.is_set():
        payload = {
            "name": config.hostname,
            "type": "agent",
            "meta_data": collect_metadata(config),
            "version": config.version,
        }
        if node_id:
            payload["id"] = node_id
        try:
            response = client.register_node(payload)
            LOGGER.info("Registration successful", extra={"node_id": response.get("id")})
            save_node_state(config.node_file, response)
            return response
        except MasterAPIError as exc:
            if exc.status_code == 404 and node_id:
                LOGGER.warning(
                    "Master does not recognise node id; clearing local node state",
                    extra={"node_id": node_id},
                )
                clear_node_state(config.node_file)
                node_id = None
            elif exc.status_code == 500 and node_id:
                # An id/name mismatch usually indicates a configuration problem; log it and keep retrying
                LOGGER.error(
                    "Master rejected node due to id/name mismatch; will retry",
                    extra={"node_id": node_id},
                )
            else:
                LOGGER.error("Registration failed", extra={"status_code": exc.status_code, "error": str(exc)})
            time.sleep(min(backoff, 60))
            backoff = min(backoff * 2, 60)
        except Exception as exc:  # pragma: no cover - defensive
            LOGGER.exception("Unexpected error during registration", extra={"error": str(exc)})
            time.sleep(min(backoff, 60))
            backoff = min(backoff * 2, 60)
    return None
 def _status_loop(
    client: AgentClient,
    config: AgentConfig,
    node_id: str,
    stop_signal: StopSignal,
 ) -> None:
    interval = config.report_interval_seconds
    while not stop_signal.is_set():
        timestamp = _current_timestamp()
        health_payload = read_health_directory(config.health_dir)
        body = {
            "timestamp": timestamp,
            "health": health_payload,
        }
        try:
            response = client.update_status(node_id, body)
            LOGGER.info(
                "Status report succeeded",
                extra={"node_id": node_id, "health_keys": list(health_payload.keys())},
            )
            save_node_state(config.node_file, response)
        except MasterAPIError as exc:
            # Keep the loop running and retry on the next report interval
            LOGGER.error(
                "Failed to report status",
                extra={"status_code": exc.status_code, "error": str(exc)},
            )
        except Exception as exc:  # pragma: no cover - defensive
            LOGGER.exception("Unexpected error during status report", extra={"error": str(exc)})
        for _ in range(interval):
            if stop_signal.is_set():
                break
            time.sleep(1)
    LOGGER.info("Stop signal received; exiting status loop")
 if __name__ == "__main__":
    sys.exit(main())
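
Note on the retry loop above: `_register_with_retry` sleeps with `time.sleep`, so a SIGTERM that arrives during a backoff is only noticed once the sleep ends (up to 60 s). The following is a minimal sketch of an interruptible variant, assuming the stop signal can be represented by a `threading.Event`; `retry_with_backoff` and its parameters are hypothetical names and are not part of this change.

```python
import threading
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    attempt: Callable[[], T],
    stop_event: threading.Event,
    initial: float = 5.0,
    cap: float = 60.0,
) -> Optional[T]:
    """Call attempt() until it succeeds or stop_event is set.

    Returns the result of attempt(), or None if a stop was requested.
    """
    delay = initial
    while not stop_event.is_set():
        try:
            return attempt()
        except Exception:
            # Event.wait() returns True as soon as the event is set,
            # so a shutdown request ends the backoff immediately.
            if stop_event.wait(min(delay, cap)):
                break
            delay = min(delay * 2, cap)
    return None
```

With this shape the doubling delay and the 60 s cap stay the same as in the loop above; only the sleep is replaced by a wait on the event.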
--- a/src/agent/app/state.py
+++ b/src/agent/app/state.py
@ -0,0 +1,44 @@
 from __future__ import annotations
 import json
 import os
 import tempfile
 from pathlib import Path
 from typing import Any, Dict, Optional
 from .log import get_logger
 LOGGER = get_logger("argus.agent.state")
 def load_node_state(path: str) -> Optional[Dict[str, Any]]:
    """Load the local node.json so the node keeps its previous ID across container restarts."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None
    except json.JSONDecodeError as exc:
        LOGGER.warning("node.json is invalid JSON; ignoring", extra={"error": str(exc)})
        return None
 def save_node_state(path: str, data: Dict[str, Any]) -> None:
    """Write node.json atomically so concurrent readers never see partial data."""
    directory = Path(path).parent
    directory.mkdir(parents=True, exist_ok=True)
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False, encoding="utf-8") as tmp:
        json.dump(data, tmp, separators=(",", ":"))
        tmp.flush()
        os.fsync(tmp.fileno())
        temp_path = tmp.name
    os.replace(temp_path, path)
 def clear_node_state(path: str) -> None:
    try:
        os.remove(path)
    except FileNotFoundError:
        return
    except OSError as exc:
        LOGGER.warning("Failed to remove node state file", extra={"error": str(exc), "path": path})
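
Note on `save_node_state` above: the write-to-temp-then-`os.replace` pattern is what makes the update atomic, but if serialization fails the temporary file is left behind. The following is a minimal sketch of the same pattern with cleanup on failure; `atomic_write_json` is a hypothetical helper name used only for illustration.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def atomic_write_json(path: str, data: Dict[str, Any]) -> None:
    """Write JSON to a temp file in the target directory, then rename it into place."""
    directory = Path(path).parent
    directory.mkdir(parents=True, exist_ok=True)
    # The temp file must live on the same filesystem as the target,
    # otherwise os.replace() cannot perform an atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, separators=(",", ":"))
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial temp file so failed writes do not accumulate.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Readers either see the previous complete file or the new complete file, never a half-written one.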
--- a/src/agent/app/version.py
+++ b/src/agent/app/version.py
@ -0,0 +1,69 @@
 from __future__ import annotations
 import os
 import sys
 from pathlib import Path
 from typing import Optional
 import importlib.metadata
 try:
    import tomllib
 except ModuleNotFoundError:  # pragma: no cover
    import tomli as tomllib  # type: ignore[no-redef]
 def _candidate_paths() -> list[Path]:
    paths = []
    bundle_dir: Optional[str] = getattr(sys, "_MEIPASS", None)
    if bundle_dir:
        paths.append(Path(bundle_dir) / "pyproject.toml")
    paths.append(Path(__file__).resolve().parent.parent / "pyproject.toml")
    paths.append(Path(__file__).resolve().parent / "pyproject.toml")
    paths.append(Path.cwd() / "pyproject.toml")
    return paths
 def _read_from_pyproject() -> Optional[str]:
    for path in _candidate_paths():
        if not path.exists():
            continue
        try:
            with path.open("rb") as handle:
                data = tomllib.load(handle)
        except (OSError, tomllib.TOMLDecodeError):
            continue
        project = data.get("project")
        if isinstance(project, dict):
            version = project.get("version")
            if isinstance(version, str):
                return version
        tool = data.get("tool")
        if isinstance(tool, dict):
            argus_cfg = tool.get("argus")
            if isinstance(argus_cfg, dict):
                version = argus_cfg.get("version")
                if isinstance(version, str):
                    return version
    return None
 def _detect_version() -> str:
    try:
        return importlib.metadata.version("argus-agent")
    except importlib.metadata.PackageNotFoundError:
        pass
    override = os.environ.get("AGENT_VERSION_OVERRIDE")
    if override:
        return override
    fallback = _read_from_pyproject()
    if fallback:
        return fallback
    return "0.0.0"
 VERSION: str = _detect_version()
 def get_version() -> str:
    return VERSION
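
Note on `_detect_version` above: installed package metadata is consulted before `AGENT_VERSION_OVERRIDE`, so the override only takes effect when `argus-agent` is not installed as a package. If the override is meant to win in all cases, one possible ordering is sketched below. This is a variant, not the behavior of the code in this change; `detect_version` and its parameters are hypothetical.

```python
import importlib.metadata
import os
from typing import Mapping, Optional


def detect_version(
    package: str,
    env_var: str,
    fallback: str = "0.0.0",
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve a version string: env override, then installed metadata, then fallback."""
    env = os.environ if environ is None else environ
    override = env.get(env_var)
    if override:
        return override
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return fallback
```

The pyproject.toml lookup from `_read_from_pyproject` is left out of the sketch for brevity; it would slot in before the final fallback.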
--- a/src/agent/entry.py
+++ b/src/agent/entry.py
@ -0,0 +1,10 @@
 #!/usr/bin/env python3
 from __future__ import annotations
 import sys
 from app.main import main as agent_main
 if __name__ == "__main__":
    sys.exit(agent_main())
--- a/src/agent/pyproject.toml
+++ b/src/agent/pyproject.toml
@ -0,0 +1,19 @@
 [project]
 name = "argus-agent"
 version = "1.1.0"
 description = "Argus agent binary"
 readme = "README.md"
 requires-python = ">=3.11"
 dependencies = [
    "requests==2.31.0"
 ]
 [build-system]
 requires = ["setuptools>=69", "wheel"]
 build-backend = "setuptools.build_meta"
 [tool.argus]
 entry = "app.main:main"
 [tool.setuptools]
 packages = ["app"]
--- a/src/agent/scripts/agent_deployment_verify.sh
+++ b/src/agent/scripts/agent_deployment_verify.sh
@ -0,0 +1,690 @@
 #!/usr/bin/env bash
 set -euo pipefail
 LOG_PREFIX="[AGENT-VERIFY]"
 MASTER_ENDPOINT_DEFAULT=""
 AGENT_DATA_ROOT_DEFAULT="/private/argus/agent"
 AGENT_ETC_ROOT_DEFAULT="/private/argus/etc"
 REPORT_INTERVAL_DEFAULT="2"
 ALLOW_CONFIG_TOUCH="false"
 KEEP_TEST_HEALTH="false"
 log_info() {
  echo "${LOG_PREFIX} INFO  $*"
 }
 log_warn() {
  echo "${LOG_PREFIX} WARN  $*" >&2
 }
 log_error() {
  echo "${LOG_PREFIX} ERROR $*" >&2
 }
 usage() {
  cat <<'USAGE'
 Usage: agent_deployment_verify.sh [options]
 Options:
  --allow-config-touch   Enable optional config PUT dry-run check.
  --keep-test-health     Keep the temporary verify health file after checks.
  -h, --help             Show this help message.
 Environment variables:
  MASTER_ENDPOINT        (required) Master API base endpoint, e.g. http://master:3000
  AGENT_DATA_ROOT        (default: /private/argus/agent)
  AGENT_ETC_ROOT         (default: /private/argus/etc)
  VERIFY_HOSTNAME        (default: output of hostname)
  REPORT_INTERVAL_SECONDS (default: 2) Agent report interval in seconds
 USAGE
 }
 while [[ $# -gt 0 ]]; do
  case "$1" in
    --allow-config-touch)
      ALLOW_CONFIG_TOUCH="true"
      shift
      ;;
    --keep-test-health)
      KEEP_TEST_HEALTH="true"
      shift
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      log_error "Unknown option: $1"
      usage >&2
      exit 2
      ;;
  esac
 done
 MASTER_ENDPOINT="${MASTER_ENDPOINT:-$MASTER_ENDPOINT_DEFAULT}"
 AGENT_DATA_ROOT="${AGENT_DATA_ROOT:-$AGENT_DATA_ROOT_DEFAULT}"
 AGENT_ETC_ROOT="${AGENT_ETC_ROOT:-$AGENT_ETC_ROOT_DEFAULT}"
 VERIFY_HOSTNAME="${VERIFY_HOSTNAME:-$(hostname)}"
 REPORT_INTERVAL_SECONDS="${REPORT_INTERVAL_SECONDS:-$REPORT_INTERVAL_DEFAULT}"
 if [[ -z "$MASTER_ENDPOINT" ]]; then
  log_error "MASTER_ENDPOINT is required"
  exit 2
 fi
 if ! [[ "$REPORT_INTERVAL_SECONDS" =~ ^[0-9]+$ ]] || [[ "$REPORT_INTERVAL_SECONDS" -le 0 ]]; then
  log_warn "Invalid REPORT_INTERVAL_SECONDS='$REPORT_INTERVAL_SECONDS', fallback to $REPORT_INTERVAL_DEFAULT"
  REPORT_INTERVAL_SECONDS="$REPORT_INTERVAL_DEFAULT"
 fi
 normalize_endpoint() {
  local endpoint="$1"
  if [[ "$endpoint" != http://* && "$endpoint" != https://* ]]; then
    endpoint="http://$endpoint"
  fi
  endpoint="${endpoint%/}"
  echo "$endpoint"
 }
 MASTER_BASE="$(normalize_endpoint "$MASTER_ENDPOINT")"
 NODE_DIR="$AGENT_DATA_ROOT/$VERIFY_HOSTNAME"
 NODE_JSON="$NODE_DIR/node.json"
 HEALTH_DIR="$NODE_DIR/health"
 DNS_CONF="$AGENT_ETC_ROOT/dns.conf"
 UPDATE_SCRIPT="$AGENT_ETC_ROOT/update-dns.sh"
 declare -a RESULTS_PASS=()
 declare -a RESULTS_WARN=()
 declare -a RESULTS_FAIL=()
 add_result() {
  local level="$1" message="$2"
  case "$level" in
    PASS)
      RESULTS_PASS+=("$message")
      log_info "$message"
      ;;
    WARN)
      RESULTS_WARN+=("$message")
      log_warn "$message"
      ;;
    FAIL)
      RESULTS_FAIL+=("$message")
      log_error "$message"
      ;;
  esac
 }
 HAS_JQ="0"
 if command -v jq >/dev/null 2>&1; then
  HAS_JQ="1"
 fi
 if ! command -v curl >/dev/null 2>&1; then
  log_error "curl command not found; please install curl (e.g. apt-get install -y curl)"
  exit 2
 fi
 if [[ "$HAS_JQ" == "0" ]] && ! command -v python3 >/dev/null 2>&1; then
  log_error "Neither jq nor python3 is available for JSON processing"
  exit 2
 fi
 CURL_OPTS=(--fail --show-error --silent --max-time 10)
 curl_json() {
  local url="$1"
  if ! curl "${CURL_OPTS[@]}" "$url"; then
    return 1
  fi
 }
 json_query() {
  local json="$1" jq_expr="$2" py_expr="$3"
  if [[ "$HAS_JQ" == "1" ]]; then
    if ! output=$(printf '%s' "$json" | jq -e -r "$jq_expr" 2>/dev/null); then
      return 1
    fi
    printf '%s' "$output"
    return 0
  fi
  # Feed the script on fd 3 so stdin stays free for the JSON document.
  python3 /dev/fd/3 "$py_expr" <<<"$json" 3<<'PY'
 import json
 import sys
 expr = sys.argv[1]
 try:
    data = json.load(sys.stdin)
    value = eval(expr, {}, {"data": data})
 except Exception:
    sys.exit(1)
 if value is None:
    sys.exit(1)
 if isinstance(value, (dict, list)):
    print(json.dumps(value))
 else:
    print(value)
 PY
 }
 json_length() {
  local json="$1" jq_expr="$2" py_expr="$3"
  if [[ "$HAS_JQ" == "1" ]]; then
    if ! output=$(printf '%s' "$json" | jq -e "$jq_expr" 2>/dev/null); then
      return 1
    fi
    printf '%s' "$output"
    return 0
  fi
  # Feed the script on fd 3 so stdin stays free for the JSON document.
  python3 /dev/fd/3 "$py_expr" <<<"$json" 3<<'PY'
 import json
 import sys
 expr = sys.argv[1]
 try:
    data = json.load(sys.stdin)
    value = eval(expr, {}, {"data": data})
 except Exception:
    sys.exit(1)
 try:
    print(len(value))
 except Exception:
    sys.exit(1)
 PY
 }
 json_has_key() {
  local json="$1" jq_expr="$2" py_expr="$3"
  if [[ "$HAS_JQ" == "1" ]]; then
    if printf '%s' "$json" | jq -e "$jq_expr" >/dev/null 2>&1; then
      return 0
    fi
    return 1
  fi
  # Feed the script on fd 3 so stdin stays free for the JSON document.
  python3 /dev/fd/3 "$py_expr" <<<"$json" 3<<'PY'
 import json
 import sys
 expr = sys.argv[1]
 try:
    data = json.load(sys.stdin)
    value = eval(expr, {}, {"data": data})
 except Exception:
    sys.exit(1)
 if value:
    sys.exit(0)
 sys.exit(1)
 PY
 }
 iso_to_epoch() {
  local value="$1"
  if command -v date >/dev/null 2>&1; then
    date -d "$value" +%s 2>/dev/null && return 0
  fi
  if command -v python3 >/dev/null 2>&1; then
    python3 - "$value" <<'PY'
 import sys
 from datetime import datetime
 value = sys.argv[1]
 if value is None or value == "":
    sys.exit(1)
 if value.endswith('Z'):
    value = value[:-1] + '+00:00'
 try:
    dt = datetime.fromisoformat(value)
 except ValueError:
    sys.exit(1)
 print(int(dt.timestamp()))
 PY
    return $?
  fi
  return 1
 }
 validate_json_file() {
  local path="$1"
  if [[ "$HAS_JQ" == "1" ]]; then
    jq empty "$path" >/dev/null 2>&1 && return 0
    return 1
  fi
  if command -v python3 >/dev/null 2>&1; then
    python3 - "$path" <<'PY'
 import json
 import sys
 path = sys.argv[1]
 with open(path, 'r', encoding='utf-8') as handle:
    json.load(handle)
 PY
    return $?
  fi
  return 0
 }
 ensure_directory() {
  local dir="$1"
  if [[ ! -d "$dir" ]]; then
    log_warn "Creating missing directory $dir"
    mkdir -p "$dir"
  fi
 }
 TEST_HEALTH_FILE=""
 TEST_HEALTH_BACKUP=""
 TEST_HEALTH_EXISTED="false"
 cleanup() {
  if [[ -n "$TEST_HEALTH_FILE" ]]; then
    if [[ "$TEST_HEALTH_EXISTED" == "true" ]]; then
      printf '%s' "$TEST_HEALTH_BACKUP" > "$TEST_HEALTH_FILE"
    elif [[ "$KEEP_TEST_HEALTH" == "true" ]]; then
      :
    else
      rm -f "$TEST_HEALTH_FILE"
    fi
  fi
 }
 trap cleanup EXIT
 log_info "Starting agent deployment verification for hostname '$VERIFY_HOSTNAME'"
 # 4.2 Master health checks
 health_resp=""
 if ! health_resp=$(curl "${CURL_OPTS[@]}" -w '\n%{http_code} %{time_total}' "$MASTER_BASE/healthz" 2>/tmp/agent_verify_healthz.err); then
  error_detail=$(cat /tmp/agent_verify_healthz.err || true)
  add_result FAIL "GET /healthz failed: $error_detail"
 else
  http_meta=$(tail -n1 <<<"$health_resp")
  payload=$(head -n -1 <<<"$health_resp" || true)
  status_code=${http_meta%% *}
  elapsed=${http_meta##* }
  add_result PASS "GET /healthz status=$status_code elapsed=${elapsed}s payload=$payload"
 fi
 rm -f /tmp/agent_verify_healthz.err
 if ! readyz_resp=$(curl "${CURL_OPTS[@]}" -w '\n%{http_code} %{time_total}' "$MASTER_BASE/readyz" 2>/tmp/agent_verify_readyz.err); then
  error_detail=$(cat /tmp/agent_verify_readyz.err || true)
  add_result FAIL "GET /readyz failed: $error_detail"
  readyz_payload=""
 else
  readyz_meta=$(tail -n1 <<<"$readyz_resp")
  readyz_payload=$(head -n -1 <<<"$readyz_resp" || true)
  readyz_status=${readyz_meta%% *}
  readyz_elapsed=${readyz_meta##* }
  add_result PASS "GET /readyz status=$readyz_status elapsed=${readyz_elapsed}s"
 fi
 rm -f /tmp/agent_verify_readyz.err
 # 4.3 Nodes list and detail
 if ! nodes_json=$(curl_json "$MASTER_BASE/api/v1/master/nodes" 2>/tmp/agent_verify_nodes.err); then
  error_detail=$(cat /tmp/agent_verify_nodes.err || true)
  add_result FAIL "GET /api/v1/master/nodes failed: $error_detail"
  nodes_json=""
 fi
 rm -f /tmp/agent_verify_nodes.err
 NODE_ENTRY=""
 NODE_ID=""
 NODE_IP=""
 if [[ -n "$nodes_json" ]]; then
  if [[ "$HAS_JQ" == "1" ]]; then
    NODE_ENTRY=$(printf '%s' "$nodes_json" | jq -e --arg name "$VERIFY_HOSTNAME" '.[] | select(.name == $name)') || NODE_ENTRY=""
  else
    NODE_ENTRY=$(python3 /dev/fd/3 "$VERIFY_HOSTNAME" <<<"$nodes_json" 3<<'PY'
 import json
 import sys
 hostname = sys.argv[1]
 nodes = json.load(sys.stdin)
 for node in nodes:
    if node.get("name") == hostname:
        import json as _json
        print(_json.dumps(node))
        sys.exit(0)
 sys.exit(1)
 PY
    ) || NODE_ENTRY=""
  fi
  if [[ -z "$NODE_ENTRY" ]]; then
    add_result FAIL "Current node '$VERIFY_HOSTNAME' not found in master nodes list"
  else
    if NODE_ID=$(json_query "$NODE_ENTRY" '.id' 'data["id"]'); then
      add_result PASS "Discovered node id '$NODE_ID' for hostname '$VERIFY_HOSTNAME'"
    else
      add_result FAIL "Failed to extract node id from master response"
    fi
  fi
  if [[ -n "$NODE_ENTRY" ]] && NODE_DETAIL=$(curl_json "$MASTER_BASE/api/v1/master/nodes/$NODE_ID" 2>/tmp/agent_verify_node_detail.err); then
    NODE_DETAIL_JSON="$NODE_DETAIL"
    add_result PASS "Fetched node detail for $NODE_ID"
    if NODE_IP=$(json_query "$NODE_DETAIL_JSON" '.meta_data.ip // .meta_data.host_ip // empty' 'data.get("meta_data", {}).get("ip") or data.get("meta_data", {}).get("host_ip") or ""'); then
      if [[ -n "$NODE_IP" ]]; then
        add_result PASS "Registered node IP=$NODE_IP"
      else
        log_info "Node detail does not expose IP fields"
      fi
    fi
  else
    error_detail=$(cat /tmp/agent_verify_node_detail.err 2>/dev/null || true)
    add_result FAIL "Failed to fetch node detail for $NODE_ID: $error_detail"
    NODE_DETAIL_JSON=""
  fi
  rm -f /tmp/agent_verify_node_detail.err
  if stats_json=$(curl_json "$MASTER_BASE/api/v1/master/nodes/statistics" 2>/tmp/agent_verify_stats.err); then
    if total_nodes=$(json_query "$stats_json" '.total // .total_nodes' 'data.get("total") or data.get("total_nodes")'); then
      if [[ "$total_nodes" =~ ^[0-9]+$ ]] && [[ "$total_nodes" -ge 1 ]]; then
        add_result PASS "Statistics total=$total_nodes"
      else
        add_result WARN "Statistics total field not numeric: $total_nodes"
      fi
    else
      add_result WARN "Unable to read total field from statistics"
    fi
    active_nodes=""
    if [[ "$HAS_JQ" == "1" ]]; then
      active_nodes=$(printf '%s' "$stats_json" | jq -e 'if .status_statistics then (.status_statistics[] | select(.status == "online") | .count) else empty end' 2>/dev/null | head -n1 || true)
    elif command -v python3 >/dev/null 2>&1; then
      active_nodes=$(printf '%s' "$stats_json" | python3 -c 'import json,sys; data=json.load(sys.stdin); print(next((row.get("count") for row in data.get("status_statistics", []) if row.get("status")=="online"), ""))' 2>/dev/null || true)
    fi
    if [[ -n "$active_nodes" ]]; then
      add_result PASS "Online nodes reported by master: $active_nodes"
    fi
    if [[ "$HAS_JQ" == "1" ]]; then
      node_count=$(printf '%s' "$nodes_json" | jq 'length')
    else
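      # json_length applies len() itself, so the Python expression must be the list, not len(data).
      node_count=$(json_length "$nodes_json" 'length' 'data') || node_count=""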
    fi
    if [[ "$total_nodes" =~ ^[0-9]+$ ]] && [[ "$node_count" =~ ^[0-9]+$ ]] && [[ "$total_nodes" -lt "$node_count" ]]; then
      add_result WARN "Statistics total=$total_nodes less than nodes list count=$node_count"
    fi
  else
    error_detail=$(cat /tmp/agent_verify_stats.err 2>/dev/null || true)
    add_result FAIL "Failed to fetch node statistics: $error_detail"
  fi
  rm -f /tmp/agent_verify_stats.err
 else
  NODE_DETAIL_JSON=""
 fi
 # 4.4 Agent persistence checks
 if [[ -f "$NODE_JSON" ]]; then
  node_file_content="$(cat "$NODE_JSON")"
  if node_id_local=$(json_query "$node_file_content" '.id' 'data["id"]'); then
    if [[ "$NODE_ID" != "" && "$node_id_local" == "$NODE_ID" ]]; then
      add_result PASS "node.json id matches master ($NODE_ID)"
    else
      add_result FAIL "node.json id '$node_id_local' differs from master id '$NODE_ID'"
    fi
  else
    add_result FAIL "Unable to extract id from node.json"
  fi
  if node_name_local=$(json_query "$node_file_content" '.name' 'data["name"]'); then
    if [[ "$node_name_local" == "$VERIFY_HOSTNAME" ]]; then
      add_result PASS "node.json name matches $VERIFY_HOSTNAME"
    else
      add_result FAIL "node.json name '$node_name_local' differs from hostname '$VERIFY_HOSTNAME'"
    fi
  else
    add_result FAIL "Unable to extract name from node.json"
  fi
  if register_time=$(json_query "$node_file_content" '.register_time' 'data.get("register_time")'); then
    if iso_to_epoch "$register_time" >/dev/null 2>&1; then
      add_result PASS "node.json register_time valid ISO timestamp"
    else
      add_result WARN "node.json register_time invalid: $register_time"
    fi
  else
    add_result WARN "node.json missing register_time"
  fi
  if last_updated=$(json_query "$node_file_content" '.last_updated' 'data.get("last_updated")'); then
    if iso_to_epoch "$last_updated" >/dev/null 2>&1; then
      add_result PASS "node.json last_updated valid ISO timestamp"
    else
      add_result WARN "node.json last_updated invalid: $last_updated"
    fi
  else
    add_result WARN "node.json missing last_updated"
  fi
 else
  add_result FAIL "node.json not found at $NODE_JSON"
  node_file_content=""
 fi
 ensure_directory "$HEALTH_DIR"
 if [[ -d "$HEALTH_DIR" ]]; then
  shopt -s nullglob
  health_files=("$HEALTH_DIR"/*.json)
  shopt -u nullglob
  if [[ ${#health_files[@]} -eq 0 ]]; then
    add_result WARN "Health directory $HEALTH_DIR is empty"
  else
    for hf in "${health_files[@]}"; do
      base=$(basename "$hf")
      if [[ "$base" != *-* ]]; then
        add_result WARN "Health file $base does not follow <module>-*.json"
        continue
      fi
      if ! validate_json_file "$hf" >/dev/null 2>&1; then
        add_result WARN "Health file $base is not valid JSON"
      fi
    done
  fi
 else
  add_result WARN "Health directory $HEALTH_DIR missing"
 fi
 if getent hosts master.argus.com >/dev/null 2>&1; then
  resolved_ips=$(getent hosts master.argus.com | awk '{print $1}' | xargs)
  add_result PASS "master.argus.com resolves to $resolved_ips"
 else
  add_result FAIL "Failed to resolve master.argus.com"
 fi
 # 4.5 Master-Node status consistency
 sleep_interval=$((REPORT_INTERVAL_SECONDS + 2))
 if [[ -n "$NODE_DETAIL_JSON" ]]; then
  detail_pre="$NODE_DETAIL_JSON"
 else
  detail_pre=""
 fi
 if [[ -z "$detail_pre" && -n "$NODE_ID" ]]; then
  if detail_pre=$(curl_json "$MASTER_BASE/api/v1/master/nodes/$NODE_ID" 2>/tmp/agent_verify_detail_pre.err); then
    add_result PASS "Fetched node detail pre-check"
  else
    error_detail=$(cat /tmp/agent_verify_detail_pre.err 2>/dev/null || true)
    add_result FAIL "Unable to fetch node detail for status check: $error_detail"
  fi
  rm -f /tmp/agent_verify_detail_pre.err
 fi
 server_ts_pre=""
 agent_ts_pre=""
 server_ts_post=""
 agent_ts_post=""
 if [[ -n "$detail_pre" ]]; then
  server_ts_pre=$(json_query "$detail_pre" '.last_report' 'data.get("last_report")' || echo "")
  agent_ts_pre=$(json_query "$detail_pre" '.agent_last_report' 'data.get("agent_last_report")' || echo "")
  log_info "Captured initial last_report timestamps server='$server_ts_pre' agent='$agent_ts_pre'"
  sleep "$sleep_interval"
  if detail_post=$(curl_json "$MASTER_BASE/api/v1/master/nodes/$NODE_ID" 2>/tmp/agent_verify_detail_post.err); then
    server_ts_post=$(json_query "$detail_post" '.last_report' 'data.get("last_report")' || echo "")
    agent_ts_post=$(json_query "$detail_post" '.agent_last_report' 'data.get("agent_last_report")' || echo "")
    if [[ "$server_ts_post" != "$server_ts_pre" ]]; then
      add_result PASS "last_report.server_timestamp advanced (pre=$server_ts_pre post=$server_ts_post)"
    else
      add_result FAIL "last_report.server_timestamp did not change after ${sleep_interval}s"
    fi
    if [[ "$agent_ts_post" != "$agent_ts_pre" ]]; then
      add_result PASS "last_report.agent_timestamp advanced"
    else
      add_result FAIL "last_report.agent_timestamp did not change"
    fi
    if [[ -n "$node_file_content" ]]; then
      if node_last_updated=$(json_query "$node_file_content" '.last_updated' 'data.get("last_updated")'); then
        if epoch_post=$(iso_to_epoch "$server_ts_post" 2>/dev/null); then
          if node_epoch=$(iso_to_epoch "$node_last_updated" 2>/dev/null); then
            diff=$((epoch_post - node_epoch))
            [[ $diff -lt 0 ]] && diff=$((-diff))
            tolerance=$((REPORT_INTERVAL_SECONDS * 2))
            if [[ $diff -le $tolerance ]]; then
              add_result PASS "last_report.server_timestamp and node.json last_updated within tolerance ($diff s)"
            else
              add_result WARN "Timestamp gap between master ($server_ts_post) and node.json ($node_last_updated) is ${diff}s"
            fi
          fi
        fi
      fi
    fi
    NODE_DETAIL_JSON="$detail_post"
  else
    error_detail=$(cat /tmp/agent_verify_detail_post.err 2>/dev/null || true)
    add_result FAIL "Failed to fetch node detail post-check: $error_detail"
  fi
  rm -f /tmp/agent_verify_detail_post.err
 fi
 # 4.6 Health simulation
 TEST_HEALTH_FILE="$HEALTH_DIR/verify-master.json"
 ensure_directory "$HEALTH_DIR"
 if [[ -f "$TEST_HEALTH_FILE" ]]; then
  TEST_HEALTH_EXISTED="true"
  TEST_HEALTH_BACKUP="$(cat "$TEST_HEALTH_FILE")"
 else
  TEST_HEALTH_EXISTED="false"
 fi
 create_health_file() {
  local message="$1"
  cat > "$TEST_HEALTH_FILE" <<HEALTHJSON
 {"status":"ok","message":"$message"}
 HEALTHJSON
 }
 validate_health_in_master() {
  local expected_message="$1"
  local detail_json="$2"
  local message
  if message=$(json_query "$detail_json" '.health["verify-master"].message' 'data.get("health", {}).get("verify-master", {}).get("message")'); then
    if [[ "$message" == "$expected_message" ]]; then
      return 0
    fi
  fi
  return 1
 }
 remove_health_from_master() {
  local detail_json="$1"
  if json_has_key "$detail_json" '(.health | has("verify-master"))' '"verify-master" in data.get("health", {})'; then
    return 1
  fi
  return 0
 }
 health_message_one="verify $(date +%s)"
 create_health_file "$health_message_one"
 add_result PASS "Created test health file $TEST_HEALTH_FILE"
 sleep "$sleep_interval"
 if detail_health_one=$(curl_json "$MASTER_BASE/api/v1/master/nodes/$NODE_ID" 2>/tmp/agent_verify_health1.err); then
  if validate_health_in_master "$health_message_one" "$detail_health_one"; then
    add_result PASS "Master reflects verify-master health message"
  else
    add_result FAIL "Master health payload does not match test message"
  fi
 else
  error_detail=$(cat /tmp/agent_verify_health1.err 2>/dev/null || true)
  add_result FAIL "Failed to fetch node detail during health validation: $error_detail"
  detail_health_one=""
 fi
 rm -f /tmp/agent_verify_health1.err
 health_message_two="verify $(date +%s)-update"
 create_health_file "$health_message_two"
 sleep "$sleep_interval"
 if detail_health_two=$(curl_json "$MASTER_BASE/api/v1/master/nodes/$NODE_ID" 2>/tmp/agent_verify_health2.err); then
  if validate_health_in_master "$health_message_two" "$detail_health_two"; then
    add_result PASS "Master health updated to new message"
  else
    add_result FAIL "Master health message did not update"
  fi
 else
  error_detail=$(cat /tmp/agent_verify_health2.err 2>/dev/null || true)
  add_result FAIL "Failed to fetch node detail after health update: $error_detail"
  detail_health_two=""
 fi
 rm -f /tmp/agent_verify_health2.err
 rm -f "$TEST_HEALTH_FILE"
 sleep "$sleep_interval"
 if detail_health_three=$(curl_json "$MASTER_BASE/api/v1/master/nodes/$NODE_ID" 2>/tmp/agent_verify_health3.err); then
  if remove_health_from_master "$detail_health_three"; then
    add_result PASS "Master health no longer lists verify-master after removal"
  else
    add_result FAIL "Master health still contains verify-master after file deletion"
  fi
 else
  error_detail=$(cat /tmp/agent_verify_health3.err 2>/dev/null || true)
  add_result FAIL "Failed to fetch node detail after health removal: $error_detail"
 fi
 rm -f /tmp/agent_verify_health3.err
 if [[ "$TEST_HEALTH_EXISTED" == "true" ]]; then
  printf '%s' "$TEST_HEALTH_BACKUP" > "$TEST_HEALTH_FILE"
 fi
 # Optional config touch
 if [[ "$ALLOW_CONFIG_TOUCH" == "true" ]]; then
  if [[ -n "$NODE_ID" ]]; then
    payload='{"label": {"verify": "true"}}'
    if curl "${CURL_OPTS[@]}" -X PUT -H 'Content-Type: application/json' -d "$payload" "$MASTER_BASE/api/v1/master/nodes/$NODE_ID/config" >/tmp/agent_verify_config.log 2>&1; then
      add_result PASS "Config PUT dry-run succeeded"
    else
      add_result WARN "Config PUT dry-run failed: $(cat /tmp/agent_verify_config.log)"
    fi
    rm -f /tmp/agent_verify_config.log
  fi
 else
  add_result WARN "Config PUT dry-run skipped (enable with --allow-config-touch)"
 fi
 # Result summary
 echo
 echo "==== Verification Summary ===="
 for entry in "${RESULTS_PASS[@]}"; do
  printf 'PASS: %s\n' "$entry"
 done
 for entry in "${RESULTS_WARN[@]}"; do
  printf 'WARN: %s\n' "$entry"
 done
 for entry in "${RESULTS_FAIL[@]}"; do
  printf 'FAIL: %s\n' "$entry"
 done
 if [[ ${#RESULTS_FAIL[@]} -gt 0 ]]; then
  exit 1
 fi
 exit 0
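
Note on the timestamp checks in section 4.5 above: the script converts ISO 8601 strings to epoch seconds and accepts a gap of up to twice the report interval. The same logic is sketched in Python below, mirroring the script's own `iso_to_epoch` fallback. Treating timestamps without a UTC offset as UTC is an assumption of this sketch, and `within_tolerance` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def iso_to_epoch(value: str) -> int:
    """Convert an ISO 8601 timestamp (optionally ending in 'Z') to epoch seconds."""
    if value.endswith("Z"):
        # datetime.fromisoformat() on older Python versions rejects a trailing 'Z'.
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are expressed in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def within_tolerance(master_ts: str, node_ts: str, report_interval: int) -> bool:
    """Return True if the two timestamps differ by at most 2 * report_interval seconds."""
    gap = abs(iso_to_epoch(master_ts) - iso_to_epoch(node_ts))
    return gap <= report_interval * 2
```

With the default interval of 2 seconds, a master timestamp and a node.json timestamp that are 3 seconds apart pass the check, while a 5 second gap would be reported as a warning.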
--- a/src/agent/scripts/build_binary.sh
+++ b/src/agent/scripts/build_binary.sh
@ -0,0 +1,276 @@
 #!/usr/bin/env bash
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 MODULE_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 BUILD_ROOT="$MODULE_ROOT/build"
 DIST_DIR="$MODULE_ROOT/dist"
 PYINSTALLER_BUILD="$BUILD_ROOT/pyinstaller"
 PYINSTALLER_SPEC="$PYINSTALLER_BUILD/spec"
 PYINSTALLER_WORK="$PYINSTALLER_BUILD/work"
 VENV_DIR="$BUILD_ROOT/venv"
 AGENT_BUILD_IMAGE="${AGENT_BUILD_IMAGE:-python:3.11-slim-bullseye}"
 AGENT_BUILD_USE_DOCKER="${AGENT_BUILD_USE_DOCKER:-1}"
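 # Ignore proxies inside the container by default, so pip does not fail when the corporate intranet proxy is unreachable from the Docker network (set to 0 to disable)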
 AGENT_BUILD_IGNORE_PROXY="${AGENT_BUILD_IGNORE_PROXY:-1}"
 USED_DOCKER=0
 run_host_build() {
  echo "[INFO] Using host Python environment for build" >&2
  rm -rf "$BUILD_ROOT" "$DIST_DIR"
  mkdir -p "$PYINSTALLER_BUILD" "$DIST_DIR"
  python3 -m venv --copies "$VENV_DIR"
  # shellcheck disable=SC1091
  source "$VENV_DIR/bin/activate"
  pip install --upgrade pip
  pip install "$MODULE_ROOT"
  pip install "pyinstaller==6.6.0"
  pyinstaller \
    --clean \
    --onefile \
    --name argus-agent \
    --distpath "$DIST_DIR" \
    --workpath "$PYINSTALLER_WORK" \
    --specpath "$PYINSTALLER_SPEC" \
    --add-data "$MODULE_ROOT/pyproject.toml:." \
    "$MODULE_ROOT/entry.py"
  chmod +x "$DIST_DIR/argus-agent"
  deactivate
 }
 run_docker_build() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "[ERROR] docker command not found; cannot build inside a container. Install Docker or set AGENT_BUILD_USE_DOCKER=0" >&2
    exit 1
  fi
  USED_DOCKER=1
  echo "[INFO] Building agent binary inside $AGENT_BUILD_IMAGE" >&2
  local host_uid host_gid
  host_uid="$(id -u)"
  host_gid="$(id -g)"
  docker_env=("--rm" "-v" "$MODULE_ROOT:/workspace" "-w" "/workspace" "--env" "TARGET_UID=${host_uid}" "--env" "TARGET_GID=${host_gid}")
  pass_env_if_set() {
    local var="$1"
    local value="${!var:-}"
    if [[ -n "$value" ]]; then
      docker_env+=("--env" "$var=$value")
    fi
  }
  pass_env_if_set PIP_INDEX_URL
  pass_env_if_set PIP_EXTRA_INDEX_URL
  pass_env_if_set PIP_TRUSTED_HOST
  pass_env_if_set HTTP_PROXY
  pass_env_if_set HTTPS_PROXY
  pass_env_if_set NO_PROXY
  pass_env_if_set http_proxy
  pass_env_if_set https_proxy
  pass_env_if_set no_proxy
  pass_env_if_set AGENT_BUILD_IGNORE_PROXY
 build_script=$(cat <<'INNER'
 set -euo pipefail
 cd /workspace
 apt-get update >/dev/null
 apt-get install -y --no-install-recommends binutils >/dev/null
 rm -rf /var/lib/apt/lists/*
 rm -rf build dist
 mkdir -p build/pyinstaller dist
 python3 -m venv --copies build/venv
 source build/venv/bin/activate
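 # When proxies are ignored, unset the common proxy and pip mirror variables so an unreachable proxy cannot break the build inside the container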
 if [ "${AGENT_BUILD_IGNORE_PROXY:-1}" = "1" ]; then
  unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY PIP_INDEX_URL PIP_EXTRA_INDEX_URL PIP_TRUSTED_HOST
 fi
 pip install --upgrade pip
 pip install .
 pip install pyinstaller==6.6.0
 pyinstaller \
  --clean \
  --onefile \
  --name argus-agent \
  --distpath dist \
  --workpath build/pyinstaller/work \
  --specpath build/pyinstaller/spec \
  --add-data /workspace/pyproject.toml:. \
  entry.py
 chmod +x dist/argus-agent
 TARGET_UID="${TARGET_UID:-0}"
 TARGET_GID="${TARGET_GID:-0}"
 chown -R "$TARGET_UID:$TARGET_GID" dist build 2>/dev/null || true
 python3 - <<'PY'
 from pathlib import Path
 from PyInstaller.archive.readers import CArchiveReader
 import sys
 archive = Path('dist/argus-agent')
 out_dir = Path('build/compat_check')
 out_dir.mkdir(parents=True, exist_ok=True)
 major, minor = sys.version_info[:2]
 libpython = f'libpython{major}.{minor}.so.1.0'
 expected_libs = [
    libpython,
    'libssl.so.3',
    'libcrypto.so.3',
 ]
 reader = CArchiveReader(str(archive))
 extracted = []
 missing = []
 for name in expected_libs:
    try:
        data = reader.extract(name)
    except KeyError:
        missing.append(name)
        continue
    (out_dir / name).write_bytes(data)
    extracted.append(name)
 (out_dir / 'manifest').write_text('\n'.join(extracted))
 if extracted:
    print('[INFO] Extracted libraries: ' + ', '.join(extracted))
 if missing:
    print('[WARN] Missing expected libraries in bundle: ' + ', '.join(missing))
 PY
 compat_check() {
  local lib_path="$1"
  if [[ ! -f "$lib_path" ]]; then
    echo "[WARN] Missing $lib_path for GLIBC check"
    return
  fi
  local max_glibc
  max_glibc=$(strings -a "$lib_path" | grep -Eo 'GLIBC_[0-9]+\.[0-9]+' | sort -Vu | tail -n 1 || true)
  if [[ -n "$max_glibc" ]]; then
    echo "[INFO] $lib_path references up to $max_glibc"
  else
    echo "[INFO] $lib_path does not expose GLIBC version strings"
  fi
 }
 compat_libs=()
 if [[ -f build/compat_check/manifest ]]; then
  mapfile -t compat_libs < build/compat_check/manifest
 fi
 if [[ ${#compat_libs[@]} -eq 0 ]]; then
  echo "[WARN] No libraries captured for GLIBC inspection"
 else
  for lib in "${compat_libs[@]}"; do
    compat_check "build/compat_check/$lib"
  done
 fi
 deactivate
 INNER
  )
  if ! docker run "${docker_env[@]}" "$AGENT_BUILD_IMAGE" bash -lc "$build_script"; then
    echo "[ERROR] Docker build failed; check Docker permissions, or set AGENT_BUILD_USE_DOCKER=0 to build on a compatible host" >&2
    exit 1
  fi
 }
 if [[ "$AGENT_BUILD_USE_DOCKER" == "1" ]]; then
  run_docker_build
 else
  run_host_build
 fi
 if [[ ! -f "$DIST_DIR/argus-agent" ]]; then
  echo "[ERROR] Agent binary was not produced" >&2
  exit 1
 fi
 if [[ "$USED_DOCKER" != "1" ]]; then
  if [[ ! -x "$VENV_DIR/bin/python" ]]; then
    echo "[WARN] PyInstaller virtualenv missing at $VENV_DIR; skipping compatibility check" >&2
  else
    COMPAT_DIR="$BUILD_ROOT/compat_check"
    rm -rf "$COMPAT_DIR"
    mkdir -p "$COMPAT_DIR"
    EXTRACT_SCRIPT=$(cat <<'PY'
 from pathlib import Path
 from PyInstaller.archive.readers import CArchiveReader
 import sys
 archive = Path('dist/argus-agent')
 out_dir = Path('build/compat_check')
 out_dir.mkdir(parents=True, exist_ok=True)
 major, minor = sys.version_info[:2]
 libpython = f'libpython{major}.{minor}.so.1.0'
 expected_libs = [
    libpython,
    'libssl.so.3',
    'libcrypto.so.3',
 ]
 reader = CArchiveReader(str(archive))
 extracted = []
 missing = []
 for name in expected_libs:
    try:
        data = reader.extract(name)
    except KeyError:
        missing.append(name)
        continue
    (out_dir / name).write_bytes(data)
    extracted.append(name)
 (out_dir / 'manifest').write_text('\n'.join(extracted))
 if extracted:
    print('[INFO] Extracted libraries: ' + ', '.join(extracted))
 if missing:
    print('[WARN] Missing expected libraries in bundle: ' + ', '.join(missing))
 PY
 )
    "$VENV_DIR/bin/python" - <<PY
 $EXTRACT_SCRIPT
 PY
    compat_libs=()
    if [[ -f "$COMPAT_DIR/manifest" ]]; then
      mapfile -t compat_libs < "$COMPAT_DIR/manifest"
    fi
    check_glibc_version() {
      local lib_path="$1"
      if [[ ! -f "$lib_path" ]]; then
        echo "[WARN] Skipping GLIBC check; file not found: $lib_path" >&2
        return
      fi
      if command -v strings >/dev/null 2>&1; then
        local max_glibc
        max_glibc=$(strings -a "$lib_path" | grep -Eo 'GLIBC_[0-9]+\.[0-9]+' | sort -Vu | tail -n 1 || true)
        if [[ -n "$max_glibc" ]]; then
          echo "[INFO] $lib_path references up to $max_glibc"
        else
          echo "[INFO] $lib_path does not expose GLIBC version strings"
        fi
      else
        echo "[WARN] strings command unavailable; cannot inspect $lib_path" >&2
      fi
    }
    if [[ ${#compat_libs[@]} -eq 0 ]]; then
      echo "[WARN] No libraries captured for GLIBC inspection" >&2
    else
      for lib in "${compat_libs[@]}"; do
        check_glibc_version "$COMPAT_DIR/$lib"
      done
    fi
  fi
 else
  echo "[INFO] Compatibility check executed inside container"
 fi
 echo "[INFO] Agent binary generated at $DIST_DIR/argus-agent"
--- a/src/agent/tests/.gitignore
+++ b/src/agent/tests/.gitignore
@ -0,0 +1,2 @@
 private/
 tmp/
--- a/src/agent/tests/__init__.py
+++ b/src/agent/tests/__init__.py
--- a/src/agent/tests/docker-compose.yml
+++ b/src/agent/tests/docker-compose.yml
@ -0,0 +1,99 @@
 services:
  bind:
    image: ${BIND_IMAGE_TAG:-argus-bind9:latest}
    container_name: argus-bind-agent-e2e
    volumes:
      - ./private:/private
    networks:
      default:
        ipv4_address: 172.28.0.2
    environment:
      - "ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}"
      - "ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}"
    restart: always
  master:
    image: argus-master:latest
    container_name: argus-master-agent-e2e
    depends_on:
      - bind
    environment:
      - OFFLINE_THRESHOLD_SECONDS=6
      - ONLINE_THRESHOLD_SECONDS=2
      - SCHEDULER_INTERVAL_SECONDS=1
      - "ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}"
      - "ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}"
    ports:
      - "32300:3000"
    volumes:
      - ./private/argus/master:/private/argus/master
      - ./private/argus/metric/prometheus:/private/argus/metric/prometheus
      - ./private/argus/etc:/private/argus/etc
    networks:
      default:
        ipv4_address: 172.28.0.10
    restart: always
  agent:
    image: ubuntu:22.04
    container_name: argus-agent-e2e
    hostname: dev-e2euser-e2einst-pod-0
    depends_on:
      - master
      - bind
    environment:
      - MASTER_ENDPOINT=http://master.argus.com:3000
      - REPORT_INTERVAL_SECONDS=2
      - "ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}"
      - "ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}"
    volumes:
      - ./private/argus/agent/dev-e2euser-e2einst-pod-0:/private/argus/agent/dev-e2euser-e2einst-pod-0
      - ./private/argus/agent/dev-e2euser-e2einst-pod-0/health:/private/argus/agent/dev-e2euser-e2einst-pod-0/health
      - ./private/argus/etc:/private/argus/etc
      - ../dist/argus-agent:/usr/local/bin/argus-agent:ro
      - ./scripts/agent_entrypoint.sh:/usr/local/bin/agent-entrypoint.sh:ro
      - ../scripts/agent_deployment_verify.sh:/usr/local/bin/agent_deployment_verify.sh:ro
    entrypoint:
      - /usr/local/bin/agent-entrypoint.sh
    networks:
      default:
        ipv4_address: 172.28.0.20
    restart: always
  agent_env:
    image: ubuntu:22.04
    container_name: argus-agent-env-e2e
    hostname: host_abc
    depends_on:
      - master
      - bind
    environment:
      - MASTER_ENDPOINT=http://master.argus.com:3000
      - REPORT_INTERVAL_SECONDS=2
      - AGENT_ENV=prod
      - AGENT_USER=ml
      - AGENT_INSTANCE=node-3
      - AGENT_HOSTNAME=host_abc
      - "ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}"
      - "ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}"
    volumes:
      - ./private/argus/agent/host_abc:/private/argus/agent/host_abc
      - ./private/argus/agent/host_abc/health:/private/argus/agent/host_abc/health
      - ./private/argus/etc:/private/argus/etc
      - ../dist/argus-agent:/usr/local/bin/argus-agent:ro
      - ./scripts/agent_entrypoint.sh:/usr/local/bin/agent-entrypoint.sh:ro
      - ../scripts/agent_deployment_verify.sh:/usr/local/bin/agent_deployment_verify.sh:ro
    entrypoint:
      - /usr/local/bin/agent-entrypoint.sh
    networks:
      default:
        ipv4_address: 172.28.0.21
    restart: always
 networks:
  default:
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: 172.28.0.0/16
--- a/src/agent/tests/scripts/00_e2e_test.sh
+++ b/src/agent/tests/scripts/00_e2e_test.sh
@ -0,0 +1,23 @@
 #!/usr/bin/env bash
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 SCRIPTS=(
  "01_bootstrap.sh"
  "02_up.sh"
  "03_wait_and_assert_registration.sh"
  "04_write_health_files.sh"
  "05_verify_agent.sh"
  "06_assert_status_on_master.sh"
  "07_restart_agent_and_reregister.sh"
  "08_down.sh"
 )
 for script in "${SCRIPTS[@]}"; do
  echo "[TEST] Running $script"
  "$SCRIPT_DIR/$script"
  echo "[TEST] $script completed"
  echo
 done
 echo "[TEST] Agent module E2E tests completed"
--- a/src/agent/tests/scripts/01_bootstrap.sh
+++ b/src/agent/tests/scripts/01_bootstrap.sh
@ -0,0 +1,63 @@
 #!/usr/bin/env bash
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 AGENT_ROOT="$(cd "$TEST_ROOT/.." && pwd)"
 MASTER_ROOT="$(cd "$AGENT_ROOT/../master" && pwd)"
 REPO_ROOT="$(cd "$AGENT_ROOT/../.." && pwd)"
 PRIVATE_ROOT="$TEST_ROOT/private"
 TMP_ROOT="$TEST_ROOT/tmp"
 AGENT_HOSTNAME="dev-e2euser-e2einst-pod-0"
 AGENT_CONFIG_DIR="$PRIVATE_ROOT/argus/agent/$AGENT_HOSTNAME"
 AGENT_HEALTH_DIR="$PRIVATE_ROOT/argus/agent/$AGENT_HOSTNAME/health"
 MASTER_PRIVATE_DIR="$PRIVATE_ROOT/argus/master"
 METRIC_PRIVATE_DIR="$PRIVATE_ROOT/argus/metric/prometheus"
 DNS_DIR="$PRIVATE_ROOT/argus/etc"
 BIND_IMAGE_TAG="${BIND_IMAGE_TAG:-argus-bind9:latest}"
 BIND_ROOT="$(cd "$MASTER_ROOT/../bind" && pwd)"
 ensure_image() {
  local image="$1"
  if ! docker image inspect "$image" >/dev/null 2>&1; then
    echo "[ERROR] Docker image '$image' not found; run the unified build script first (e.g. ./build/build_images.sh) to build the required images" >&2
    exit 1
  fi
 }
 mkdir -p "$AGENT_CONFIG_DIR"
 mkdir -p "$AGENT_HEALTH_DIR"
 mkdir -p "$MASTER_PRIVATE_DIR"
 mkdir -p "$METRIC_PRIVATE_DIR"
 mkdir -p "$TMP_ROOT"
 mkdir -p "$DNS_DIR"
 touch "$AGENT_HEALTH_DIR/.keep"
 # Stage the update-dns.sh provided by the bind module to simulate how it is distributed in production
 if [[ -f "$BIND_ROOT/build/update-dns.sh" ]]; then
  cp "$BIND_ROOT/build/update-dns.sh" "$DNS_DIR/update-dns.sh"
  chmod +x "$DNS_DIR/update-dns.sh"
 else
  echo "[WARN] bind update script missing at $BIND_ROOT/build/update-dns.sh"
 fi
 ensure_image "argus-master:latest"
 ensure_image "$BIND_IMAGE_TAG"
 AGENT_BINARY="$AGENT_ROOT/dist/argus-agent"
 pushd "$AGENT_ROOT" >/dev/null
 ./scripts/build_binary.sh
 popd >/dev/null
 if [[ ! -x "$AGENT_BINARY" ]]; then
  echo "[ERROR] Agent binary not found at $AGENT_BINARY" >&2
  exit 1
 fi
 echo "$AGENT_BINARY" > "$TMP_ROOT/agent_binary_path"
 echo "$BIND_IMAGE_TAG" > "$TMP_ROOT/bind_image_tag"
 echo "[INFO] Agent E2E bootstrap complete"
--- a/src/agent/tests/scripts/02_up.sh
+++ b/src/agent/tests/scripts/02_up.sh
@ -0,0 +1,53 @@
 #!/usr/bin/env bash
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 REPO_ROOT="$(cd "$TEST_ROOT/../../.." && pwd)"
 TMP_ROOT="$TEST_ROOT/tmp"
 ENV_FILE="$TEST_ROOT/.env"
 source "$REPO_ROOT/scripts/common/build_user.sh"
 load_build_user
 export ARGUS_BUILD_UID ARGUS_BUILD_GID
 cat > "$ENV_FILE" <<EOF
 ARGUS_BUILD_UID=$ARGUS_BUILD_UID
 ARGUS_BUILD_GID=$ARGUS_BUILD_GID
 EOF
 if [[ ! -f "$TMP_ROOT/agent_binary_path" ]]; then
  echo "[ERROR] Agent binary path missing; run 01_bootstrap.sh first" >&2
  exit 1
 fi
 AGENT_BINARY="$(cat "$TMP_ROOT/agent_binary_path")"
 if [[ ! -x "$AGENT_BINARY" ]]; then
  echo "[ERROR] Agent binary not executable: $AGENT_BINARY" >&2
  exit 1
 fi
 BIND_IMAGE_TAG_VALUE="argus-bind9:latest"
 if [[ -f "$TMP_ROOT/bind_image_tag" ]]; then
  BIND_IMAGE_TAG_VALUE="$(cat "$TMP_ROOT/bind_image_tag")"
 fi
 compose() {
  if docker compose version >/dev/null 2>&1; then
    docker compose "$@"
  else
    docker-compose "$@"
  fi
 }
 docker container rm -f argus-agent-e2e argus-agent-env-e2e argus-master-agent-e2e argus-bind-agent-e2e >/dev/null 2>&1 || true
 docker network rm tests_default >/dev/null 2>&1 || true
 pushd "$TEST_ROOT" >/dev/null
 compose down --remove-orphans || true
 BIND_IMAGE_TAG="$BIND_IMAGE_TAG_VALUE" compose up -d
 popd >/dev/null
 echo "[INFO] Master+Agent stack started"
--- a/src/agent/tests/scripts/03_wait_and_assert_registration.sh
+++ b/src/agent/tests/scripts/03_wait_and_assert_registration.sh
@ -0,0 +1,106 @@
 #!/usr/bin/env bash
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 TMP_ROOT="$TEST_ROOT/tmp"
 API_BASE="http://localhost:32300/api/v1/master"
 AGENT_HOSTNAME="dev-e2euser-e2einst-pod-0"
 ENV_AGENT_HOSTNAME="host_abc"
 NODE_FILE="$TEST_ROOT/private/argus/agent/$AGENT_HOSTNAME/node.json"
 ENV_NODE_FILE="$TEST_ROOT/private/argus/agent/$ENV_AGENT_HOSTNAME/node.json"
 mkdir -p "$TMP_ROOT"
 primary_node_id=""
 env_node_id=""
 for _ in {1..30}; do
  sleep 2
  response=$(curl -sS "$API_BASE/nodes" || true)
  if [[ -z "$response" ]]; then
    continue
  fi
  list_file="$TMP_ROOT/nodes_list.json"
  echo "$response" > "$list_file"
  readarray -t node_ids < <(python3 - "$list_file" "$AGENT_HOSTNAME" "$ENV_AGENT_HOSTNAME" <<'PY'
 import json, sys
 with open(sys.argv[1]) as handle:
    nodes = json.load(handle)
 target_primary = sys.argv[2]
 target_env = sys.argv[3]
 primary_id = ""
 env_id = ""
 for node in nodes:
    if node.get("name") == target_primary:
        primary_id = node.get("id", "")
    if node.get("name") == target_env:
        env_id = node.get("id", "")
 print(primary_id)
 print(env_id)
 PY
  )
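  # Default to empty so `set -u` does not abort when the parser produced no output (e.g. invalid JSON while master is still starting)
  primary_node_id="${node_ids[0]:-}"
  env_node_id="${node_ids[1]:-}"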
  if [[ -n "$primary_node_id" && -n "$env_node_id" ]]; then
    break
  fi
 done
 if [[ -z "$primary_node_id" ]]; then
  echo "[ERROR] Primary agent did not register within timeout" >&2
  exit 1
 fi
 if [[ -z "$env_node_id" ]]; then
  echo "[ERROR] Env-variable agent did not register within timeout" >&2
  exit 1
 fi
 echo "$primary_node_id" > "$TMP_ROOT/node_id"
 echo "$env_node_id" > "$TMP_ROOT/node_id_host_abc"
 if [[ ! -f "$NODE_FILE" ]]; then
  echo "[ERROR] node.json not created at $NODE_FILE" >&2
  exit 1
 fi
 python3 - "$NODE_FILE" <<'PY'
 import json, sys
 with open(sys.argv[1]) as handle:
    node = json.load(handle)
 assert "id" in node and node["id"], "node.json missing id"
 PY
 if [[ ! -f "$ENV_NODE_FILE" ]]; then
  echo "[ERROR] node.json not created at $ENV_NODE_FILE" >&2
  exit 1
 fi
 python3 - "$ENV_NODE_FILE" <<'PY'
 import json, sys
 with open(sys.argv[1]) as handle:
    node = json.load(handle)
 assert "id" in node and node["id"], "env agent node.json missing id"
 PY
 detail_file="$TMP_ROOT/initial_detail.json"
 curl -sS "$API_BASE/nodes/$primary_node_id" -o "$detail_file"
 python3 - "$detail_file" "$TMP_ROOT/initial_ip" <<'PY'
 import json, sys, pathlib
 with open(sys.argv[1]) as handle:
    node = json.load(handle)
 ip = node["meta_data"].get("ip")
 if not ip:
    raise SystemExit("meta_data.ip missing")
 pathlib.Path(sys.argv[2]).write_text(ip)
 PY
 echo "[INFO] Agent registered with node id $primary_node_id"
 echo "[INFO] Env-variable agent registered with node id $env_node_id"
--- a/src/agent/tests/scripts/04_write_health_files.sh
+++ b/src/agent/tests/scripts/04_write_health_files.sh
@ -0,0 +1,22 @@
 #!/usr/bin/env bash
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 HEALTH_DIR="$TEST_ROOT/private/argus/agent/dev-e2euser-e2einst-pod-0/health"
 cat > "$HEALTH_DIR/log-fluentbit.json" <<JSON
 {
  "status": "healthy",
  "timestamp": "2023-10-05T12:05:00Z"
 }
 JSON
 cat > "$HEALTH_DIR/metric-node-exporter.json" <<JSON
 {
  "status": "healthy",
  "timestamp": "2023-10-05T12:05:00Z"
 }
 JSON
 echo "[INFO] Health files written"
--- a/src/agent/tests/scripts/05_verify_agent.sh
+++ b/src/agent/tests/scripts/05_verify_agent.sh
@ -0,0 +1,60 @@
 #!/usr/bin/env bash
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 AGENT_ROOT="$(cd "$TEST_ROOT/.." && pwd)"
 VERIFY_SCRIPT="$AGENT_ROOT/scripts/agent_deployment_verify.sh"
 ENV_NODE_ID_FILE="$TEST_ROOT/tmp/node_id_host_abc"
 PRIMARY_CONTAINER="argus-agent-e2e"
 ENV_CONTAINER="argus-agent-env-e2e"
 PRIMARY_HOSTNAME="dev-e2euser-e2einst-pod-0"
 ENV_HOSTNAME="host_abc"
 if ! docker ps --format '{{.Names}}' | grep -q "^${PRIMARY_CONTAINER}$"; then
  echo "[WARN] agent container not running; skip verification"
  exit 0
 fi
 if docker exec -i "$PRIMARY_CONTAINER" bash -lc 'command -v curl >/dev/null && command -v jq >/dev/null'; then
  echo "[INFO] curl/jq already installed in agent container"
 else
  echo "[INFO] Installing curl/jq in agent container"
  docker exec -i "$PRIMARY_CONTAINER" bash -lc 'apt-get update >/dev/null 2>&1 && apt-get install -y curl jq >/dev/null 2>&1' || true
 fi
 if [[ ! -f "$VERIFY_SCRIPT" ]]; then
  echo "[ERROR] Verification script missing at $VERIFY_SCRIPT" >&2
  exit 1
 fi
 run_verifier() {
  local container="$1" hostname="$2"
  if ! docker ps --format '{{.Names}}' | grep -q "^${container}$"; then
    echo "[WARN] container $container not running; skip"
    return
  fi
  if ! docker exec -i "$container" bash -lc 'command -v /usr/local/bin/agent_deployment_verify.sh >/dev/null'; then
    echo "[ERROR] /usr/local/bin/agent_deployment_verify.sh missing in $container" >&2
    exit 1
  fi
  echo "[INFO] Running verification for $hostname in $container"
  docker exec -i "$container" env VERIFY_HOSTNAME="$hostname" /usr/local/bin/agent_deployment_verify.sh
 }
 run_verifier "$PRIMARY_CONTAINER" "$PRIMARY_HOSTNAME"
 if docker ps --format '{{.Names}}' | grep -q "^${ENV_CONTAINER}$"; then
  if docker exec -i "$ENV_CONTAINER" bash -lc 'command -v curl >/dev/null && command -v jq >/dev/null'; then
    echo "[INFO] curl/jq already installed in env agent container"
  else
    echo "[INFO] Installing curl/jq in env agent container"
    docker exec -i "$ENV_CONTAINER" bash -lc 'apt-get update >/dev/null 2>&1 && apt-get install -y curl jq >/dev/null 2>&1' || true
  fi
  run_verifier "$ENV_CONTAINER" "$ENV_HOSTNAME"
 else
  echo "[WARN] env-driven agent container not running; skip secondary verification"
 fi
--- a/src/agent/tests/scripts/06_assert_status_on_master.sh
+++ b/src/agent/tests/scripts/06_assert_status_on_master.sh
@ -0,0 +1,78 @@
 #!/usr/bin/env bash
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 TMP_ROOT="$TEST_ROOT/tmp"
 API_BASE="http://localhost:32300/api/v1/master"
 NODE_ID="$(cat "$TMP_ROOT/node_id")"
 ENV_NODE_ID="$(cat "$TMP_ROOT/node_id_host_abc")"
 ENV_HOSTNAME="host_abc"
 NODES_JSON="$TEST_ROOT/private/argus/metric/prometheus/nodes.json"
 success=false
 detail_file="$TMP_ROOT/agent_status_detail.json"
 for _ in {1..20}; do
  sleep 2
  if ! curl -sS "$API_BASE/nodes/$NODE_ID" -o "$detail_file"; then
    continue
  fi
  if python3 - "$detail_file" <<'PY'
 import json, sys
 with open(sys.argv[1]) as handle:
    node = json.load(handle)
 if node["status"] != "online":
    raise SystemExit(1)
 health = node.get("health", {})
 if "log-fluentbit" not in health or "metric-node-exporter" not in health:
    raise SystemExit(1)
 PY
  then
    success=true
    break
  fi
 done
 if [[ "$success" != true ]]; then
  echo "[ERROR] Node did not report health data in time" >&2
  exit 1
 fi
 if [[ ! -f "$NODES_JSON" ]]; then
  echo "[ERROR] nodes.json missing at $NODES_JSON" >&2
  exit 1
 fi
 python3 - "$NODES_JSON" "$NODE_ID" "$ENV_NODE_ID" <<'PY'
 import json, sys
 with open(sys.argv[1]) as handle:
    nodes = json.load(handle)
 expected_primary = sys.argv[2]
 expected_env = sys.argv[3]
 ids = {entry.get("node_id") for entry in nodes}
 assert expected_primary in ids, nodes
 assert expected_env in ids, nodes
 assert len(nodes) >= 2, nodes
 PY
 echo "[INFO] Master reflects agent health and nodes.json entries"
 env_detail_file="$TMP_ROOT/env_agent_detail.json"
 curl -sS "$API_BASE/nodes/$ENV_NODE_ID" -o "$env_detail_file"
 python3 - "$env_detail_file" "$ENV_HOSTNAME" <<'PY'
 import json, sys
 with open(sys.argv[1]) as handle:
    node = json.load(handle)
 expected_name = sys.argv[2]
 assert node.get("name") == expected_name, node
 meta = node.get("meta_data", {})
 assert meta.get("env") == "prod", meta
 assert meta.get("user") == "ml", meta
 assert meta.get("instance") == "node-3", meta
 PY
 echo "[INFO] Env-variable agent reports expected metadata"
--- a/src/agent/tests/scripts/07_restart_agent_and_reregister.sh
+++ b/src/agent/tests/scripts/07_restart_agent_and_reregister.sh
@ -0,0 +1,254 @@
 #!/usr/bin/env bash
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 TMP_ROOT="$TEST_ROOT/tmp"
 API_BASE="http://localhost:32300/api/v1/master"
 NODE_ID="$(cat "$TMP_ROOT/node_id")"
 ENV_NODE_ID_FILE="$TMP_ROOT/node_id_host_abc"
 if [[ ! -f "$ENV_NODE_ID_FILE" ]]; then
  echo "[ERROR] Env agent node id file missing at $ENV_NODE_ID_FILE" >&2
  exit 1
 fi
 ENV_NODE_ID="$(cat "$ENV_NODE_ID_FILE")"
 AGENT_HOSTNAME="dev-e2euser-e2einst-pod-0"
 ENV_AGENT_HOSTNAME="host_abc"
 NETWORK_NAME="tests_default"
 NEW_AGENT_IP="172.28.0.200"
 NEW_ENV_AGENT_IP="172.28.0.210"
 ENTRYPOINT_SCRIPT="$SCRIPT_DIR/agent_entrypoint.sh"
 VERIFY_SCRIPT="$TEST_ROOT/../scripts/agent_deployment_verify.sh"
 ENV_FILE="$TEST_ROOT/.env"
 # The restart scenario needs the same entrypoint script so that DNS registration behaves identically
 if [[ ! -f "$ENTRYPOINT_SCRIPT" ]]; then
  echo "[ERROR] agent entrypoint script missing at $ENTRYPOINT_SCRIPT" >&2
  exit 1
 fi
 if [[ ! -f "$VERIFY_SCRIPT" ]]; then
  echo "[ERROR] agent verification script missing at $VERIFY_SCRIPT" >&2
  exit 1
 fi
 if [[ ! -f "$TMP_ROOT/agent_binary_path" ]]; then
  echo "[ERROR] Agent binary path missing; rerun bootstrap" >&2
  exit 1
 fi
 AGENT_BINARY="$(cat "$TMP_ROOT/agent_binary_path")"
 if [[ ! -x "$AGENT_BINARY" ]]; then
  echo "[ERROR] Agent binary not executable: $AGENT_BINARY" >&2
  exit 1
 fi
 if [[ -f "$ENV_FILE" ]]; then
  set -a
  # shellcheck disable=SC1090
  source "$ENV_FILE"
  set +a
 else
  REPO_ROOT="$(cd "$TEST_ROOT/../../.." && pwd)"
  # shellcheck disable=SC1090
  source "$REPO_ROOT/scripts/common/build_user.sh"
  load_build_user
 fi
 AGENT_UID="${ARGUS_BUILD_UID:-2133}"
 AGENT_GID="${ARGUS_BUILD_GID:-2015}"
 compose() {
  if docker compose version >/dev/null 2>&1; then
    docker compose "$@"
  else
    docker-compose "$@"
  fi
 }
 before_file="$TMP_ROOT/before_restart.json"
 curl -sS "$API_BASE/nodes/$NODE_ID" -o "$before_file"
 prev_last_updated=$(python3 - "$before_file" <<'PY'
 import json, sys
 with open(sys.argv[1]) as handle:
    node = json.load(handle)
 print(node.get("last_updated", ""))
 PY
 )
 prev_ip=$(python3 - "$before_file" <<'PY'
 import json, sys
 with open(sys.argv[1]) as handle:
    node = json.load(handle)
 print(node["meta_data"].get("ip", ""))
 PY
 )
 initial_ip=$(cat "$TMP_ROOT/initial_ip")
 if [[ "$prev_ip" != "$initial_ip" ]]; then
  echo "[ERROR] Expected initial IP $initial_ip, got $prev_ip" >&2
  exit 1
 fi
 env_before_file="$TMP_ROOT/env_before_restart.json"
 curl -sS "$API_BASE/nodes/$ENV_NODE_ID" -o "$env_before_file"
 env_prev_last_updated=$(python3 - "$env_before_file" <<'PY'
 import json, sys
 with open(sys.argv[1]) as handle:
    node = json.load(handle)
 print(node.get("last_updated", ""))
 PY
 )
 env_prev_ip=$(python3 - "$env_before_file" <<'PY'
 import json, sys
 with open(sys.argv[1]) as handle:
    node = json.load(handle)
 print(node["meta_data"].get("ip", ""))
 PY
 )
 pushd "$TEST_ROOT" >/dev/null
 compose rm -sf agent
 compose rm -sf agent_env
 popd >/dev/null
 docker container rm -f argus-agent-e2e >/dev/null 2>&1 || true
 docker container rm -f argus-agent-env-e2e >/dev/null 2>&1 || true
 AGENT_DIR="$TEST_ROOT/private/argus/agent/$AGENT_HOSTNAME"
 HEALTH_DIR="$TEST_ROOT/private/argus/agent/$AGENT_HOSTNAME/health"
 ENV_AGENT_DIR="$TEST_ROOT/private/argus/agent/$ENV_AGENT_HOSTNAME"
 ENV_HEALTH_DIR="$TEST_ROOT/private/argus/agent/$ENV_AGENT_HOSTNAME/health"
 # Start the container manually with a fixed IP so the network state at registration time is known
 if ! docker run -d \
  --name argus-agent-e2e \
  --hostname "$AGENT_HOSTNAME" \
  --network "$NETWORK_NAME" \
  --ip "$NEW_AGENT_IP" \
  -v "$AGENT_DIR:/private/argus/agent/$AGENT_HOSTNAME" \
  -v "$HEALTH_DIR:/private/argus/agent/$AGENT_HOSTNAME/health" \
  -v "$TEST_ROOT/private/argus/etc:/private/argus/etc" \
  -v "$AGENT_BINARY:/usr/local/bin/argus-agent:ro" \
  -v "$ENTRYPOINT_SCRIPT:/usr/local/bin/agent-entrypoint.sh:ro" \
  -v "$VERIFY_SCRIPT:/usr/local/bin/agent_deployment_verify.sh:ro" \
  -e MASTER_ENDPOINT=http://master.argus.com:3000 \
  -e REPORT_INTERVAL_SECONDS=2 \
  -e ARGUS_BUILD_UID="$AGENT_UID" \
  -e ARGUS_BUILD_GID="$AGENT_GID" \
  --entrypoint /usr/local/bin/agent-entrypoint.sh \
  ubuntu:22.04 >/dev/null; then
  echo "[ERROR] Failed to start agent container with custom IP" >&2
  exit 1
 fi
 success=false
 detail_file="$TMP_ROOT/post_restart.json"
 for _ in {1..20}; do
  sleep 3
  if ! curl -sS "$API_BASE/nodes/$NODE_ID" -o "$detail_file"; then
    continue
  fi
  if python3 - "$detail_file" "$prev_last_updated" "$NODE_ID" "$prev_ip" "$NEW_AGENT_IP" <<'PY'
 import json, sys
 with open(sys.argv[1]) as handle:
    node = json.load(handle)
 prev_last_updated = sys.argv[2]
 expected_id = sys.argv[3]
 old_ip = sys.argv[4]
 expected_ip = sys.argv[5]
 last_updated = node.get("last_updated")
 current_ip = node["meta_data"].get("ip")
 assert node["id"] == expected_id
 if current_ip != expected_ip:
    raise SystemExit(1)
 if current_ip == old_ip:
    raise SystemExit(1)
 if not last_updated or last_updated == prev_last_updated:
    raise SystemExit(1)
 PY
  then
    success=true
    break
  fi
 done
 if [[ "$success" != true ]]; then
  echo "[ERROR] Agent did not report expected new IP $NEW_AGENT_IP after restart" >&2
  exit 1
 fi
 echo "[INFO] Agent restart produced successful re-registration with IP change"
 # ---- Restart env-driven agent without metadata environment variables ----
 if [[ ! -d "$ENV_AGENT_DIR" ]]; then
  echo "[ERROR] Env agent data dir missing at $ENV_AGENT_DIR" >&2
  exit 1
 fi
 if [[ ! -d "$ENV_HEALTH_DIR" ]]; then
  mkdir -p "$ENV_HEALTH_DIR"
 fi
 if ! docker run -d \
  --name argus-agent-env-e2e \
  --hostname "$ENV_AGENT_HOSTNAME" \
  --network "$NETWORK_NAME" \
  --ip "$NEW_ENV_AGENT_IP" \
  -v "$ENV_AGENT_DIR:/private/argus/agent/$ENV_AGENT_HOSTNAME" \
  -v "$ENV_HEALTH_DIR:/private/argus/agent/$ENV_AGENT_HOSTNAME/health" \
  -v "$TEST_ROOT/private/argus/etc:/private/argus/etc" \
  -v "$AGENT_BINARY:/usr/local/bin/argus-agent:ro" \
  -v "$ENTRYPOINT_SCRIPT:/usr/local/bin/agent-entrypoint.sh:ro" \
  -v "$VERIFY_SCRIPT:/usr/local/bin/agent_deployment_verify.sh:ro" \
  -e MASTER_ENDPOINT=http://master.argus.com:3000 \
  -e REPORT_INTERVAL_SECONDS=2 \
  -e ARGUS_BUILD_UID="$AGENT_UID" \
  -e ARGUS_BUILD_GID="$AGENT_GID" \
  --entrypoint /usr/local/bin/agent-entrypoint.sh \
  ubuntu:22.04 >/dev/null; then
  echo "[ERROR] Failed to start env-driven agent container without metadata env" >&2
  exit 1
 fi
 env_success=false
 env_detail_file="$TMP_ROOT/env_post_restart.json"
 for _ in {1..20}; do
  sleep 3
  if ! curl -sS "$API_BASE/nodes/$ENV_NODE_ID" -o "$env_detail_file"; then
    continue
  fi
  if python3 - "$env_detail_file" "$env_prev_last_updated" "$ENV_NODE_ID" "$env_prev_ip" "$NEW_ENV_AGENT_IP" <<'PY'
 import json, sys
 with open(sys.argv[1]) as handle:
    node = json.load(handle)
 prev_last_updated = sys.argv[2]
 expected_id = sys.argv[3]
 old_ip = sys.argv[4]
 expected_ip = sys.argv[5]
 last_updated = node.get("last_updated")
 current_ip = node["meta_data"].get("ip")
 meta = node.get("meta_data", {})
 assert node["id"] == expected_id
 if current_ip != expected_ip:
    raise SystemExit(1)
 if current_ip == old_ip:
    raise SystemExit(1)
 if not last_updated or last_updated == prev_last_updated:
    raise SystemExit(1)
 if meta.get("env") != "prod" or meta.get("user") != "ml" or meta.get("instance") != "node-3":
    raise SystemExit(1)
 PY
  then
    env_success=true
    break
  fi
 done
 if [[ "$env_success" != true ]]; then
  echo "[ERROR] Env-driven agent did not reuse persisted metadata after restart" >&2
  exit 1
 fi
 echo "[INFO] Env-driven agent restart succeeded with persisted metadata"
--- a/src/agent/tests/scripts/08_down.sh
+++ b/src/agent/tests/scripts/08_down.sh
@ -0,0 +1,36 @@
 #!/usr/bin/env bash
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 ENV_FILE="$TEST_ROOT/.env"
 compose() {
  if docker compose version >/dev/null 2>&1; then
    docker compose "$@"
  else
    docker-compose "$@"
  fi
 }
 docker container rm -f argus-agent-e2e argus-agent-env-e2e >/dev/null 2>&1 || true
 pushd "$TEST_ROOT" >/dev/null
 compose down --remove-orphans
 popd >/dev/null
 if [[ -d "$TEST_ROOT/private" ]]; then
  docker run --rm \
    -v "$TEST_ROOT/private:/target" \
    ubuntu:22.04 \
    chown -R "$(id -u):$(id -g)" /target >/dev/null 2>&1 || true
  rm -rf "$TEST_ROOT/private"
 fi
 rm -rf "$TEST_ROOT/tmp"
 if [[ -f "$ENV_FILE" ]]; then
  rm -f "$ENV_FILE"
 fi
 echo "[INFO] Agent E2E environment cleaned up"
--- a/src/agent/tests/scripts/agent_entrypoint.sh
+++ b/src/agent/tests/scripts/agent_entrypoint.sh
@ -0,0 +1,79 @@
 #!/usr/bin/env bash
 set -euo pipefail
 LOG_PREFIX="[AGENT-ENTRYPOINT]"
 DNS_SCRIPT="/private/argus/etc/update-dns.sh"
 DNS_CONF="/private/argus/etc/dns.conf"
 TARGET_DOMAIN="master.argus.com"
 AGENT_UID="${ARGUS_BUILD_UID:-2133}"
 AGENT_GID="${ARGUS_BUILD_GID:-2015}"
 AGENT_HOSTNAME="${HOSTNAME:-unknown}"
 AGENT_DATA_DIR="/private/argus/agent/${AGENT_HOSTNAME}"
 AGENT_HEALTH_DIR="${AGENT_DATA_DIR}/health"
 RUNTIME_GROUP="argusagent"
 RUNTIME_USER="argusagent"
 log() {
  echo "${LOG_PREFIX} $*"
 }
 mkdir -p "$AGENT_DATA_DIR" "$AGENT_HEALTH_DIR"
 chown -R "$AGENT_UID:$AGENT_GID" "$AGENT_DATA_DIR" "$AGENT_HEALTH_DIR" 2>/dev/null || true
 chown -R "$AGENT_UID:$AGENT_GID" "/private/argus/etc" 2>/dev/null || true
 if ! getent group "$AGENT_GID" >/dev/null 2>&1; then
  groupadd -g "$AGENT_GID" "$RUNTIME_GROUP"
 else
  RUNTIME_GROUP="$(getent group "$AGENT_GID" | cut -d: -f1)"
 fi
 if ! getent passwd "$AGENT_UID" >/dev/null 2>&1; then
  useradd -u "$AGENT_UID" -g "$AGENT_GID" -M -s /bin/bash "$RUNTIME_USER"
 else
  RUNTIME_USER="$(getent passwd "$AGENT_UID" | cut -d: -f1)"
 fi
 log "Runtime user: $RUNTIME_USER ($AGENT_UID:$AGENT_GID)"
 # Wait for the update-dns.sh script distributed by bind
 for _ in {1..30}; do
  if [[ -x "$DNS_SCRIPT" ]]; then
    break
  fi
  log "Waiting for update-dns.sh to become ready..."
  sleep 1
 done
 if [[ -x "$DNS_SCRIPT" ]]; then
  log "Running update-dns.sh to update the container DNS"
  while true; do
    if "$DNS_SCRIPT"; then
      log "update-dns.sh succeeded"
      break
    fi
    log "update-dns.sh failed; retrying in 3 seconds"
    sleep 3
  done
 else
  log "update-dns.sh not available; using the image's default DNS"
 fi
 # Log the current dns.conf content to aid troubleshooting
 if [[ -f "$DNS_CONF" ]]; then
  log "dns.conf content: $(tr '\n' ' ' < "$DNS_CONF")"
 else
  log "dns.conf has not been generated yet"
 fi
 # Try to resolve the master domain; failure does not block startup, and only a successful lookup is logged
 for _ in {1..30}; do
  if getent hosts "$TARGET_DOMAIN" >/dev/null 2>&1; then
    MASTER_IP=$(getent hosts "$TARGET_DOMAIN" | awk '{print $1}' | head -n 1)
    log "$TARGET_DOMAIN resolved successfully: $MASTER_IP"
    break
  fi
  sleep 1
 done
 log "启动 argus-agent"
 exec su -s /bin/bash -c /usr/local/bin/argus-agent "$RUNTIME_USER"
--- a/src/agent/tests/test_config_metadata.py
+++ b/src/agent/tests/test_config_metadata.py
@ -0,0 +1,151 @@
 from __future__ import annotations
 import os
 import unittest
 from contextlib import contextmanager
 from unittest.mock import patch
 from app.config import AgentConfig, load_config
 @contextmanager
 def temp_env(**overrides: str | None):
    originals: dict[str, str | None] = {}
    try:
        for key, value in overrides.items():
            originals[key] = os.environ.get(key)
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value
        yield
    finally:
        for key, original in originals.items():
            if original is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = original
 class LoadConfigMetadataTests(unittest.TestCase):
    @patch("app.config.Path.mkdir")
    def test_metadata_from_environment_variables(self, mock_mkdir):
        with temp_env(
            MASTER_ENDPOINT="http://master.local",
            AGENT_HOSTNAME="dev-user-one-pod",
            AGENT_ENV="prod",
            AGENT_USER="ops",
            AGENT_INSTANCE="node-1",
        ):
            config = load_config()
        self.assertEqual(config.environment, "prod")
        self.assertEqual(config.user, "ops")
        self.assertEqual(config.instance, "node-1")
        mock_mkdir.assert_called()
    @patch("app.config.Path.mkdir")
    def test_metadata_falls_back_to_hostname(self, mock_mkdir):
        with temp_env(
            MASTER_ENDPOINT="http://master.local",
            AGENT_HOSTNAME="qa-team-abc-pod-2",
            AGENT_ENV=None,
            AGENT_USER=None,
            AGENT_INSTANCE=None,
        ):
            config = load_config()
        self.assertEqual(config.environment, "qa")
        self.assertEqual(config.user, "team")
        self.assertEqual(config.instance, "abc")
        mock_mkdir.assert_called()
    @patch("app.config._load_metadata_from_state", return_value=("prod", "ops", "node-1"))
    @patch("app.config.Path.mkdir")
    def test_metadata_from_node_state(self, mock_mkdir, mock_state):
        with temp_env(
            MASTER_ENDPOINT="http://master.local",
            AGENT_HOSTNAME="host_abc",
            AGENT_ENV=None,
            AGENT_USER=None,
            AGENT_INSTANCE=None,
        ):
            config = load_config()
        self.assertEqual(config.environment, "prod")
        self.assertEqual(config.user, "ops")
        self.assertEqual(config.instance, "node-1")
        mock_state.assert_called_once()
        mock_mkdir.assert_called()
    @patch("app.config.Path.mkdir")
    def test_partial_environment_variables_fallback(self, mock_mkdir):
        with temp_env(
            MASTER_ENDPOINT="http://master.local",
            AGENT_HOSTNAME="stage-ml-001-node",
            AGENT_ENV="prod",
            AGENT_USER=None,
            AGENT_INSTANCE=None,
        ):
            config = load_config()
        self.assertEqual(config.environment, "stage")
        self.assertEqual(config.user, "ml")
        self.assertEqual(config.instance, "001")
        mock_mkdir.assert_called()
    @patch("app.config.Path.mkdir")
    def test_invalid_hostname_raises_error(self, mock_mkdir):
        with temp_env(
            MASTER_ENDPOINT="http://master.local",
            AGENT_HOSTNAME="invalidhostname",
            AGENT_ENV=None,
            AGENT_USER=None,
            AGENT_INSTANCE=None,
        ):
            with self.assertRaises(ValueError):
                load_config()
        mock_mkdir.assert_not_called()
 class CollectMetadataTests(unittest.TestCase):
    @patch("app.collector._detect_ip_address", return_value="127.0.0.1")
    @patch("app.collector._detect_gpu_count", return_value=0)
    @patch("app.collector._detect_memory_bytes", return_value=1024)
    @patch("app.collector._detect_cpu_count", return_value=8)
    def test_collect_metadata_uses_config_fields(
        self,
        mock_cpu,
        mock_memory,
        mock_gpu,
        mock_ip,
    ):
        config = AgentConfig(
            hostname="dev-user-001-pod",
            environment="prod",
            user="ops",
            instance="node-1",
            node_file="/tmp/node.json",
            version="1.0.0",
            master_endpoint="http://master.local",
            report_interval_seconds=60,
            health_dir="/tmp/health",
        )
        from app.collector import collect_metadata
        metadata = collect_metadata(config)
        self.assertEqual(metadata["env"], "prod")
        self.assertEqual(metadata["user"], "ops")
        self.assertEqual(metadata["instance"], "node-1")
        self.assertEqual(metadata["hostname"], "dev-user-001-pod")
        self.assertEqual(metadata["ip"], "127.0.0.1")
        self.assertEqual(metadata["cpu_number"], 8)
        self.assertEqual(metadata["memory_in_bytes"], 1024)
        self.assertEqual(metadata["gpu_number"], 0)
 if __name__ == "__main__":
    unittest.main()
--- a/src/alert/README.md
+++ b/src/alert/README.md
@ -0,0 +1,31 @@
 # Alertmanager
 ## Build
 1. First set the environment variables used for build and deployment. From the project root, run:
 ```bash
 cp src/alert/tests/.env.example src/alert/tests/.env
 ```
 Then open the copied `.env` file and adjust the variables as needed.
 2. Build with the script. From the project root, run:
 ```bash
 bash src/alert/alertmanager/build/build.sh
 ```
 After a successful build, `argus-alertmanager-latest.tar` is generated in the project root.
 ## Deployment
 Deployment is done with docker-compose. From the `src/alert/tests` directory, run:
 ```bash
 docker-compose up -d
 ```
 ## Dynamic configuration
 The configuration file is located at `/private/argus/alert/alertmanager/alertmanager.yml`. After editing alertmanager.yml, send a POST request to `http://alertmanager.alert.argus.com:9093/-/reload` to reload the configuration:
 ```bash
 curl -X POST http://localhost:9093/-/reload
 ```
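For scripted reloads, the same call can be wrapped so that a failed reload is visible to the caller. The following is a minimal Python sketch, assuming the Alertmanager base URL is reachable from where it runs; `build_reload_request` and `reload_alertmanager` are hypothetical helper names that do not exist in this repository.

```python
from __future__ import annotations

import urllib.request


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Build the POST request for the /-/reload endpoint (hypothetical helper)."""
    url = base_url.rstrip("/") + "/-/reload"
    # The README specifies POST for this endpoint; the request body is empty.
    return urllib.request.Request(url, data=b"", method="POST")


def reload_alertmanager(base_url: str, timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code.

    urllib raises urllib.error.HTTPError for error responses, so a
    rejected reload surfaces as an exception rather than passing silently.
    """
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `reload_alertmanager("http://localhost:9093")` is then equivalent to the curl command above, except that an error response raises instead of being ignored.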
--- a/src/alert/alertmanager/build/Dockerfile
+++ b/src/alert/alertmanager/build/Dockerfile
@ -0,0 +1,102 @@
 # Based on Ubuntu 24.04
 FROM ubuntu:24.04
 # Switch to the root user
 USER root
 # Install required dependencies
 RUN apt-get update && \
    apt-get install -y wget supervisor net-tools inetutils-ping vim ca-certificates passwd && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
 # Set the Alertmanager version (must match the local offline package)
 ARG ALERTMANAGER_VERSION=0.28.1
 ARG ALERTMANAGER_ARCH=amd64
 # Build from the offline package bundled in the repository (no network access needed)
 COPY src/alert/alertmanager/build/alertmanager-${ALERTMANAGER_VERSION}.linux-${ALERTMANAGER_ARCH}.tar.gz /tmp/
 RUN tar xvf /tmp/alertmanager-${ALERTMANAGER_VERSION}.linux-${ALERTMANAGER_ARCH}.tar.gz -C /tmp && \
    mv /tmp/alertmanager-${ALERTMANAGER_VERSION}.linux-${ALERTMANAGER_ARCH} /usr/local/alertmanager && \
    rm -f /tmp/alertmanager-${ALERTMANAGER_VERSION}.linux-${ALERTMANAGER_ARCH}.tar.gz
 ENV ALERTMANAGER_BASE_PATH=/private/argus/alert/alertmanager
 ARG ARGUS_BUILD_UID=2133
 ARG ARGUS_BUILD_GID=2015
 ENV ARGUS_BUILD_UID=${ARGUS_BUILD_UID}
 ENV ARGUS_BUILD_GID=${ARGUS_BUILD_GID}
 RUN mkdir -p /usr/share/alertmanager && \
    mkdir -p ${ALERTMANAGER_BASE_PATH} && \
    mkdir -p /private/argus/etc && \
    rm -rf /alertmanager && \
    ln -s ${ALERTMANAGER_BASE_PATH} /alertmanager
 # Ensure the ubuntu account exists and uses ARGUS_BUILD_UID/GID
 RUN set -eux; \
    # Ensure a group with the target GID exists; if not, first try to change the ubuntu group to that GID, otherwise create a new group
    if getent group "${ARGUS_BUILD_GID}" >/dev/null; then \
      :; \
    else \
      if getent group ubuntu >/dev/null; then \
        groupmod -g "${ARGUS_BUILD_GID}" ubuntu || true; \
      else \
        groupadd -g "${ARGUS_BUILD_GID}" ubuntu || groupadd -g "${ARGUS_BUILD_GID}" argus || true; \
      fi; \
    fi; \
    # Create or adjust the ubuntu user
    if id ubuntu >/dev/null 2>&1; then \
      # Set the primary group to the target GID (a numeric GID is accepted)
      usermod -g "${ARGUS_BUILD_GID}" ubuntu || true; \
      # If the target UID is not already taken, update the ubuntu user's UID
      if [ "$(id -u ubuntu)" != "${ARGUS_BUILD_UID}" ] && ! getent passwd "${ARGUS_BUILD_UID}" >/dev/null; then \
        usermod -u "${ARGUS_BUILD_UID}" ubuntu || true; \
      fi; \
    else \
      useradd -m -s /bin/bash -u "${ARGUS_BUILD_UID}" -g "${ARGUS_BUILD_GID}" ubuntu || true; \
    fi; \
    # Chown the key directories to the ubuntu UID/GID
    chown -R "${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}" /usr/share/alertmanager /alertmanager ${ALERTMANAGER_BASE_PATH} /private/argus/etc /usr/local/bin || true
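 # Configure intranet apt sources (when the intranet option is set).
 # USE_INTRANET must be declared as a build arg; without the ARG it is always empty in RUN.
 ARG USE_INTRANET=false
 RUN if [ "$USE_INTRANET" = "true" ]; then \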
        echo "Configuring intranet apt sources..." && \
        cp /etc/apt/sources.list /etc/apt/sources.list.bak && \
        echo "deb [trusted=yes] http://10.68.64.1/ubuntu2204/ jammy main" > /etc/apt/sources.list && \
        echo 'Acquire::https::Verify-Peer "false";' > /etc/apt/apt.conf.d/99disable-ssl-check && \
        echo 'Acquire::https::Verify-Host "false";' >> /etc/apt/apt.conf.d/99disable-ssl-check; \
    fi
 # Configure the apt source used at deployment time
 RUN if [ "$USE_INTRANET" = "true" ]; then \
    echo "deb [trusted=yes] https://10.92.132.52/mirrors/ubuntu2204/ jammy main" > /etc/apt/sources.list; \
    fi
 # Create the supervisor log directory
 RUN mkdir -p /var/log/supervisor
 # Copy the supervisor configuration file
 COPY src/alert/alertmanager/build/supervisord.conf /etc/supervisor/conf.d/supervisord.conf
 # Copy the startup script
 COPY src/alert/alertmanager/build/start-am-supervised.sh /usr/local/bin/start-am-supervised.sh
 RUN chmod +x /usr/local/bin/start-am-supervised.sh
 # Copy the Alertmanager configuration file
 COPY src/alert/alertmanager/build/alertmanager.yml /etc/alertmanager/alertmanager.yml
 RUN chmod +x /etc/alertmanager/alertmanager.yml
 # COPY src/alert/alertmanager/build/alertmanager.yml ${ALERTMANAGER_BASE_PATH}/alertmanager.yml
 # Copy the DNS monitor script
 COPY src/alert/alertmanager/build/dns-monitor.sh /usr/local/bin/dns-monitor.sh
 RUN chmod +x /usr/local/bin/dns-monitor.sh
 # Stay as root; supervisor handles the user switch
 USER root
 # Expose the port (Alertmanager default: 9093)
 EXPOSE 9093
 # Use supervisor as the entrypoint
 CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]
--- a/src/alert/alertmanager/build/alertmanager-0.28.1.linux-amd64.tar.gz
+++ b/src/alert/alertmanager/build/alertmanager-0.28.1.linux-amd64.tar.gz
--- a/src/alert/alertmanager/build/alertmanager.yml
+++ b/src/alert/alertmanager/build/alertmanager.yml
@ -0,0 +1,19 @@
 global:
  resolve_timeout: 5m
 route:
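  group_by: ['alertname', 'instance']   # Grouping: alerts with the same alertname + instance are merged
  group_wait: 30s        # After the first alert, wait 30s so other alerts in the same group are sent together
  group_interval: 5m     # After a group changes, wait at least 5 minutes before sending again
  repeat_interval: 3h    # Re-send an unchanged alert every 3 hours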
  receiver: 'null'
 receivers:
  - name: 'null'
 inhibit_rules:
  - source_match:
      severity: 'critical'     # When a critical alert is present
    target_match:
      severity: 'warning'      # Suppress warning alerts for the same instance
    equal: ['instance']
--- a/src/alert/alertmanager/build/build.sh
+++ b/src/alert/alertmanager/build/build.sh
@ -0,0 +1,13 @@
 #!/bin/bash
 set -euo pipefail
 docker pull ubuntu:24.04
 source src/alert/tests/.env
 docker build \
  --build-arg "ARGUS_BUILD_UID=${ARGUS_BUILD_UID}" \
  --build-arg "ARGUS_BUILD_GID=${ARGUS_BUILD_GID}" \
  -f src/alert/alertmanager/build/Dockerfile \
  -t argus-alertmanager:latest .
 docker save -o argus-alertmanager-latest.tar argus-alertmanager:latest
--- a/src/alert/alertmanager/build/dns-monitor.sh
+++ b/src/alert/alertmanager/build/dns-monitor.sh
@ -0,0 +1,68 @@
 #!/bin/bash
 # DNS monitor script: checks dns.conf for changes every 10 seconds
 # and runs update-dns.sh when a change is detected
 DNS_CONF="/private/argus/etc/dns.conf"
 DNS_BACKUP="/tmp/dns.conf.backup"
 UPDATE_SCRIPT="/private/argus/etc/update-dns.sh"
 LOG_FILE="/var/log/supervisor/dns-monitor.log"
 # Make sure the log file exists
 touch "$LOG_FILE"
 log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') [DNS-Monitor] $1" >> "$LOG_FILE"
 }
 log_message "DNS monitor script started"
 while true; do
    if [ -f "$DNS_CONF" ]; then
        if [ -f "$DNS_BACKUP" ]; then
            # Compare file contents
            if ! cmp -s "$DNS_CONF" "$DNS_BACKUP"; then
                log_message "DNS configuration change detected"
                # Update the backup file
                cp "$DNS_CONF" "$DNS_BACKUP"
                # Run the update script
                if [ -x "$UPDATE_SCRIPT" ]; then
                    log_message "Running DNS update script: $UPDATE_SCRIPT"
                    "$UPDATE_SCRIPT" >> "$LOG_FILE" 2>&1
                    if [ $? -eq 0 ]; then
                        log_message "DNS update script succeeded"
                    else
                        log_message "DNS update script failed"
                    fi
                else
                    log_message "Warning: update script missing or not executable: $UPDATE_SCRIPT"
                fi
            fi
        else
            # Config file seen for the first time: run the update script
            if [ -x "$UPDATE_SCRIPT" ]; then
                log_message "Running DNS update script: $UPDATE_SCRIPT"
                "$UPDATE_SCRIPT" >> "$LOG_FILE" 2>&1
                if [ $? -eq 0 ]; then
                    log_message "DNS update script succeeded"
                    # First run: create the backup only after a successful update
                    cp "$DNS_CONF" "$DNS_BACKUP"
                    log_message "Created DNS configuration backup file"
                else
                    log_message "DNS update script failed"
                fi
            else
                log_message "Warning: update script missing or not executable: $UPDATE_SCRIPT"
            fi
        fi
    else
        log_message "Warning: DNS configuration file not found: $DNS_CONF"
    fi
    sleep 10
 done
--- a/src/alert/alertmanager/build/fetch-dist.sh
+++ b/src/alert/alertmanager/build/fetch-dist.sh
@ -0,0 +1,23 @@
 #!/usr/bin/env bash
 set -euo pipefail
 # Download the Alertmanager offline package into this directory, for COPY during the Docker build
 # Usage:
 #   ./fetch-dist.sh [version]
 # Examples:
 #   ./fetch-dist.sh 0.28.1
 #   ARCH=arm64 ./fetch-dist.sh 0.28.1
 VER="${1:-0.28.1}"
 ARCH="${ARCH:-amd64}"   # amd64 or arm64
 OUT="alertmanager-${VER}.linux-${ARCH}.tar.gz"
 URL="https://github.com/prometheus/alertmanager/releases/download/v${VER}/${OUT}"
 if [[ -f "$OUT" ]]; then
  echo "[INFO] $OUT already exists, skip download"
  exit 0
 fi
 echo "[INFO] Downloading $URL"
 curl -fL --retry 3 --connect-timeout 10 -o "$OUT" "$URL"
 echo "[OK] Saved to $(pwd)/$OUT"
--- a/src/alert/alertmanager/build/start-am-supervised.sh
+++ b/src/alert/alertmanager/build/start-am-supervised.sh
@ -0,0 +1,25 @@
 #!/bin/bash
 set -euo pipefail
 echo "[INFO] Starting Alertmanager under supervisor..."
 ALERTMANAGER_BASE_PATH=${ALERTMANAGER_BASE_PATH:-/private/argus/alert/alertmanager}
 echo "[INFO] Alertmanager base path: ${ALERTMANAGER_BASE_PATH}"
 # Use /etc/alertmanager/alertmanager.yml inside the container as the config file, to avoid permission problems caused by writing to the mounted volume
 echo "[INFO] Using /etc/alertmanager/alertmanager.yml as configuration"
 # Record the container IP address
 DOMAIN=alertmanager.alert.argus.com
 IP=$(ifconfig | grep -A 1 eth0 | grep inet | awk '{print $2}')
 echo "current IP: ${IP}"
 echo "${IP}" > /private/argus/etc/${DOMAIN}
 chmod 755 /private/argus/etc/${DOMAIN}
 echo "[INFO] Starting Alertmanager process..."
 # Start the main Alertmanager process
 exec /usr/local/alertmanager/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/alertmanager --cluster.listen-address=""
--- a/src/alert/alertmanager/build/supervisord.conf
+++ b/src/alert/alertmanager/build/supervisord.conf
@ -0,0 +1,39 @@
 [supervisord]
 nodaemon=true
 logfile=/var/log/supervisor/supervisord.log
 pidfile=/var/run/supervisord.pid
 user=root
 [program:alertmanager]
 command=/usr/local/bin/start-am-supervised.sh
 user=ubuntu
 stdout_logfile=/var/log/supervisor/alertmanager.log
 stderr_logfile=/var/log/supervisor/alertmanager_error.log
 autorestart=true
 startretries=3
 startsecs=10
 stopwaitsecs=20
 killasgroup=true
 stopasgroup=true
 [program:dns-monitor]
 command=/usr/local/bin/dns-monitor.sh
 user=root
 stdout_logfile=/var/log/supervisor/dns-monitor.log
 stderr_logfile=/var/log/supervisor/dns-monitor_error.log
 autorestart=true
 startretries=3
 startsecs=5
 stopwaitsecs=10
 killasgroup=true
 stopasgroup=true
 [unix_http_server]
 file=/var/run/supervisor.sock
 chmod=0700
 [supervisorctl]
 serverurl=unix:///var/run/supervisor.sock
 [rpcinterface:supervisor]
 supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
--- a/src/alert/alertmanager/config/rule_files/README.md
+++ b/src/alert/alertmanager/config/rule_files/README.md
@ -0,0 +1,60 @@
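 # Alert configuration
 > Reference: [Custom Prometheus alerting rules](https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/alert/prometheus-alert-rule)
 Configuring alerts in Prometheus takes two steps:
 1. Write an alerting rule file (a rules file)
 2. Load the rules in prometheus.yml and configure Alertmanager
 ## 1. Write the alerting rule file
 An alerting rule file looks like this: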
 ```yml
 groups:
  - name: example-rules
    interval: 30s  # Evaluate every 30 seconds
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unresponsive for more than 1 minute."
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage on instance {{ $labels.instance }} has been above 80% for 5 minutes."
 ```
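 Where:
 - `alert`: the name of the alerting rule.
 - `expr`: the trigger condition, written as a PromQL expression; it is evaluated to see whether any time series satisfy it.
 - `for`: optional hold duration. The alert is sent only after the condition has held continuously for this long; during the wait, a newly raised alert is in the `pending` state.
 - `labels`: custom labels attached to the alert, which Alertmanager can use for routing and grouping.
 - `annotations`: additional information such as descriptive text for the alert. Annotations are sent to Alertmanager together with the alert and typically carry its summary and details.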
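The interaction between the group `interval` and a rule's `for` is easiest to see by stepping through evaluation ticks. The following is a simplified Python model, not Prometheus's actual implementation; `alert_states` is a hypothetical helper, and it assumes an alert starts firing once the condition has held for at least the `for` duration.

```python
from __future__ import annotations


def alert_states(condition_by_tick: list[bool], interval_s: int, for_s: int) -> list[str]:
    """Return the alert state after each evaluation tick.

    condition_by_tick[i] says whether `expr` matched at tick i;
    ticks are spaced interval_s seconds apart.
    """
    states: list[str] = []
    active_since: int | None = None  # time at which the condition first held
    for tick, condition in enumerate(condition_by_tick):
        now = tick * interval_s
        if not condition:
            # The condition stopped holding: the pending timer resets.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


if __name__ == "__main__":
    # interval: 30s and for: 1m, as in the InstanceDown rule.
    print(alert_states([True, True, True, False, True], 30, 60))
```

Under this model, with `interval: 30s` and `for: 1m`, three consecutive matching evaluations are needed before the alert fires, and a single non-matching evaluation sends it back through `pending`.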
 ## 2. Reference the rules in prometheus.yml
 Add `rule_files` and `alerting` to prometheus.yml:
 ```yml
 global:
  [ evaluation_interval: <duration> | default = 1m ]
 rule_files:
  [ - <filepath_glob> ... ]
 alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager.alert.argus.com:9093"   # Alertmanager address
 ```
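Before reloading Prometheus, it can help to confirm that the `rule_files` globs match the files you expect. The following is a minimal Python sketch; `matched_rule_files` is a hypothetical helper, the directory layout in the usage note is illustrative, and it assumes relative patterns are resolved against the directory that contains prometheus.yml.

```python
from __future__ import annotations

import glob
import os


def matched_rule_files(config_dir: str, patterns: list[str]) -> list[str]:
    """Expand rule_files glob patterns and return the matching paths, sorted."""
    matches: set[str] = set()
    for pattern in patterns:
        # Assumption: relative patterns are relative to the config directory.
        if os.path.isabs(pattern):
            full_pattern = pattern
        else:
            full_pattern = os.path.join(config_dir, pattern)
        matches.update(glob.glob(full_pattern))
    return sorted(matches)
```

For example, `matched_rule_files("/private/argus/metric/prometheus", ["rules/*.yml"])` lists the rule files such a pattern would pick up on that host; an empty list means the pattern matches nothing.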
--- a/src/alert/alertmanager/config/rule_files/example_rules.yml
+++ b/src/alert/alertmanager/config/rule_files/example_rules.yml
@ -0,0 +1,37 @@
 groups:
  - name: example-rules
    interval: 30s  # Evaluate every 30 seconds
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unresponsive for more than 1 minute."
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage on instance {{ $labels.instance }} has been above 80% for 5 minutes."
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage on instance {{ $labels.instance }} has been above 80% for 5 minutes."
      - alert: DiskSpaceLow
        expr: (node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} - node_filesystem_free_bytes{fstype!~"tmpfs|overlay"}) / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Disk usage on instance {{ $labels.instance }} has been above 90% for 10 minutes."
--- a/src/alert/tests/.env
+++ b/src/alert/tests/.env
@ -0,0 +1,5 @@
 DATA_ROOT=/home/argus/tmp/private/argus
 ARGUS_BUILD_UID=1048
 ARGUS_BUILD_GID=1048
 USE_INTRANET=false
--- a/src/alert/tests/.env.example
+++ b/src/alert/tests/.env.example
@ -0,0 +1,5 @@
 DATA_ROOT=/home/argus/tmp/private/argus
 ARGUS_BUILD_UID=1048
 ARGUS_BUILD_GID=1048
 USE_INTRANET=false
--- a/src/alert/tests/docker-compose.yml
+++ b/src/alert/tests/docker-compose.yml
@ -0,0 +1,37 @@
 services:
  alertmanager:
    build:
      context: ../../../
      dockerfile: src/alert/alertmanager/build/Dockerfile
      args:
        ARGUS_BUILD_UID: ${ARGUS_BUILD_UID:-2133}
        ARGUS_BUILD_GID: ${ARGUS_BUILD_GID:-2015}
        USE_INTRANET: ${USE_INTRANET:-false}
    image: argus-alertmanager:latest
    container_name: argus-alertmanager
    environment:
      - ALERTMANAGER_BASE_PATH=/private/argus/alert/alertmanager
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    ports:
      - "${ARGUS_PORT:-9093}:9093"
    volumes:
      - ${DATA_ROOT:-./data}/alert/alertmanager:/private/argus/alert/alertmanager
      - ${DATA_ROOT:-./data}/etc:/private/argus/etc
    networks:
      - argus-debug-net
    restart: unless-stopped
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
 networks:
  argus-debug-net:
    driver: bridge
    name: argus-debug-net
 volumes:
  alertmanager_data:
    driver: local
--- a/src/alert/tests/scripts/verify_alertmanager.sh
+++ b/src/alert/tests/scripts/verify_alertmanager.sh
@ -0,0 +1,113 @@
 #!/bin/bash
 # verify_alertmanager.sh
 # Verifies after deployment that the Prometheus-to-Alertmanager communication path works
 set -euo pipefail
 #=============================
 # Basic configuration
 #=============================
 PROM_URL="${PROM_URL:-http://prom.metric.argus.com:9090}"
 ALERT_URL="${ALERT_URL:-http://alertmanager.alert.argus.com:9093}"
 # TODO: adjust the rules directory for the actual deployment environment
 DATA_ROOT="${DATA_ROOT:-/private/argus}"
 RULE_DIR="$DATA_ROOT/metric/prometheus/rules"
 TMP_RULE="/tmp/test_rule.yml"
 #=============================
 # Helper functions
 #=============================
 GREEN="\033[32m"; RED="\033[31m"; YELLOW="\033[33m"; RESET="\033[0m"
 log_info()    { echo -e "${YELLOW}[INFO]${RESET} $1"; }
 log_success() { echo -e "${GREEN}[OK]${RESET} $1"; }
 log_error()   { echo -e "${RED}[ERROR]${RESET} $1"; }
 fail_exit() { log_error "$1"; exit 1; }
 #=============================
 # Step 1: Check that Alertmanager is reachable
 #=============================
 log_info "Checking Alertmanager status..."
 if curl -sSf "${ALERT_URL}/api/v2/status" >/dev/null 2>&1; then
  log_success "Alertmanager is healthy (${ALERT_URL})"
 else
  fail_exit "Cannot reach Alertmanager; check the port mapping and container status."
 fi
 #=============================
 # Step 2: Send a manual test alert
 #=============================
 log_info "Sending a manual test alert..."
 curl -s -XPOST "${ALERT_URL}/api/v2/alerts" -H "Content-Type: application/json" -d '[
  {
    "labels": {
      "alertname": "ManualTestAlert",
      "severity": "info"
    },
    "annotations": {
      "summary": "This is a test alert from deploy verification"
    },
    "startsAt": "'$(date -Iseconds)'"
  }
 ]' >/dev/null && log_success "Test alert sent to Alertmanager"
 #=============================
 # Step 3: Check that the Prometheus configuration includes Alertmanager
 #=============================
 log_info "Checking whether Prometheus is configured with Alertmanager..."
 if curl -s "${PROM_URL}/api/v1/status/config" | grep -q "alertmanagers"; then
  log_success "Prometheus has an Alertmanager target configured"
 else
  fail_exit "Prometheus has no Alertmanager configured; check prometheus.yml"
 fi
 #=============================
 # Step 4: Create and load a test alerting rule
 #=============================
 log_info "Creating temporary test rule ${TMP_RULE} ..."
 cat <<EOF > "${TMP_RULE}"
 groups:
 - name: deploy-verify-group
  rules:
  - alert: DeployVerifyAlert
    expr: vector(1)
    labels:
      severity: warning
    annotations:
      summary: "Deployment verification alert"
 EOF
 mkdir -p "${RULE_DIR}"
 cp "${TMP_RULE}" "${RULE_DIR}/test_rule.yml"
 log_info "Reloading Prometheus to load the new rule..."
 if curl -s -X POST "${PROM_URL}/-/reload" >/dev/null; then
  log_success "Prometheus reloaded its rules"
 else
  fail_exit "Prometheus reload failed; check that the API is reachable."
 fi
 #=============================
 # Step 5: Wait, then verify that Alertmanager received the alert
 #=============================
 log_info "Waiting for the alert to fire (about 5 seconds)..."
 sleep 5
 if curl -s "${ALERT_URL}/api/v2/alerts" | grep -q "DeployVerifyAlert"; then
  log_success "Prometheus → Alertmanager alert path verified"
 else
  fail_exit "DeployVerifyAlert not found in Alertmanager; check the network or configuration."
 fi
 #=============================
 # Step 6: Clean up the test rule
 #=============================
 log_info "Cleaning up the temporary test rule..."
 rm -f "${RULE_DIR}/test_rule.yml" "${TMP_RULE}"
 curl -s -X POST "${PROM_URL}/-/reload" >/dev/null \
  && log_success "Prometheus verification rule removed" \
  || log_error "Prometheus reload failed during cleanup; please check manually."
 log_success "Deployment verification passed. Prometheus ↔ Alertmanager communication is working."
--- a/src/bind/.gitignore
+++ b/src/bind/.gitignore
@ -0,0 +1,2 @@
 images/
--- a/src/bind/build/Dockerfile
+++ b/src/bind/build/Dockerfile
@ -0,0 +1,90 @@
 FROM ubuntu:22.04
 # Set timezone and avoid interactive prompts
 ENV DEBIAN_FRONTEND=noninteractive
 ENV TZ=Asia/Shanghai
 # Build arguments
 ARG USE_INTRANET=false
 ARG ARGUS_BUILD_UID=2133
 ARG ARGUS_BUILD_GID=2015
 ENV ARGUS_BUILD_UID=${ARGUS_BUILD_UID} \
    ARGUS_BUILD_GID=${ARGUS_BUILD_GID}
 # Configure intranet apt sources (when the intranet option is set)
 RUN if [ "$USE_INTRANET" = "true" ]; then \
        echo "Configuring intranet apt sources..." && \
        cp /etc/apt/sources.list /etc/apt/sources.list.bak && \
        echo "deb [trusted=yes] http://10.68.64.1/ubuntu2204/ jammy main" > /etc/apt/sources.list && \
        echo 'Acquire::https::Verify-Peer "false";' > /etc/apt/apt.conf.d/99disable-ssl-check && \
        echo 'Acquire::https::Verify-Host "false";' >> /etc/apt/apt.conf.d/99disable-ssl-check; \
    fi
 # Update package list and install required packages
 RUN apt-get update && \
    apt-get install -y \
    bind9 \
    bind9utils \
    dnsutils \
    bind9-doc \
    supervisor \
    net-tools \
    inetutils-ping \
    vim \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
 # Adjust the bind user and group IDs to match the host configuration
 RUN set -eux; \
    current_gid="$(getent group bind | awk -F: '{print $3}')"; \
    if [ -z "$current_gid" ]; then \
        groupadd -g "${ARGUS_BUILD_GID}" bind; \
    elif [ "$current_gid" != "${ARGUS_BUILD_GID}" ]; then \
        groupmod -g "${ARGUS_BUILD_GID}" bind; \
    fi; \
    if id bind >/dev/null 2>&1; then \
        current_uid="$(id -u bind)"; \
        if [ "$current_uid" != "${ARGUS_BUILD_UID}" ]; then \
            usermod -u "${ARGUS_BUILD_UID}" bind; \
        fi; \
    else \
        useradd -m -u "${ARGUS_BUILD_UID}" -g "${ARGUS_BUILD_GID}" bind; \
    fi; \
    chown -R "${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}" /var/cache/bind /var/lib/bind
 # Configure the apt source used at deployment time
 RUN if [ "$USE_INTRANET" = "true" ]; then \
 	echo "deb [trusted=yes] https://10.92.132.52/mirrors/ubuntu2204/ jammy main" > /etc/apt/sources.list; \
    fi
 # Create supervisor configuration directory
 RUN mkdir -p /etc/supervisor/conf.d
 # Copy supervisor configuration
 COPY src/bind/build/supervisord.conf /etc/supervisor/conf.d/supervisord.conf
 # Copy BIND9 configuration files
 COPY src/bind/build/named.conf.local /etc/bind/named.conf.local
 COPY src/bind/build/db.argus.com /etc/bind/db.argus.com
 # Copy startup and reload scripts
 COPY src/bind/build/startup.sh /usr/local/bin/startup.sh
 COPY src/bind/build/reload-bind9.sh /usr/local/bin/reload-bind9.sh
 COPY src/bind/build/argus_dns_sync.sh /usr/local/bin/argus_dns_sync.sh
 COPY src/bind/build/update-dns.sh /usr/local/bin/update-dns.sh
 # Make scripts executable
 RUN chmod +x /usr/local/bin/startup.sh /usr/local/bin/reload-bind9.sh  /usr/local/bin/argus_dns_sync.sh /usr/local/bin/update-dns.sh
 # Set proper ownership for BIND9 files
 RUN chown bind:bind /etc/bind/named.conf.local /etc/bind/db.argus.com
 # Expose DNS port
 EXPOSE 53/tcp 53/udp
 # Use root user as requested
 USER root
 # Start with startup script
 CMD ["/usr/local/bin/startup.sh"]
--- a/src/bind/build/argus_dns_sync.sh
+++ b/src/bind/build/argus_dns_sync.sh
@ -0,0 +1,106 @@
 #!/usr/bin/env bash
 set -euo pipefail
 WATCH_DIR="/private/argus/etc"
 ZONE_DB="/private/argus/bind/db.argus.com"
 LOCKFILE="/var/lock/argus_dns_sync.lock"
 BACKUP_DIR="/private/argus/bind/.backup"
 SLEEP_SECONDS=10
 RELOAD_SCRIPT="/usr/local/bin/reload-bind9.sh"   # Path to your existing reload script
 mkdir -p "$(dirname "$LOCKFILE")" "$BACKUP_DIR"
 BACKUP_UID="${ARGUS_BUILD_UID:-2133}"
 BACKUP_GID="${ARGUS_BUILD_GID:-2015}"
 chown -R "$BACKUP_UID:$BACKUP_GID" "$BACKUP_DIR" 2>/dev/null || true
 is_ipv4() {
  local ip="$1"
  [[ "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]] || return 1
  IFS='.' read -r a b c d <<<"$ip"
  for n in "$a" "$b" "$c" "$d"; do
    (( n >= 0 && n <= 255 )) || return 1
  done
  return 0
 }
 get_current_ip() {
  local name="$1"
  sed -n -E "s/^${name}[[:space:]]+IN[[:space:]]+A[[:space:]]+([0-9.]+)[[:space:]]*$/\1/p" "$ZONE_DB" | head -n1
 }
 upsert_record() {
  local name="$1"
  local new_ip="$2"
  local ts cur_ip
  cur_ip="$(get_current_ip "$name" || true)"
  if [[ "$cur_ip" == "$new_ip" ]]; then
    echo "[SKIP] ${name} unchanged (${new_ip})"
    return 1
  fi
  # Back up the zone file only when a record is about to change, so that
  # unchanged polling rounds do not add a new backup every SLEEP_SECONDS
  ts="$(date +%Y%m%d-%H%M%S)"
  cp -a "$ZONE_DB" "$BACKUP_DIR/db.argus.com.$ts.bak"
  chown "$BACKUP_UID:$BACKUP_GID" "$BACKUP_DIR/db.argus.com.$ts.bak" 2>/dev/null || true
  if [[ -z "$cur_ip" ]]; then
    # Ensure the file ends with a newline before adding new record
    if [[ -s "$ZONE_DB" ]] && [[ $(tail -c1 "$ZONE_DB" | wc -l) -eq 0 ]]; then
      echo "" >> "$ZONE_DB"
    fi
    printf "%-20s IN A %s\n" "$name" "$new_ip" >> "$ZONE_DB"
    echo "[ADD] ${name} -> ${new_ip}"
  else
    awk -v n="$name" -v ip="$new_ip" '
      {
        if ($1==n && $2=="IN" && $3=="A") {
          printf "%-20s IN A %s\n", n, ip
        } else {
          print
        }
      }
    ' "$ZONE_DB" > "${ZONE_DB}.tmp" && mv "${ZONE_DB}.tmp" "$ZONE_DB"
    echo "[UPDATE] ${name}: ${cur_ip} -> ${new_ip}"
  fi
  return 0
 }
 while true; do
  exec 9>"$LOCKFILE"
  if flock -n 9; then
    shopt -s nullglob
    NEED_RELOAD=0
    for f in "$WATCH_DIR"/*.argus.com; do
      base="$(basename "$f")"
      name="${base%.argus.com}"
      ip="$(grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' "$f" | tail -n1 || true)"
      if [[ -z "$ip" ]] || ! is_ipv4 "$ip"; then
        echo "[WARN] no valid IPv4 address found in $f, skipping"
        continue
      fi
      if upsert_record "$name" "$ip"; then
        NEED_RELOAD=1
      fi
    done
    if [[ $NEED_RELOAD -eq 1 ]]; then
      echo "[INFO] db.argus.com changed, running reload-bind9.sh"
      bash "$RELOAD_SCRIPT"
    fi
    flock -u 9
  else
    echo "[INFO] A sync task is already running, skipping this round"
  fi
  sleep "$SLEEP_SECONDS"
 done
--- a/src/bind/build/db.argus.com
+++ b/src/bind/build/db.argus.com
@ -0,0 +1,16 @@
 $TTL    604800
@       IN      SOA     ns1.argus.com. admin.argus.com. (
                              2         ; Serial
                         604800         ; Refresh
                          86400         ; Retry
                        2419200         ; Expire
                         604800 )       ; Negative Cache TTL
 ; Define the DNS server
@       IN      NS      ns1.argus.com.
 ; Define the ns1 host
 ns1     IN      A       127.0.0.1
 ; Point web to 12.4.5.6
 web     IN      A       12.4.5.6
--- a/src/bind/build/dns-monitor.sh
+++ b/src/bind/build/dns-monitor.sh
@ -0,0 +1,71 @@
 #!/bin/bash
 # DNS monitor script: checks dns.conf for changes every 10 seconds
 # and runs update-dns.sh when a change is detected
 DNS_CONF="/private/argus/etc/dns.conf"
 DNS_BACKUP="/tmp/dns.conf.backup"
 UPDATE_SCRIPT="/private/argus/etc/update-dns.sh"
 LOG_FILE="/var/log/supervisor/dns-monitor.log"
 # Make sure the log file exists
 touch "$LOG_FILE"
 log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') [DNS-Monitor] $1" >> "$LOG_FILE"
 }
 log_message "DNS monitor script started"
 log_message "Removing DNS backup file (if present)"
 rm -f "$DNS_BACKUP"
 while true; do
    if [ -f "$DNS_CONF" ]; then
        if [ -f "$DNS_BACKUP" ]; then
            # Compare file contents
            if ! cmp -s "$DNS_CONF" "$DNS_BACKUP"; then
                log_message "DNS configuration change detected"
                # Update the backup file
                cp "$DNS_CONF" "$DNS_BACKUP"
                # Run the update script
                if [ -x "$UPDATE_SCRIPT" ]; then
                    log_message "Running DNS update script: $UPDATE_SCRIPT"
                    "$UPDATE_SCRIPT" >> "$LOG_FILE" 2>&1
                    if [ $? -eq 0 ]; then
                        log_message "DNS update script succeeded"
                    else
                        log_message "DNS update script failed"
                    fi
                else
                    log_message "Warning: update script missing or not executable: $UPDATE_SCRIPT"
                fi
            fi
        else
            # Config file seen for the first time: run the update script
            if [ -x "$UPDATE_SCRIPT" ]; then
                log_message "Running DNS update script: $UPDATE_SCRIPT"
                "$UPDATE_SCRIPT" >> "$LOG_FILE" 2>&1
                if [ $? -eq 0 ]; then
                    log_message "DNS update script succeeded"
                    # First run: create the backup only after a successful update
                    cp "$DNS_CONF" "$DNS_BACKUP"
                    log_message "Created DNS configuration backup file"
                else
                    log_message "DNS update script failed"
                fi
            else
                log_message "Warning: update script missing or not executable: $UPDATE_SCRIPT"
            fi
        fi
    else
        log_message "Warning: DNS configuration file not found: $DNS_CONF"
    fi
    sleep 10
 done
--- a/src/bind/build/named.conf.local
+++ b/src/bind/build/named.conf.local
@ -0,0 +1,4 @@
 zone "argus.com" {
    type master;
    file "/etc/bind/db.argus.com";
 };
--- a/src/bind/build/reload-bind9.sh
+++ b/src/bind/build/reload-bind9.sh
@ -0,0 +1,27 @@
 #!/bin/bash
 echo "Reloading BIND9 configuration..."
 # Check if configuration files are valid
 echo "Checking named.conf.local syntax..."
 if ! named-checkconf /etc/bind/named.conf.local; then
    echo "ERROR: named.conf.local has syntax errors!"
    exit 1
 fi
 echo "Checking zone file syntax..."
 if ! named-checkzone argus.com /etc/bind/db.argus.com; then
    echo "ERROR: db.argus.com has syntax errors!"
    exit 1
 fi
 # Reload BIND9 via supervisor
 echo "Reloading BIND9 service..."
 supervisorctl restart bind9
 if [ $? -eq 0 ]; then
    echo "BIND9 reloaded successfully!"
 else
    echo "ERROR: Failed to reload BIND9!"
    exit 1
 fi
--- a/src/bind/build/startup.sh
+++ b/src/bind/build/startup.sh
@ -0,0 +1,42 @@
 #!/bin/bash
 # Set /private permissions to 777 as requested
 chmod 777 /private 2>/dev/null || true
 # Create persistent directories for BIND9 configs and DNS sync
 mkdir -p /private/argus/bind
 mkdir -p /private/argus/etc
 chown bind:bind /private/argus 2>/dev/null || true
 chown -R bind:bind /private/argus/bind /private/argus/etc
 # Copy configuration files to persistent storage if they don't exist
 if [ ! -f /private/argus/bind/named.conf.local ]; then
    cp /etc/bind/named.conf.local /private/argus/bind/named.conf.local
 fi
 if [ ! -f /private/argus/bind/db.argus.com ]; then
    cp /etc/bind/db.argus.com /private/argus/bind/db.argus.com
 fi
 # Copy update-dns.sh to /private/argus/etc/
 cp /usr/local/bin/update-dns.sh /private/argus/etc/update-dns.sh
 chown bind:bind /private/argus/etc/update-dns.sh
 chmod a+x /private/argus/etc/update-dns.sh
 # Create symlinks to use persistent configs
 ln -sf /private/argus/bind/named.conf.local /etc/bind/named.conf.local
 ln -sf /private/argus/bind/db.argus.com /etc/bind/db.argus.com
 # Set proper ownership
 chown bind:bind /private/argus/bind/named.conf.local /private/argus/bind/db.argus.com
 # Record the container's IP address in dns.conf
 IP=$(ifconfig | grep -A 1 eth0 | grep inet | awk '{print $2}')
 echo "current IP: ${IP}"
 echo "${IP}" > /private/argus/etc/dns.conf
 # Create supervisor log directory
 mkdir -p /var/log/supervisor
 # Start supervisor
 exec /usr/bin/supervisord -c /etc/supervisor/conf.d/supervisord.conf
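The IP lookup in this script depends on the `ifconfig` output layout. A minimal sketch of an alternative that parses the one-line format of `ip -o -4 addr show` instead; the helper name `first_ipv4` is hypothetical, and it assumes the iproute2 `ip` tool is present in the image, which the commit does not confirm:

```bash
#!/bin/bash
# Hypothetical helper (not part of the commit): print the first IPv4 address
# found in `ip -o -4 addr show` style output read from stdin.
first_ipv4() {
    # In the one-line format, field 3 is "inet" and field 4 is "ADDRESS/PREFIX".
    awk '$3 == "inet" { split($4, a, "/"); print a[1]; exit }'
}

# Demo with canned input so the sketch runs without a network interface.
printf '2: eth0    inet 172.18.0.2/16 brd 172.18.255.255 scope global eth0\n' | first_ipv4
```

Inside the container this would be used as `IP=$(ip -o -4 addr show dev eth0 | first_ipv4)`, assuming the interface is named `eth0` as in the script above.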
--- a/src/bind/build/supervisord.conf
+++ b/src/bind/build/supervisord.conf
@ -0,0 +1,37 @@
 [unix_http_server]
 file=/var/run/supervisor.sock
 chmod=0700
 [supervisord]
 nodaemon=true
 user=root
 logfile=/var/log/supervisor/supervisord.log
 pidfile=/var/run/supervisord.pid
 [rpcinterface:supervisor]
 supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
 [supervisorctl]
 serverurl=unix:///var/run/supervisor.sock
 [program:bind9]
 command=/usr/sbin/named -g -c /etc/bind/named.conf -u bind
 user=bind
 autostart=true
 autorestart=true
 stderr_logfile=/var/log/supervisor/bind9.err.log
 stdout_logfile=/var/log/supervisor/bind9.out.log
 priority=10
 [program:argus-dns-sync]
 command=/usr/local/bin/argus_dns_sync.sh
 autostart=true
 autorestart=true
 startsecs=3
 stopsignal=TERM
 user=root
 stdout_logfile=/var/log/argus_dns_sync.out.log
 stderr_logfile=/var/log/argus_dns_sync.err.log
 ; Adjust environment variables for your environment (optional)
 ; environment=RNDC_RELOAD="yes"
--- a/src/bind/build/update-dns.sh
+++ b/src/bind/build/update-dns.sh
@ -0,0 +1,31 @@
 #!/bin/sh
 # update-dns.sh
 # Read IPs from /private/argus/etc/dns.conf and write them to /etc/resolv.conf
 DNS_CONF="/private/argus/etc/dns.conf"
 RESOLV_CONF="/etc/resolv.conf"
 # Check that the config file exists
 if [ ! -f "$DNS_CONF" ]; then
  echo "Config file does not exist: $DNS_CONF" >&2
  exit 1
 fi
 # Generate the resolv.conf content
 {
  while IFS= read -r ip; do
    # Skip blank lines and comments
    case "$ip" in
      \#*) continue ;;
      "") continue ;;
    esac
    echo "nameserver $ip"
  done < "$DNS_CONF"
 } > "$RESOLV_CONF".tmp
 # Overwrite /etc/resolv.conf with the generated content
 cat "$RESOLV_CONF".tmp > "$RESOLV_CONF"
 rm -f "$RESOLV_CONF".tmp
 echo "Updated $RESOLV_CONF"
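The loop in this script copies every non-comment line into `resolv.conf` without checking that it looks like an address. A minimal sketch of the same filtering as a reusable function with an added IPv4 shape check; the name `render_resolv_conf` and the validation step are assumptions for illustration, not part of the commit:

```bash
#!/bin/sh
# Hypothetical helper: read candidate addresses on stdin, print resolv.conf lines.
render_resolv_conf() {
  while IFS= read -r ip; do
    case "$ip" in
      \#*|"") continue ;;   # skip comments and blank lines, as the script does
    esac
    # Assumption: dns.conf is expected to hold dotted-quad IPv4 entries only.
    if printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
      echo "nameserver $ip"
    fi
  done
}

# Demo: the comment, the blank line, and the malformed entry are dropped.
printf '# primary\n\n10.0.0.1\nnot-an-ip\n' | render_resolv_conf
```

The check is deliberately loose (it only tests the shape, not the octet range); it is meant to catch stray text, not to be a full validator.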
--- a/src/bind/tests/docker-compose.yml
+++ b/src/bind/tests/docker-compose.yml
@ -0,0 +1,16 @@
 services:
  bind9:
    image: argus-bind9:latest
    container_name: argus-bind9-test
    ports:
      - "${HOST_DNS_PORT:-1053}:53/tcp"
      - "${HOST_DNS_PORT:-1053}:53/udp"
    volumes:
      - ./private:/private
    restart: unless-stopped
    networks:
      - bind-test-network
 networks:
  bind-test-network:
    driver: bridge
--- a/src/bind/tests/scripts/00_e2e_test.sh
+++ b/src/bind/tests/scripts/00_e2e_test.sh
@ -0,0 +1,118 @@
 #!/bin/bash
 # End-to-end test for BIND9 DNS server
 # This script runs all tests in sequence to validate the complete functionality
 # Usage: ./00_e2e_test.sh
 set -e
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 HOST_DNS_PORT="${HOST_DNS_PORT:-1053}"
 export HOST_DNS_PORT
 echo "=========================================="
 echo "BIND9 DNS Server End-to-End Test Suite"
 echo "=========================================="
 # Track test results
 total_tests=0
 passed_tests=0
 failed_tests=0
 # Function to run a test step
 run_test_step() {
    local step_name="$1"
    local script_name="$2"
    local description="$3"
    echo ""
    echo "[$step_name] $description"
    echo "$(printf '=%.0s' {1..50})"
    total_tests=$((total_tests + 1))
    if [ ! -f "$SCRIPT_DIR/$script_name" ]; then
        echo "✗ Test script not found: $script_name"
        failed_tests=$((failed_tests + 1))
        return 1
    fi
    # Make sure script is executable
    chmod +x "$SCRIPT_DIR/$script_name"
    # Run the test
    echo "Executing: $SCRIPT_DIR/$script_name"
    if "$SCRIPT_DIR/$script_name"; then
        echo "✓ $step_name completed successfully"
        passed_tests=$((passed_tests + 1))
        return 0
    else
        echo "✗ $step_name failed"
        failed_tests=$((failed_tests + 1))
        return 1
    fi
 }
 # Cleanup any previous test environment (but preserve the Docker image)
 echo ""
 echo "[SETUP] Cleaning up any previous test environment..."
 if [ -f "$SCRIPT_DIR/05_cleanup.sh" ]; then
    chmod +x "$SCRIPT_DIR/05_cleanup.sh"
    "$SCRIPT_DIR/05_cleanup.sh" || true
 fi
 echo ""
 echo "Starting BIND9 DNS server end-to-end test sequence..."
 # Test sequence
 run_test_step "TEST-01" "01_start_container.sh" "Start BIND9 container" || true
 run_test_step "TEST-02" "02_dig_test.sh" "Initial DNS resolution test" || true
 run_test_step "TEST-03" "03_reload_test.sh" "Configuration reload with IP modification" || true
 run_test_step "TEST-03.5" "03.5_dns_sync_test.sh" "DNS auto-sync functionality test" || true
 run_test_step "TEST-04" "04_persistence_test.sh" "Configuration persistence after restart" || true
 # Final cleanup (but preserve logs for review)
 echo ""
 echo "[CLEANUP] Cleaning up test environment..."
 run_test_step "CLEANUP" "05_cleanup.sh" "Clean up containers and networks" || true
 # Test summary
 echo ""
 echo "=========================================="
 echo "TEST SUMMARY"
 echo "=========================================="
 echo "Total tests: $total_tests"
 echo "Passed: $passed_tests"
 echo "Failed: $failed_tests"
 if [ $failed_tests -eq 0 ]; then
    echo ""
    echo "✅ ALL TESTS PASSED!"
    echo ""
    echo "BIND9 DNS server functionality validated:"
    echo "  ✓ Container startup and basic functionality"
    echo "  ✓ DNS resolution for configured domains"
    echo "  ✓ Configuration modification and reload"
    echo "  ✓ DNS auto-sync from IP files"
    echo "  ✓ Configuration persistence across restarts"
    echo "  ✓ Cleanup and resource management"
    echo ""
    echo "The BIND9 DNS server is ready for production use."
    exit 0
 else
    echo ""
    echo "❌ SOME TESTS FAILED!"
    echo ""
    echo "Please review the test output above to identify and fix issues."
    echo "You may need to:"
    echo "  - Check Docker installation and permissions"
    echo "  - Verify network connectivity"
    echo "  - Review BIND9 configuration files"
    echo "  - Check system resources and port availability"
    exit 1
 fi
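A note on counters under `set -e`: the post-increment form `((n++))` evaluates to the old value, so it returns a non-zero status when `n` is 0 and aborts an errexit script unless the call is guarded (for example by `|| true`). The sketch below shows the difference in a fresh `bash` process; the helper name `try_increment` is hypothetical:

```bash
#!/bin/bash
# Hypothetical helper: run one increment statement in a fresh bash with
# errexit enabled; print the new value, or "aborted" if the shell exited early.
try_increment() {
    bash -c "set -e; n=0; $1; echo \"\$n\"" || echo "aborted"
}

try_increment '((n++))'        # prints "aborted": the expression evaluates to 0
try_increment 'n=$((n + 1))'   # prints "1": a plain assignment returns status 0
```

A separate process is used on purpose: errexit is ignored inside any compound command that is itself part of an `||` list, so a subshell would hide the effect.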
--- a/src/bind/tests/scripts/01_start_container.sh
+++ b/src/bind/tests/scripts/01_start_container.sh
@ -0,0 +1,42 @@
 #!/bin/bash
 # Start BIND9 test container
 # Usage: ./01_start_container.sh
 set -e
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 TEST_DIR="$(dirname "$SCRIPT_DIR")"
 HOST_DNS_PORT="${HOST_DNS_PORT:-1053}"
 export HOST_DNS_PORT
 cd "$TEST_DIR"
 echo "Starting BIND9 test container..."
 # Ensure private directory exists with proper permissions
 mkdir -p private/argus/bind
 mkdir -p private/argus/etc
 chmod 777 private
 # Start the container
 docker compose up -d
 echo "Waiting for container to be ready..."
 sleep 5
 # Check if container is running
 if docker compose ps | grep -q "Up"; then
    echo "✓ Container started successfully"
    echo "Container status:"
    docker compose ps
 else
    echo "✗ Failed to start container"
    docker compose logs
    exit 1
 fi
 echo ""
 echo "BIND9 test environment is ready!"
 echo "DNS server listening on localhost:${HOST_DNS_PORT}"
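The fixed `sleep 5` above is a guess at how long the container needs. One possible alternative is to poll a readiness command with a timeout; the helper name `wait_until` is an assumption and not part of the commit:

```bash
#!/bin/bash
# Hypothetical helper: retry a command about once per second until it
# succeeds or the timeout (in seconds) is reached.
wait_until() {
    local timeout="$1"; shift
    local elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1          # gave up: the command never succeeded
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Demo with trivial commands so the sketch runs anywhere.
wait_until 2 true && echo "ready"
wait_until 1 false || echo "timed out"
```

In this script the polled command could be the existing status check wrapped in a function, for example one that runs `docker compose ps | grep -q "Up"`.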
--- a/src/bind/tests/scripts/02_dig_test.sh
+++ b/src/bind/tests/scripts/02_dig_test.sh
@ -0,0 +1,75 @@
 #!/bin/bash
 # Test DNS resolution using dig
 # Usage: ./02_dig_test.sh
 set -e
 HOST_DNS_PORT="${HOST_DNS_PORT:-1053}"
 echo "Testing DNS resolution with dig..."
 echo "Using DNS server localhost:${HOST_DNS_PORT}"
 # Function to test DNS query
 test_dns_query() {
    local hostname="$1"
    local expected_ip="$2"
    local description="$3"
    echo ""
    echo "Testing: $description"
    echo "Query: $hostname.argus.com"
    echo "Expected IP: $expected_ip"
    # Perform dig query
    result=$(dig @localhost -p "$HOST_DNS_PORT" "$hostname".argus.com A +short 2>/dev/null || echo "QUERY_FAILED")
    if [ "$result" = "QUERY_FAILED" ]; then
        echo "✗ DNS query failed"
        return 1
    elif [ "$result" = "$expected_ip" ]; then
        echo "✓ DNS query successful: $result"
        return 0
    else
        echo "✗ DNS query returned unexpected result: $result"
        return 1
    fi
 }
 # Check if dig is available
 if ! command -v dig &> /dev/null; then
    echo "Installing dig (dnsutils)..."
    apt-get update && apt-get install -y dnsutils
 fi
 # Check if container is running
 if ! docker compose ps | grep -q "Up"; then
    echo "Error: BIND9 container is not running"
    echo "Please start the container first with: ./01_start_container.sh"
    exit 1
 fi
 echo "=== DNS Resolution Tests ==="
 # Test cases based on current configuration
 failed_tests=0
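 # Test ns1.argus.com -> 127.0.0.1
 # (plain arithmetic assignment: ((failed_tests++)) returns 1 when the counter
 # is 0 and would abort the script under set -e before the summary is printed)
 if ! test_dns_query "ns1" "127.0.0.1" "Name server resolution"; then
    failed_tests=$((failed_tests + 1))
 fi
 # Test web.argus.com -> 12.4.5.6
 if ! test_dns_query "web" "12.4.5.6" "Web server resolution"; then
    failed_tests=$((failed_tests + 1))
 fi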
 echo ""
 echo "=== Test Summary ==="
 if [ $failed_tests -eq 0 ]; then
    echo "✓ All DNS tests passed!"
    exit 0
 else
    echo "✗ $failed_tests test(s) failed"
    exit 1
 fi
--- a/src/bind/tests/scripts/03.5_dns_sync_test.sh
+++ b/src/bind/tests/scripts/03.5_dns_sync_test.sh
@ -0,0 +1,259 @@
 #!/bin/bash
 # Test DNS auto-sync functionality using argus_dns_sync.sh
 # This test validates the automatic DNS record updates from IP files
 # Usage: ./03.5_dns_sync_test.sh
 set -e
 HOST_DNS_PORT="${HOST_DNS_PORT:-1053}"
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 TEST_DIR="$(dirname "$SCRIPT_DIR")"
 echo "=== DNS Auto-Sync Functionality Test ==="
 echo "Using DNS server localhost:${HOST_DNS_PORT}"
 # Check if container is running
 if ! docker compose ps | grep -q "Up"; then
    echo "Error: BIND9 container is not running"
    echo "Please start the container first with: ./01_start_container.sh"
    exit 1
 fi
 # Check if dig is available
 if ! command -v dig &> /dev/null; then
    echo "Installing dig (dnsutils)..."
    apt-get update && apt-get install -y dnsutils
 fi
 # Function to test DNS query
 test_dns_query() {
    local hostname="$1"
    local expected_ip="$2"
    local description="$3"
    echo "Testing: $description"
    echo "Query: $hostname.argus.com -> Expected: $expected_ip"
    # Wait a moment for DNS cache
    sleep 2
    result=$(dig @localhost -p "$HOST_DNS_PORT" "$hostname".argus.com A +short 2>/dev/null || echo "QUERY_FAILED")
    if [ "$result" = "$expected_ip" ]; then
        echo "✓ $result"
        return 0
    else
        echo "✗ Got: $result, Expected: $expected_ip"
        return 1
    fi
 }
 # Function to wait for sync to complete
 wait_for_sync() {
    local timeout=15
    local elapsed=0
    echo "Waiting for DNS sync to complete (max ${timeout}s)..."
    while [ $elapsed -lt $timeout ]; do
        if docker compose exec bind9 test -f /var/lock/argus_dns_sync.lock; then
            echo "Sync process is running..."
        else
            echo "Sync completed"
            sleep 2  # Extra wait for DNS propagation
            return 0
        fi
        sleep 2
        elapsed=$((elapsed + 2))
    done
    echo "Warning: Sync may still be running after ${timeout}s"
    return 0
 }
 echo ""
 echo "Step 1: Preparing test environment..."
 # Ensure required directories exist
 docker compose exec bind9 mkdir -p /private/argus/etc
 docker compose exec bind9 mkdir -p /private/argus/bind/.backup
 # Backup original configuration if it exists
 docker compose exec bind9 test -f /private/argus/bind/db.argus.com && \
    docker compose exec bind9 cp /private/argus/bind/db.argus.com /private/argus/bind/db.argus.com.backup.test || true
 # Ensure initial configuration is available (may already be symlinked)
 docker compose exec bind9 test -f /private/argus/bind/db.argus.com || \
    docker compose exec bind9 cp /etc/bind/db.argus.com /private/argus/bind/db.argus.com
 echo "✓ Test environment prepared"
 echo ""
 echo "Step 2: Testing initial DNS configuration..."
 # Get current IP for web.argus.com (may have been changed by previous tests)
 current_web_ip=$(dig @localhost -p "$HOST_DNS_PORT" web.argus.com A +short 2>/dev/null || echo "UNKNOWN")
 echo "Current web.argus.com IP: $current_web_ip"
 # Test that DNS is working (regardless of specific IP)
 if [ "$current_web_ip" = "UNKNOWN" ] || [ -z "$current_web_ip" ]; then
    echo "DNS resolution not working for web.argus.com"
    exit 1
 fi
 echo "✓ DNS resolution is working"
 echo ""
 echo "Step 3: Creating IP files for auto-sync..."
 # Create test IP files in the watch directory
 echo "Creating test1.argus.com with IP 10.0.0.100"
 docker compose exec bind9 bash -c 'echo "10.0.0.100" > /private/argus/etc/test1.argus.com'
 echo "Creating test2.argus.com with IP 10.0.0.200"
 docker compose exec bind9 bash -c 'echo "test2 service running on 10.0.0.200" > /private/argus/etc/test2.argus.com'
 echo "Creating api.argus.com with IP 192.168.1.50"
 docker compose exec bind9 bash -c 'echo "API server: 192.168.1.50 port 8080" > /private/argus/etc/api.argus.com'
 echo "✓ IP files created"
 echo ""
 echo "Step 4: Checking DNS sync process..."
 # Check if DNS sync process is already running (via supervisord)
 if docker compose exec bind9 pgrep -f argus_dns_sync.sh > /dev/null; then
    echo "✓ DNS sync process already running (via supervisord)"
 else
    echo "Starting DNS sync process manually..."
    # Start the DNS sync process in background if not running
    docker compose exec -d bind9 /usr/local/bin/argus_dns_sync.sh
    echo "✓ DNS sync process started manually"
 fi
 # Wait for first sync cycle
 wait_for_sync
 echo ""
 echo "Step 5: Testing auto-synced DNS records..."
 failed_tests=0
 # Test new DNS records created by auto-sync
 # (plain arithmetic assignment: ((failed_tests++)) returns 1 when the counter
 # is 0 and would abort the script under set -e)
 if ! test_dns_query "test1" "10.0.0.100" "Auto-synced test1.argus.com"; then
    failed_tests=$((failed_tests + 1))
 fi
 if ! test_dns_query "test2" "10.0.0.200" "Auto-synced test2.argus.com"; then
    failed_tests=$((failed_tests + 1))
 fi
 if ! test_dns_query "api" "192.168.1.50" "Auto-synced api.argus.com"; then
    failed_tests=$((failed_tests + 1))
 fi
 # Verify original records still work (use current IP from earlier)
 if ! test_dns_query "web" "$current_web_ip" "Original web.argus.com still working"; then
    failed_tests=$((failed_tests + 1))
 fi
 if ! test_dns_query "ns1" "127.0.0.1" "Original ns1.argus.com still working"; then
    failed_tests=$((failed_tests + 1))
 fi
 echo ""
 echo "Step 6: Testing IP update functionality..."
 # Update an existing IP file
 echo "Updating test1.argus.com IP from 10.0.0.100 to 10.0.0.150"
 docker compose exec bind9 bash -c 'echo "10.0.0.150" > /private/argus/etc/test1.argus.com'
 # Wait for sync
 wait_for_sync
 # Test updated record
 if ! test_dns_query "test1" "10.0.0.150" "Updated test1.argus.com IP"; then
    failed_tests=$((failed_tests + 1))
 fi
 echo ""
 echo "Step 7: Testing invalid IP handling..."
 # Create file with invalid IP
 echo "Creating invalid.argus.com with invalid IP"
 docker compose exec bind9 bash -c 'echo "this is not an IP address" > /private/argus/etc/invalid.argus.com'
 # Wait for sync (should skip invalid IP)
 wait_for_sync
 # Verify invalid record was not added (should fail to resolve)
 result=$(dig @localhost -p "$HOST_DNS_PORT" invalid.argus.com A +short 2>/dev/null || echo "NO_RESULT")
 if [ "$result" = "NO_RESULT" ] || [ -z "$result" ]; then
    echo "✓ Invalid IP correctly ignored"
 else
    echo "✗ Invalid IP was processed: $result"
    failed_tests=$((failed_tests + 1))
 fi
 echo ""
 echo "Step 8: Verifying backup functionality..."
 # Check if backups were created
 backup_count=$(docker compose exec bind9 ls -1 /private/argus/bind/.backup/ | wc -l || echo "0")
 if [ "$backup_count" -gt 0 ]; then
    echo "✓ Configuration backups created ($backup_count files)"
    # Show latest backup
    docker compose exec bind9 ls -la /private/argus/bind/.backup/ | tail -1
 else
    echo "✗ No backup files found"
    failed_tests=$((failed_tests + 1))
 fi
 echo ""
 echo "Step 9: Cleanup..."
 # Note: We don't stop the DNS sync process since it's managed by supervisord
 echo "Note: DNS sync process will continue running (managed by supervisord)"
 # Clean up test files
 docker compose exec bind9 rm -f /private/argus/etc/test1.argus.com
 docker compose exec bind9 rm -f /private/argus/etc/test2.argus.com
 docker compose exec bind9 rm -f /private/argus/etc/api.argus.com
 docker compose exec bind9 rm -f /private/argus/etc/invalid.argus.com
 # Restore original configuration if backup exists
 docker compose exec bind9 test -f /private/argus/bind/db.argus.com.backup.test && \
    docker compose exec bind9 cp /private/argus/bind/db.argus.com.backup.test /private/argus/bind/db.argus.com && \
    docker compose exec bind9 rm /private/argus/bind/db.argus.com.backup.test || true
 # Reload original configuration
 docker compose exec bind9 /usr/local/bin/reload-bind9.sh
 echo "✓ Cleanup completed"
 echo ""
 echo "=== DNS Auto-Sync Test Summary ==="
 if [ $failed_tests -eq 0 ]; then
    echo "✅ All DNS auto-sync tests passed!"
    echo ""
    echo "Validated functionality:"
    echo "  ✓ Automatic DNS record creation from IP files"
    echo "  ✓ IP address extraction from various file formats"
    echo "  ✓ Dynamic DNS record updates"
    echo "  ✓ Invalid IP address handling"
    echo "  ✓ Configuration backup mechanism"
    echo "  ✓ Preservation of existing DNS records"
    echo ""
    echo "The DNS auto-sync functionality is working correctly!"
    exit 0
 else
    echo "❌ $failed_tests DNS auto-sync test(s) failed!"
    echo ""
    echo "Please check:"
    echo "  - argus_dns_sync.sh script configuration"
    echo "  - File permissions in /private/argus/etc/"
    echo "  - BIND9 reload functionality"
    echo "  - Network connectivity and DNS resolution"
    exit 1
 fi
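Steps 3 and 7 of this test rely on the sync script pulling an address out of free-form file content and ignoring content with no address. How `argus_dns_sync.sh` does this is not shown here, so the following is only a sketch of one way to do it; the helper name `extract_ipv4` is hypothetical:

```bash
#!/bin/bash
# Hypothetical helper: print the first valid IPv4 address found in the text on
# stdin; return non-zero if there is none.
extract_ipv4() {
    # grep lists dotted-quad candidates; awk keeps the first with octets <= 255.
    grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' |
        awk -F. '$1 <= 255 && $2 <= 255 && $3 <= 255 && $4 <= 255 { print; found = 1; exit }
                 END { exit !found }'
}

# Demo using the same file contents the test writes.
echo "API server: 192.168.1.50 port 8080" | extract_ipv4
echo "this is not an IP address" | extract_ipv4 || echo "no address found"
```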
--- a/src/bind/tests/scripts/03_reload_test.sh
+++ b/src/bind/tests/scripts/03_reload_test.sh
@ -0,0 +1,115 @@
 #!/bin/bash
 # Test DNS configuration reload with IP modification
 # Usage: ./03_reload_test.sh
 set -e
 HOST_DNS_PORT="${HOST_DNS_PORT:-1053}"
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 TEST_DIR="$(dirname "$SCRIPT_DIR")"
 echo "=== DNS Configuration Reload Test ==="
 echo "Using DNS server localhost:${HOST_DNS_PORT}"
 # Check if container is running
 if ! docker compose ps | grep -q "Up"; then
    echo "Error: BIND9 container is not running"
    echo "Please start the container first with: ./01_start_container.sh"
    exit 1
 fi
 # Check if dig is available
 if ! command -v dig &> /dev/null; then
    echo "Installing dig (dnsutils)..."
    apt-get update && apt-get install -y dnsutils
 fi
 # Function to test DNS query
 test_dns_query() {
    local hostname="$1"
    local expected_ip="$2"
    local description="$3"
    echo "Testing: $description"
    echo "Query: $hostname.argus.com -> Expected: $expected_ip"
    result=$(dig @localhost -p "$HOST_DNS_PORT" "$hostname".argus.com A +short 2>/dev/null || echo "QUERY_FAILED")
    if [ "$result" = "$expected_ip" ]; then
        echo "✓ $result"
        return 0
    else
        echo "✗ Got: $result, Expected: $expected_ip"
        return 1
    fi
 }
 echo ""
 echo "Step 1: Testing initial DNS configuration..."
 # Test initial configuration
 if ! test_dns_query "web" "12.4.5.6" "Initial web.argus.com resolution"; then
    echo "Initial DNS test failed"
    exit 1
 fi
 echo ""
 echo "Step 2: Modifying DNS configuration..."
 # Backup original configuration
 cp "$TEST_DIR/private/argus/bind/db.argus.com" "$TEST_DIR/private/argus/bind/db.argus.com.backup" 2>/dev/null || true
 # Create new configuration with modified IP
 DB_FILE="$TEST_DIR/private/argus/bind/db.argus.com"
 # Check if persistent config exists, if not use from container
 if [ ! -f "$DB_FILE" ]; then
    echo "Persistent config not found, copying from container..."
    docker compose exec bind9 cp /etc/bind/db.argus.com /private/argus/bind/db.argus.com
    docker compose exec bind9 chown bind:bind /private/argus/bind/db.argus.com
 fi
 # Modify the IP address (12.4.5.6 -> 192.168.1.100)
 sed -i 's/12\.4\.5\.6/192.168.1.100/g' "$DB_FILE"
 # Bump the SOA serial to mark the zone as changed
 # (the patterns below assume the current serial is exactly 2)
 current_serial=$(grep -o "2[[:space:]]*;" "$DB_FILE" | grep -o "2")
 new_serial=$((current_serial + 1))
 sed -i "s/2[[:space:]]*;/${new_serial}         ;/" "$DB_FILE"
 echo "Modified configuration:"
 echo "- Changed web.argus.com IP: 12.4.5.6 -> 192.168.1.100"
 echo "- Updated serial number: $current_serial -> $new_serial"
 echo ""
 echo "Step 3: Reloading BIND9 configuration..."
 # Reload BIND9 configuration
 docker compose exec bind9 /usr/local/bin/reload-bind9.sh
 echo "Configuration reloaded"
 # Wait a moment for changes to take effect
 sleep 3
 echo ""
 echo "Step 4: Testing modified DNS configuration..."
 # Test modified configuration
 if ! test_dns_query "web" "192.168.1.100" "Modified web.argus.com resolution"; then
    echo "Modified DNS test failed"
    exit 1
 fi
 # Also verify ns1 still works
 if ! test_dns_query "ns1" "127.0.0.1" "ns1.argus.com still working"; then
    echo "ns1 DNS test failed after reload"
    exit 1
 fi
 echo ""
 echo "✓ DNS configuration reload test completed successfully!"
 echo "✓ IP address changed from 12.4.5.6 to 192.168.1.100"
 echo "✓ Configuration persisted and reloaded correctly"
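The serial update in Step 2 only works while the serial is the single digit 2. A sketch of a more general increment follows; it assumes the serial line in the zone file carries a `; Serial` comment, which is common in BIND zone templates but not confirmed for `db.argus.com`, and the helper name `bump_serial` is hypothetical:

```bash
#!/bin/bash
# Hypothetical helper: read a zone file on stdin and print it with the first
# number on the "; Serial" line incremented by one.
bump_serial() {
    awk '
        !done && /;[ \t]*[Ss]erial/ {
            if (match($0, /[0-9]+/)) {
                # sprintf keeps large serials (e.g. YYYYMMDDNN) in integer form.
                serial = sprintf("%.0f", substr($0, RSTART, RLENGTH) + 1)
                $0 = substr($0, 1, RSTART - 1) serial substr($0, RSTART + RLENGTH)
                done = 1
            }
        }
        { print }
    '
}

# Demo on a two-line fragment; only the serial line changes.
printf '        2         ; Serial\n   604800         ; Refresh\n' | bump_serial
```

To rewrite a file in place, the output would go to a temporary file that is then moved over the original.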
--- a/src/bind/tests/scripts/04_persistence_test.sh
+++ b/src/bind/tests/scripts/04_persistence_test.sh
@ -0,0 +1,118 @@
 #!/bin/bash
 # Test configuration persistence after container restart
 # Usage: ./04_persistence_test.sh
 set -e
 HOST_DNS_PORT="${HOST_DNS_PORT:-1053}"
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 TEST_DIR="$(dirname "$SCRIPT_DIR")"
 echo "=== Configuration Persistence Test ==="
 echo "Using DNS server localhost:${HOST_DNS_PORT}"
 # Check if dig is available
 if ! command -v dig &> /dev/null; then
    echo "Installing dig (dnsutils)..."
    apt-get update && apt-get install -y dnsutils
 fi
 # Function to test DNS query
 test_dns_query() {
    local hostname="$1"
    local expected_ip="$2"
    local description="$3"
    echo "Testing: $description"
    echo "Query: $hostname.argus.com -> Expected: $expected_ip"
    result=$(dig @localhost -p "$HOST_DNS_PORT" "$hostname".argus.com A +short 2>/dev/null || echo "QUERY_FAILED")
    if [ "$result" = "$expected_ip" ]; then
        echo "✓ $result"
        return 0
    else
        echo "✗ Got: $result, Expected: $expected_ip"
        return 1
    fi
 }
 echo ""
 echo "Step 1: Stopping current container..."
 # Stop the container
 docker compose down
 echo "Container stopped"
 echo ""
 echo "Step 2: Verifying persistent configuration exists..."
 # Check if modified configuration exists
 DB_FILE="$TEST_DIR/private/argus/bind/db.argus.com"
 if [ ! -f "$DB_FILE" ]; then
    echo "✗ Persistent configuration file not found: $DB_FILE"
    exit 1
 fi
 # Check if the modified IP is in the configuration
 if grep -qF "192.168.1.100" "$DB_FILE"; then
    echo "✓ Modified IP (192.168.1.100) found in persistent configuration"
 else
    echo "✗ Modified IP not found in persistent configuration"
    echo "Configuration content:"
    cat "$DB_FILE"
    exit 1
 fi
 echo ""
 echo "Step 3: Restarting container with persistent configuration..."
 # Start the container again
 docker compose up -d
 echo "Waiting for container to be ready..."
 sleep 5
 # Check if container is running
 if ! docker compose ps | grep -q "Up"; then
    echo "✗ Failed to restart container"
    docker compose logs
    exit 1
 fi
 echo "✓ Container restarted successfully"
 echo ""
 echo "Step 4: Testing DNS resolution after restart..."
 # Wait a bit more for DNS to be fully ready
 sleep 5
 # Test that the modified configuration is still active
 if ! test_dns_query "web" "192.168.1.100" "Persistent web.argus.com resolution"; then
    echo "✗ Persistent configuration test failed"
    exit 1
 fi
 # Also verify ns1 still works
 if ! test_dns_query "ns1" "127.0.0.1" "ns1.argus.com still working"; then
    echo "✗ ns1 DNS test failed after restart"
    exit 1
 fi
 echo ""
 echo "Step 5: Verifying configuration files are linked correctly..."
 # Check that the persistent files are properly linked
 echo "Checking file links in container:"
 docker compose exec bind9 ls -la /etc/bind/named.conf.local /etc/bind/db.argus.com
 echo ""
 echo "✓ Configuration persistence test completed successfully!"
 echo "✓ Modified IP (192.168.1.100) persisted after container restart"
 echo "✓ Configuration files properly linked to persistent storage"
 echo "✓ DNS resolution working correctly with persisted configuration"
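Step 5 lists the files but does not assert anything about them. A minimal sketch of an actual check, assuming GNU `readlink -f` is available; the helper name `is_linked_to` and the scratch paths are illustrative only:

```bash
#!/bin/bash
# Hypothetical helper: succeed only if $1 is a symlink that resolves to the
# same file as $2.
is_linked_to() {
    [ -L "$1" ] && [ "$(readlink -f "$1")" = "$(readlink -f "$2")" ]
}

# Self-contained demo in a scratch directory.
demo_dir=$(mktemp -d)
echo "zone data" > "$demo_dir/db.persistent"
ln -sf "$demo_dir/db.persistent" "$demo_dir/db.active"
if is_linked_to "$demo_dir/db.active" "$demo_dir/db.persistent"; then
    echo "linked"
else
    echo "not linked"
fi
rm -rf "$demo_dir"
```

For the container, the same comparison could be made on the output of `docker compose exec -T bind9 readlink -f /etc/bind/db.argus.com` against `/private/argus/bind/db.argus.com`.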
--- a/src/bind/tests/scripts/05_cleanup.sh
+++ b/src/bind/tests/scripts/05_cleanup.sh
@ -0,0 +1,90 @@
 #!/bin/bash
 # Clean up test environment and containers
 # Usage: ./05_cleanup.sh [--full]
 set -e
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 TEST_DIR="$(dirname "$SCRIPT_DIR")"
 HOST_DNS_PORT="${HOST_DNS_PORT:-1053}"
 export HOST_DNS_PORT
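 # Parse command line arguments
 # NOTE: FULL_CLEANUP already defaults to true, so --full is currently a no-op
 # and persistent data is removed on every run.
 FULL_CLEANUP=true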
 while [[ $# -gt 0 ]]; do
    case $1 in
        --full)
            FULL_CLEANUP=true
            shift
            ;;
        *)
            echo "Unknown option: $1"
            echo "Usage: $0 [--full]"
            echo "  --full: Also remove persistent data"
            exit 1
            ;;
    esac
 done
 cd "$TEST_DIR"
 echo "=== Cleaning up BIND9 test environment ==="
 echo ""
 echo "Step 1: Stopping and removing containers..."
 # Stop and remove containers
 docker compose down -v
 echo "✓ Containers stopped and removed"
 echo ""
 echo "Step 2: Removing Docker networks..."
 # Clean up networks
 docker network prune -f > /dev/null 2>&1 || true
 echo "✓ Docker networks cleaned"
 if [ "$FULL_CLEANUP" = true ]; then
    echo ""
    echo "Step 3: Removing persistent data..."
    # Remove persistent data directory
    if [ -d "private" ]; then
        rm -rf private
        echo "✓ Persistent data directory removed"
    else
        echo "✓ No persistent data directory found"
    fi
 else
    echo ""
    echo "Step 3: Preserving persistent data and Docker image..."
    echo "✓ Persistent data preserved in: private/"
    echo "✓ Docker image 'argus-bind9:latest' preserved"
    echo ""
    echo "To perform full cleanup including persistent data and image, run:"
    echo "  $0 --full"
 fi
 echo ""
 echo "=== Cleanup Summary ==="
 echo "✓ Containers stopped and removed"
 echo "✓ Docker networks cleaned"
 if [ "$FULL_CLEANUP" = true ]; then
    echo "✓ Persistent data removed"
    echo ""
    echo "Full cleanup completed! Test environment completely removed."
 else
    echo "✓ Persistent data preserved"
    echo "✓ Docker image preserved"
    echo ""
    echo "Basic cleanup completed! Run './01_start_container.sh' to restart testing."
 fi
 echo ""
 echo "Test environment cleanup finished."
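As written, `FULL_CLEANUP` starts as `true`, so the `--full` flag has no effect and the "preserve" branch is never reached. If opt-in behaviour is what the usage text intends, the parsing could look like the sketch below. This is an assumption about intent: callers that need a clean slate, such as the setup step in `00_e2e_test.sh`, would then have to pass `--full`. The function name `parse_cleanup_args` is hypothetical.

```bash
#!/bin/bash
# Hypothetical variant of the argument parsing with an opt-in full cleanup.
parse_cleanup_args() {
    FULL_CLEANUP=false            # assumption: keep persistent data by default
    while [ $# -gt 0 ]; do
        case "$1" in
            --full)
                FULL_CLEANUP=true
                shift
                ;;
            *)
                echo "Unknown option: $1" >&2
                return 1
                ;;
        esac
    done
}

parse_cleanup_args --full && echo "FULL_CLEANUP=$FULL_CLEANUP"
```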
Author	SHA1	Message	Date
yuyr	4c45166b44	[#50 ] Completed building ARM images with buildx on x86_64; rebuilt the core/web/prom/grafana and cpu node images; swarm test for ARM passes; temporarily removed the dcgm and log path features	2025-11-24 16:31:39 +08:00
yuyr	2caf0fa214	[#49 ] Improved swarm test to support automatic reboot and verify	2025-11-20 15:21:18 +08:00
yuyr	d4e0dc1511	[#49 ] swarm test reboot test passes	2025-11-19 17:26:26 +08:00
yuyr	1d38304936	[#49 ] Removed bind/ftp in favor of Docker's built-in name resolution, so access still works when IPs change after a restart; swarm test passes	2025-11-18 15:10:38 +08:00
yuyr	5b617f62a8	[#43 ] Separated server & client into different compose projects; changed the require compose detection	2025-11-17 12:17:11 +08:00
yuyr	69e7a3e2b8	[#47 ] Added cpu bundle image build	2025-11-14 16:43:34 +08:00
yuyr	b402fdf960	[#37 ] Removed the old deployment build directory; src/sys/tests passes; added one-click build of server pkg and client gpu pkg	2025-11-14 15:07:24 +08:00
yuyr	fff90826a4	[#41 ] Added one-click build of server_pkg and client_pkg from source	2025-11-13 15:02:33 +08:00
yuyr	d0411e6b97	[#41 ] Optimized the gpu bundle build	2025-11-13 10:28:37 +08:00
yuyr	06131a268a	[#40 ] Made the log directory writable by other programs on the host	2025-11-12 15:06:37 +08:00
yuyr	df1f519355	[#39 ] Added a new docker compose deployment method, verified with a gpu deployment; remaining issues: fluent-bit logs directory permissions and gpu bundle packaging optimization	2025-11-12 12:07:04 +08:00
yuyr	6837d96035	[#39 ] Added busybox warmup to join the overlay network	2025-11-10 12:17:10 +08:00
yuyr	dac180f12b	[#37 ] Fixed dcgm exporter startup	2025-11-07 17:29:06 +08:00
yuyr	1819fb9c46	[#37 ] Fixed the alert image user	2025-11-07 12:21:41 +08:00
yuyr	7548e46d1f	[#37 ] Added gpu bundle node image build	2025-11-07 10:23:59 +08:00
yuyr	0b9268332f	[#37 ] Fixed the log timestamp test issue	2025-11-06 17:20:40 +08:00
yuyr	d1fad4a05a	[#37 ] Added sys/swarm_tests (cpu); separately built node bundle image	2025-11-06 16:43:14 +08:00
yuyr	94b3e910b3	[#37 ] Added automatic free-port detection at deploy time; added ES watermark detection and temporary emergency handling	2025-11-05 16:21:34 +08:00
yuyr	2ff7c55f3b	[#37 ] Tested cross-host node deployment with swarm; updated docs	2025-11-05 09:57:08 +08:00
yuyr	9858f4471e	[#37 ] server install: added retry self-check	2025-11-05 09:57:08 +08:00
yuyr	c8279997a4	[#37 ] swarm deployment optimization	2025-11-05 09:57:08 +08:00
yuyr	4ed5c64804	[#37 ] Optimized the client build	2025-11-05 09:57:08 +08:00
yuyr	3551360687	[#37 ] Optimized the client install package	2025-11-05 09:57:08 +08:00
yuyr	3202e02b42	[#37 ] Server deployment on NixOS tested successfully	2025-11-05 09:57:07 +08:00
yuyr	29eb75a374	[#37 ] Build install packages	2025-11-05 09:57:07 +08:00
yuyr	ccc141f557	[#30 ] ftp container: added dynamic detection and update of dns.conf to the share directory	2025-11-05 09:57:07 +08:00
yuyr	ed0d1ca904	[#30 ] WSL deployment test; updated README	2025-11-05 09:57:07 +08:00
xuxt	b6da5bc8b8	dev_1.0.0_xuxt_3 Completed integration testing of web and alert (#38 ) Co-authored-by: xiuting.xu <xiutingxt.xu@gmail.com> Reviewed-on: #38 Reviewed-by: huhy <husteryezi@163.com> Reviewed-by: yuyr <yuyr@zgclab.edu.cn> Reviewed-by: sundapeng <sundp@mail.zgclab.edu.cn>	2025-10-31 14:18:19 +08:00
yuyr	59a38513a4	Completed integration of build, deployment, and testing for the a6000 test system (#35 ) Test plan: - Map the lm2 machine's ports to the local machine: 18080, 18081, 8082-8085 - Access URL: http://localhost:18080/dashboard ![image.png](/attachments/30ed6e20-697a-4d3b-a6d3-6acccd2e9922) ![image.png](/attachments/38ef1751-0f3b-49c6-9100-f70d15617acc) ![image.png](/attachments/3be45005-9b9e-4165-8ef6-1d27405800f1) ![image.png](/attachments/eb916192-edc1-4096-8f9f-9769ab6d9039) ![image.png](/attachments/620e6efc-bd02-45ae-bba1-99a95a1b4c02) ![image.png](/attachments/986e77e7-c687-405f-a760-93282249f72f) End-to-end test passed: ![image.png](/attachments/c6e29875-4a16-4718-8b2f-368f64eb545e) Co-authored-by: sundapeng.sdp <sundapeng@hashdata.cn> Reviewed-on: #35 Reviewed-by: xuxt <xuxt@zgclab.edu.cn> Reviewed-by: sundapeng <sundp@mail.zgclab.edu.cn> Reviewed-by: huhy <husteryezi@163.com>	2025-10-29 10:04:27 +08:00
xuxt	d1b89c0cf6	dev_1.0.0_xuxt_2 Updated the reverse proxy, packaged images, and the README (#28 ) Co-authored-by: xiuting.xu <xiutingxt.xu@gmail.com> Reviewed-on: #28 Reviewed-by: yuyr <yuyr@zgclab.edu.cn> Reviewed-by: huhy <husteryezi@163.com> Reviewed-by: sundapeng <sundp@mail.zgclab.edu.cn>	2025-10-20 09:45:32 +08:00
sundapeng	1a768bc837	dev_1.0.0_sundp_2 Optimized the e2e deployment test flow of the Argus-metric module (#27 ) Co-authored-by: sundapeng.sdp <sundapeng@hashdata.cn> Reviewed-on: #27 Reviewed-by: yuyr <yuyr@zgclab.edu.cn> Reviewed-by: xuxt <xuxt@zgclab.edu.cn>	2025-10-17 17:15:55 +08:00
yuyr	31ccb0b1b8	Added sys/debug deployment test; optimized agent dev/user/instance metadata extraction; optimized sys/tests (#26 ) Reviewed-on: #26 Reviewed-by: xuxt <xuxt@zgclab.edu.cn> Reviewed-by: huhy <husteryezi@163.com> Reviewed-by: sundapeng <sundp@mail.zgclab.edu.cn>	2025-10-16 17:16:07 +08:00
xuxt	8fbe107ac9	dev_1.0.0_xuxt Completed web and alert module development and module e2e tests (#21 ) Co-authored-by: xiuting.xu <xiutingxt.xu@gmail.com> Reviewed-on: #21 Reviewed-by: huhy <husteryezi@163.com> Reviewed-by: sundapeng <sundp@mail.zgclab.edu.cn> Reviewed-by: yuyr <yuyr@zgclab.edu.cn>	2025-10-14 10:20:45 +08:00
sundapeng	c098f1d3ce	dev_1.0.0_sundp Completed the Metric module and module e2e tests (#18 ) Co-authored-by: sundapeng.sdp <sundapeng@hashdata.cn> Reviewed-on: #18 Reviewed-by: xuxt <xuxt@zgclab.edu.cn> Reviewed-by: yuyr <yuyr@zgclab.edu.cn> Reviewed-by: huhy <husteryezi@163.com>	2025-10-11 17:15:06 +08:00
yuyr	1e5e91b193	dev_1.0.0_yuyr_2: resubmitted PR, added master/agent and system integration tests (#17 ) Reviewed-on: #17 Reviewed-by: sundapeng <sundp@mail.zgclab.edu.cn> Reviewed-by: xuxt <xuxt@zgclab.edu.cn>	2025-10-11 15:04:46 +08:00
yuyr	8a38d3d0b2	dev_1.0.0_yuyr Completed development, deployment, and testing of the log and bind modules (#8 ) - [x] Completed the log module image build and the local end-to-end flow of writing logs, collecting them, and querying them; - [x] Completed the bind module build; - [x] Built-in script that auto-updates domain IPs, syncing from files under /private/argus/etc; the container writes its IP on startup, and a scheduled task refreshes the DNS server IP and DNS rules; Co-authored-by: root <root@curious.host.com> Reviewed-on: #8 Reviewed-by: sundapeng <sundp@mail.zgclab.edu.cn>	2025-09-22 16:39:38 +08:00
yuyr	26e1c964ed	init project	2025-09-15 11:00:03 +08:00
@ -0,0 +1 @@
 src/metric/client-plugins/all-in-one-full/plugins/*/bin/* filter=lfs diff=lfs merge=lfs -text
@ -0,0 +1,2 @@
 # Local overrides for build user/group settings
 build_user.local.conf