Compare commits

..

22 Commits

Author SHA1 Message Date
sundapeng.sdp
b6d49ea562 feat: 优化Argus-Metric测试流程README.md;
refs #3
2025-10-11 16:50:28 +08:00
sundapeng.sdp
5211769ba8 feat: 测试环境搭建,e2e测试;
refs #3
2025-10-11 15:52:50 +08:00
sundapeng.sdp
28ef5df6e4 feat: 动态监测各组件健康状态,组件进程失败后动态拉起;
refs #11
2025-10-11 15:52:50 +08:00
sundapeng.sdp
687d29993a feat: 创建Grafana Dashboard基础配置;各镜像构建新增dns-monitor.sh脚本及相关定时任务; 2025-10-11 15:52:50 +08:00
sundapeng.sdp
54185101f0 feat: Dashboard支持 instance、hostname 不同var的过滤数据;
refs #12
2025-10-11 15:52:50 +08:00
sundapeng.sdp
018cdb40cc fix: 修复组件统一安装时,各组件入参不匹配的问题;
refs #11
2025-10-11 15:52:49 +08:00
sundapeng.sdp
36290bb08f feat: 基于原镜像构建对应的版本,解决账户兼容与持久化问题;创建 Dashboard 模版,default_dashboard.json;
refs #12
2025-10-11 15:52:49 +08:00
sundapeng.sdp
d514283cda fix: 修复 exporter_config.json 拷贝失败的问题;
refs #9
2025-10-11 15:52:49 +08:00
sundapeng.sdp
bfa1d71c49 feat: 定时任务拉取/检测变更dns.conf,更新客户端/etc/resolv.conf;定时任务拉取新版本更新;
refs #11
2025-10-11 15:52:49 +08:00
sundapeng.sdp
cb659f581b feat: 设置全局环境变量;集成fluent-bit安装包;设置组件健康检查目录;
refs #11
2025-10-11 15:52:49 +08:00
sundapeng.sdp
55f9317a1b feat: 用户侧安装组件前,提前下载全局配置文件并设置环境变量;集成 fluent-bit 安装包;
refs #11
2025-10-11 15:52:49 +08:00
sundapeng.sdp
f305a65abf feat: 打包分发程序目录调整
refs #11
2025-10-11 15:52:49 +08:00
sundapeng.sdp
46c34f3de6 feat: 定时解析argus-master产生的资源文件,生成动态配置文件,用于 Prometheus 热加载配置;
refs #9
2025-10-11 15:52:48 +08:00
sundapeng.sdp
8e2d751c2c fix: 修复启动服务失败导致的二次安装问题;修复缺少deps系统依赖问题;
refs #3
2025-10-11 15:52:48 +08:00
sundapeng.sdp
ab44edc22f feat: FTP镜像构建;加载vsftpd配置;兼容特定账户;配置共享目录;
refs #10
2025-10-11 15:52:48 +08:00
sundapeng.sdp
586740a17a feat: 基于算力平台的Prometheus镜像改造,supervisor自启应用;调整Prometheus.yml结构;
refs #9
2025-10-11 15:52:48 +08:00
sundapeng.sdp
5fa6e80578 feat: FTP服务器离线安装及配置;数据采集客户端支持一键部署、版本校验、组件健康检查、失败回滚等功能;
refs #3
2025-10-11 15:52:48 +08:00
sundapeng.sdp
e941537dd5 feat: 提交示例Node-Exporter安装包;
refs #3
2025-10-11 15:52:48 +08:00
sundapeng.sdp
4fb8652177 feat: 支持all-in-one模式构建、发布安装包;支持各组件健康状态监测;Prometheus相关配置文件;
refs #3
2025-10-11 15:52:48 +08:00
1e5e91b193 dev_1.0.0_yuyr_2:重新提交 PR,增加 master/agent 以及系统集成测试 (#17)
Reviewed-on: #17
Reviewed-by: sundapeng <sundp@mail.zgclab.edu.cn>
Reviewed-by: xuxt <xuxt@zgclab.edu.cn>
2025-10-11 15:04:46 +08:00
8a38d3d0b2 dev_1.0.0_yuyr 完成 log和bind模块开发部署测试 (#8)
- [x] 完成log模块镜像构建、本地端到端写日志——收集——查询流程;
- [x] 完成bind模块构建;
- [x] 内置域名IP自动更新脚本,使用 /private/argus/etc目录下文件进行同步,容器启动时自动写IP,定时任务刷新更新DNS服务器IP和DNS规则;

Co-authored-by: root <root@curious.host.com>
Reviewed-on: #8
Reviewed-by: sundapeng <sundp@mail.zgclab.edu.cn>
2025-09-22 16:39:38 +08:00
26e1c964ed init project 2025-09-15 11:00:03 +08:00
343 changed files with 1186 additions and 27917 deletions

1
.gitattributes vendored
View File

@ -1 +0,0 @@
src/metric/client-plugins/all-in-one-full/plugins/*/bin/* filter=lfs diff=lfs merge=lfs -text

View File

@ -1,150 +0,0 @@
# ARGUS 统一构建脚本使用说明build/build_images.sh
本目录提供单一入口脚本 `build/build_images.sh`,覆盖常见三类场景:
- 系统集成测试src/sys/tests
- Swarm 系统集成测试src/sys/swarm_tests
- 构建离线安装包deployment_newServer/ClientGPU
文档还说明 UID/GID 取值规则、镜像 tag 策略、常用参数与重试机制。
## 环境前置
- Docker Engine ≥ 20.10(建议 ≥ 23.x/24.x
- Docker Compose v2`docker compose` 子命令)
- 可选:内网构建镜像源(`--intranet`
## UID/GID 规则(用于容器内用户/卷属主)
- 非 pkg 构建core/master/metric/web/alert/sys/gpu_bundle/cpu_bundle
- 读取 `configs/build_user.local.conf``configs/build_user.conf`
- 可被环境变量覆盖:`ARGUS_BUILD_UID``ARGUS_BUILD_GID`
- pkg 构建(`--only server_pkg``--only client_pkg`
- 读取 `configs/build_user.pkg.conf`(优先)→ `build_user.local.conf``build_user.conf`
- 可被环境变量覆盖;
- CPU bundle 明确走“非 pkg”链不读取 `build_user.pkg.conf`)。
- 说明:仅依赖 UID/GID 的 Docker 层会因参数变动而自动重建,不同构建剖面不会“打错包”。
## 镜像 tag 策略
- 非 pkg 构建:默认输出 `:latest`
- `--only server_pkg`:所有镜像直接输出为 `:<VERSION>`(不覆盖 `:latest`)。
- `--only client_pkg`GPU bundle 仅输出 `:<VERSION>`(不覆盖 `:latest`)。
- `--only cpu_bundle`:默认仅输出 `:<VERSION>`;可加 `--tag-latest` 同时打 `:latest` 以兼容 swarm_tests 默认 compose。
## 不加 --only 的默认构建目标
不指定 `--only` 时,脚本会构建“基础镜像集合”(不含 bundle 与安装包):
- core`argus-elasticsearch:latest``argus-kibana:latest``argus-bind9:latest`
- master`argus-master:latest`(非 offline
- metric`argus-metric-ftp:latest``argus-metric-prometheus:latest``argus-metric-grafana:latest`
- web`argus-web-frontend:latest``argus-web-proxy:latest`
- alert`argus-alertmanager:latest`
- sys`argus-sys-node:latest``argus-sys-metric-test-node:latest``argus-sys-metric-test-gpu-node:latest`
说明:默认 tag 为 `:latest`UID/GID 走“非 pkg”链`build_user.local.conf → build_user.conf`,可被环境变量覆盖)。
## 通用参数
- `--intranet`:使用内网构建参数(各 Dockerfile 中按需启用)。
- `--no-cache`:禁用 Docker 层缓存。
- `--only <list>`:逗号分隔目标,例:`--only core,master,metric,web,alert`
- `--version YYMMDD`bundle/pkg 的日期标签(必填于 cpu_bundle/gpu_bundle/server_pkg/client_pkg
- `--client-semver X.Y.Z`allinonefull 客户端语义化版本(可选)。
- `--cuda VER`GPU bundle CUDA 基镜版本(默认 12.2.2)。
- `--tag-latest`CPU bundle 构建时同时打 `:latest`
## 自动重试
- 构建单镜像失败会自动重试(默认 3 次,间隔 5s
- 最后一次自动使用 `DOCKER_BUILDKIT=0` 再试,缓解 “failed to receive status: context canceled”。
- 可调:`ARGUS_BUILD_RETRIES``ARGUS_BUILD_RETRY_DELAY` 环境变量。
---
## 场景一系统集成测试src/sys/tests
构建用于系统级端到端测试的镜像(默认 `:latest`)。
示例:
```
# 构建核心与周边
./build/build_images.sh --only core,master,metric,web,alert,sys
```
产出:
- 本地镜像:`argus-elasticsearch:latest``argus-kibana:latest``argus-master:latest``argus-metric-ftp:latest``argus-metric-prometheus:latest``argus-metric-grafana:latest``argus-alertmanager:latest``argus-web-frontend:latest``argus-web-proxy:latest``argus-sys-node:latest` 等。
说明:
- UID/GID 读取 `build_user.local.conf → build_user.conf`(或环境变量覆盖)。
- sys/tests 的执行见 `src/sys/tests/README.md`
---
## 场景二Swarm 系统集成测试src/sys/swarm_tests
需要服务端镜像 + CPU 节点 bundle 镜像。
步骤:
1) 构建服务端镜像(默认 `:latest`
```
./build/build_images.sh --only core,master,metric,web,alert
```
2) 构建 CPU bundle直接 FROM ubuntu:22.04
```
# 仅版本 tag 输出
./build/build_images.sh --only cpu_bundle --version 20251114
# 若要兼容 swarm_tests 默认 latest
./build/build_images.sh --only cpu_bundle --version 20251114 --tag-latest
```
3) 运行 Swarm 测试
```
cd src/sys/swarm_tests
# 如未打 latest可先指定
export NODE_BUNDLE_IMAGE_TAG=argus-sys-metric-test-node-bundle:20251114
./scripts/01_server_up.sh
./scripts/02_wait_ready.sh
./scripts/03_nodes_up.sh
./scripts/04_metric_verify.sh # 验证 Prometheus/Grafana/nodes.json 与日志通路
./scripts/99_down.sh # 结束
```
产出:
- 本地镜像:`argus-*:latest``argus-sys-metric-test-node-bundle:20251114`(或 latest
- `swarm_tests/private-*`:运行态持久化文件。
说明:
- CPU bundle 构建用户走“非 pkg”链local.conf → conf
- `04_metric_verify.sh` 已内置 Fluent Bit 启动与配置修正逻辑,偶发未就绪可重跑一次即通过。
---
## 场景三构建离线安装包deployment_new
Server 与 ClientGPU 安装包均采用“版本直出”,只输出 `:<VERSION>` 标签,不改动 `:latest`
1) Server 包
```
./build/build_images.sh --only server_pkg --version 20251114
```
产出:
- 本地镜像:`argus-<模块>:20251114`(不触碰 latest
- 安装包:`deployment_new/artifact/server/20251114/``server_20251114.tar.gz`
- 包内包含:逐镜像 tar.gz、compose/.env.example、scriptsconfig/install/selfcheck/diagnose 等、docs、manifest/checksums。
2) ClientGPU 包
```
# 同步构建 GPU bundle仅 :<VERSION>,不触碰 latest并生成客户端包
./build/build_images.sh --only client_pkg --version 20251114 \\
--client-semver 1.44.0 --cuda 12.2.2
```
产出:
- 本地镜像:`argus-sys-metric-test-node-bundle-gpu:20251114`
- 安装包:`deployment_new/artifact/client_gpu/20251114/``client_gpu_20251114.tar.gz`
- 包内包含GPU bundle 镜像 tar.gz、busybox.tar、compose/.env.example、scriptsconfig/install/uninstall、docs、manifest/checksums。
说明:
- pkg 构建使用 `configs/build_user.pkg.conf` 的 UID/GID可被环境覆盖
- 包内 `.env.example``PKG_VERSION=<VERSION>` 与镜像 tag 严格一致。
---
## 常见问题FAQ
- 构建报 `failed to receive status: context canceled`
- 已内置单镜像多次重试,最后一次禁用 BuildKit建议加 `--intranet``--no-cache` 重试,或 `docker builder prune -f` 后再试。
- 先跑非 pkglatest再跑 pkgversion会不会“打错包”
- 不会。涉及 UID/GID 的层因参数变化会重建,其它层按缓存命中复用,最终 pkg 产物的属主与运行账户按 `build_user.pkg.conf` 生效。
- swarm_tests 默认拉取 `:latest`,我只构建了 `:<VERSION>` 的 CPU bundle 怎么办?
- 在运行前 `export NODE_BUNDLE_IMAGE_TAG=argus-sys-metric-test-node-bundle:<VERSION>`,或在构建时加 `--tag-latest`
---
如需进一步自动化(例如生成 BUILD_SUMMARY.txt 汇总镜像 digest 与构建参数),可在 pkg 产出阶段追加,我可以按需补齐。

View File

@ -10,45 +10,19 @@ Usage: $0 [OPTIONS]
Options:
--intranet Use intranet mirror for log/bind builds
--master-offline Build master offline image (requires src/master/offline_wheels.tar.gz)
--metric Build metric module images (ftp, prometheus, grafana, test nodes)
--no-cache Build all images without using Docker layer cache
--only LIST Comma-separated targets to build: core,master,metric,web,alert,sys,gpu_bundle,cpu_bundle,server_pkg,client_pkg,all
--version DATE Date tag used by gpu_bundle/server_pkg/client_pkg (e.g. 20251112)
--client-semver X.Y.Z Override client semver used in all-in-one-full artifact (optional)
--cuda VER CUDA runtime version for NVIDIA base (default: 12.2.2)
--tag-latest Also tag bundle image as :latest (for cpu_bundle only; default off)
-h, --help Show this help message
Examples:
$0 # Build with default sources
$0 --intranet # Build with intranet mirror
$0 --master-offline # Additionally build argus-master:offline
$0 --metric # Additionally build metric module images
$0 --intranet --master-offline --metric
$0 --intranet --master-offline
EOF
}
use_intranet=false
build_core=true
build_master=true
build_master_offline=false
build_metric=true
build_web=true
build_alert=true
build_sys=true
build_gpu_bundle=false
build_cpu_bundle=false
build_server_pkg=false
build_client_pkg=false
need_bind_image=true
need_metric_ftp=true
no_cache=false
bundle_date=""
client_semver=""
cuda_ver="12.2.2"
DEFAULT_IMAGE_TAG="latest"
tag_latest=false
while [[ $# -gt 0 ]]; do
case $1 in
@ -65,55 +39,6 @@ while [[ $# -gt 0 ]]; do
build_master_offline=true
shift
;;
--metric)
build_metric=true
shift
;;
--no-cache)
no_cache=true
shift
;;
--only)
if [[ -z ${2:-} ]]; then
echo "--only requires a target list" >&2; exit 1
fi
sel="$2"; shift 2
# reset all, then enable selected
build_core=false; build_master=false; build_metric=false; build_web=false; build_alert=false; build_sys=false; build_gpu_bundle=false; build_cpu_bundle=false; build_server_pkg=false; build_client_pkg=false
IFS=',' read -ra parts <<< "$sel"
for p in "${parts[@]}"; do
case "$p" in
core) build_core=true ;;
master) build_master=true ;;
metric) build_metric=true ;;
web) build_web=true ;;
alert) build_alert=true ;;
sys) build_sys=true ;;
gpu_bundle) build_gpu_bundle=true ;;
cpu_bundle) build_cpu_bundle=true ;;
server_pkg) build_server_pkg=true; build_core=true; build_master=true; build_metric=true; build_web=true; build_alert=true ;;
client_pkg) build_client_pkg=true ;;
all) build_core=true; build_master=true; build_metric=true; build_web=true; build_alert=true; build_sys=true ;;
*) echo "Unknown --only target: $p" >&2; exit 1 ;;
esac
done
;;
--version)
if [[ -z ${2:-} ]]; then echo "--version requires a value like 20251112" >&2; exit 1; fi
bundle_date="$2"; shift 2
;;
--client-semver)
if [[ -z ${2:-} ]]; then echo "--client-semver requires a value like 1.43.0" >&2; exit 1; fi
client_semver="$2"; shift 2
;;
--cuda)
if [[ -z ${2:-} ]]; then echo "--cuda requires a value like 12.2.2" >&2; exit 1; fi
cuda_ver="$2"; shift 2
;;
--tag-latest)
tag_latest=true
shift
;;
-h|--help)
show_help
exit 0
@ -126,11 +51,6 @@ while [[ $# -gt 0 ]]; do
esac
done
if [[ "$build_server_pkg" == true ]]; then
need_bind_image=false
need_metric_ftp=false
fi
root="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
. "$root/scripts/common/build_user.sh"
@ -142,23 +62,9 @@ fi
cd "$root"
# Set default image tag policy before building
if [[ "$build_server_pkg" == true ]]; then
DEFAULT_IMAGE_TAG="${bundle_date:-latest}"
fi
# Select build user profile for pkg vs default
if [[ "$build_server_pkg" == true || "$build_client_pkg" == true ]]; then
export ARGUS_BUILD_PROFILE=pkg
fi
load_build_user
build_args+=("--build-arg" "ARGUS_BUILD_UID=${ARGUS_BUILD_UID}" "--build-arg" "ARGUS_BUILD_GID=${ARGUS_BUILD_GID}")
if [[ "$no_cache" == true ]]; then
build_args+=("--no-cache")
fi
master_root="$root/src/master"
master_offline_tar="$master_root/offline_wheels.tar.gz"
master_offline_dir="$master_root/offline_wheels"
@ -199,459 +105,45 @@ build_image() {
local image_name=$1
local dockerfile_path=$2
local tag=$3
local context="."
shift 3
if [[ $# -gt 0 ]]; then
context=$1
shift
fi
local extra_args=("$@")
echo "🔄 Building $image_name image..."
echo " Dockerfile: $dockerfile_path"
echo " Tag: $tag"
echo " Context: $context"
local tries=${ARGUS_BUILD_RETRIES:-3}
local delay=${ARGUS_BUILD_RETRY_DELAY:-5}
local attempt=1
while (( attempt <= tries )); do
local prefix=""
if (( attempt == tries )); then
# final attempt: disable BuildKit to avoid docker/dockerfile front-end pulls
prefix="DOCKER_BUILDKIT=0"
echo " Attempt ${attempt}/${tries} (fallback: DOCKER_BUILDKIT=0)"
else
echo " Attempt ${attempt}/${tries}"
fi
if eval $prefix docker build "${build_args[@]}" "${extra_args[@]}" -f "$dockerfile_path" -t "$tag" "$context"; then
echo "$image_name image built successfully"
return 0
fi
echo "⚠️ Build failed for $image_name (attempt ${attempt}/${tries})."
if (( attempt < tries )); then
echo " Retrying in ${delay}s..."
sleep "$delay"
fi
attempt=$((attempt+1))
done
echo "❌ Failed to build $image_name image after ${tries} attempts"
return 1
}
pull_base_image() {
local image_ref=$1
local attempts=${2:-3}
local delay=${3:-5}
# If the image already exists locally, skip pulling.
if docker image inspect "$image_ref" >/dev/null 2>&1; then
echo " Local image present; skip pull: $image_ref"
if docker build "${build_args[@]}" "${extra_args[@]}" -f "$dockerfile_path" -t "$tag" .; then
echo "$image_name image built successfully"
return 0
else
echo "❌ Failed to build $image_name image"
return 1
fi
for ((i=1; i<=attempts; i++)); do
echo " Pulling base image ($i/$attempts): $image_ref"
if docker pull "$image_ref" >/dev/null; then
echo " Base image ready: $image_ref"
return 0
fi
echo " Pull failed: $image_ref"
if (( i < attempts )); then
echo " Retrying in ${delay}s..."
sleep "$delay"
fi
done
echo "❌ Unable to pull base image after ${attempts} attempts: $image_ref"
return 1
}
images_built=()
build_failed=false
build_gpu_bundle_image() {
local date_tag="$1" # e.g. 20251112
local cuda_ver_local="$2" # e.g. 12.2.2
local client_ver="$3" # semver like 1.43.0
if build_image "Elasticsearch" "src/log/elasticsearch/build/Dockerfile" "argus-elasticsearch:latest"; then
images_built+=("argus-elasticsearch:latest")
else
build_failed=true
fi
if [[ -z "$date_tag" ]]; then
echo "❌ gpu_bundle requires --version YYMMDD (e.g. 20251112)" >&2
return 1
fi
echo ""
# sanitize cuda version (trim trailing dots like '12.2.')
while [[ "$cuda_ver_local" == *"." ]]; do cuda_ver_local="${cuda_ver_local%.}"; done
if build_image "Kibana" "src/log/kibana/build/Dockerfile" "argus-kibana:latest"; then
images_built+=("argus-kibana:latest")
else
build_failed=true
fi
# Resolve effective CUDA base tag
local resolve_cuda_base_tag
resolve_cuda_base_tag() {
local want="$1" # can be 12, 12.2 or 12.2.2
local major minor patch
if [[ "$want" =~ ^([0-9]+)\.([0-9]+)\.([0-9]+)$ ]]; then
major="${BASH_REMATCH[1]}"; minor="${BASH_REMATCH[2]}"; patch="${BASH_REMATCH[3]}"
echo "nvidia/cuda:${major}.${minor}.${patch}-runtime-ubuntu22.04"; return 0
elif [[ "$want" =~ ^([0-9]+)\.([0-9]+)$ ]]; then
major="${BASH_REMATCH[1]}"; minor="${BASH_REMATCH[2]}"
# try to find best local patch for major.minor
local best
best=$(docker images --format '{{.Repository}}:{{.Tag}}' nvidia/cuda 2>/dev/null | \
grep -E "^nvidia/cuda:${major}\.${minor}\\.[0-9]+-runtime-ubuntu22\.04$" | \
sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.)([0-9]+)-runtime-ubuntu22\.04$#\1\2#g' | \
sort -V | tail -n1 || true)
if [[ -n "$best" ]]; then
echo "nvidia/cuda:${best}-runtime-ubuntu22.04"; return 0
fi
# fallback patch if none local
echo "nvidia/cuda:${major}.${minor}.2-runtime-ubuntu22.04"; return 0
elif [[ "$want" =~ ^([0-9]+)$ ]]; then
major="${BASH_REMATCH[1]}"
# try to find best local for this major
local best
best=$(docker images --format '{{.Repository}}:{{.Tag}}' nvidia/cuda 2>/dev/null | \
grep -E "^nvidia/cuda:${major}\\.[0-9]+\\.[0-9]+-runtime-ubuntu22\.04$" | \
sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.[0-9]+)-runtime-ubuntu22\.04$#\1#g' | \
sort -V | tail -n1 || true)
if [[ -n "$best" ]]; then
echo "nvidia/cuda:${best}-runtime-ubuntu22.04"; return 0
fi
echo "nvidia/cuda:${major}.2.2-runtime-ubuntu22.04"; return 0
else
# invalid format, fallback to default
echo "nvidia/cuda:12.2.2-runtime-ubuntu22.04"; return 0
fi
}
echo ""
local base_image
base_image=$(resolve_cuda_base_tag "$cuda_ver_local")
echo
echo "🔧 Preparing one-click GPU bundle build"
echo " CUDA runtime base: ${base_image}"
echo " Bundle tag : ${date_tag}"
# 1) Ensure NVIDIA base image (skip pull if local)
if ! pull_base_image "$base_image"; then
# try once more with default if resolution failed
if ! pull_base_image "nvidia/cuda:12.2.2-runtime-ubuntu22.04"; then
return 1
else
base_image="nvidia/cuda:12.2.2-runtime-ubuntu22.04"
fi
fi
# 2) Build latest argus-agent from source
echo "\n🛠 Building argus-agent from src/agent"
pushd "$root/src/agent" >/dev/null
if ! bash scripts/build_binary.sh; then
echo "❌ argus-agent build failed" >&2
popd >/dev/null
return 1
fi
if [[ ! -f "dist/argus-agent" ]]; then
echo "❌ argus-agent binary missing after build" >&2
popd >/dev/null
return 1
fi
popd >/dev/null
# 3) Inject agent into all-in-one-full plugin and package artifact
local aio_root="$root/src/metric/client-plugins/all-in-one-full"
local agent_bin_src="$root/src/agent/dist/argus-agent"
local agent_bin_dst="$aio_root/plugins/argus-agent/bin/argus-agent"
echo "\n📦 Updating all-in-one-full agent binary → $agent_bin_dst"
cp -f "$agent_bin_src" "$agent_bin_dst"
chmod +x "$agent_bin_dst" || true
pushd "$aio_root" >/dev/null
local prev_version
prev_version="$(cat config/VERSION 2>/dev/null || echo "1.0.0")"
local use_version="$prev_version"
if [[ -n "$client_semver" ]]; then
echo "${client_semver}" > config/VERSION
use_version="$client_semver"
fi
echo " Packaging all-in-one-full artifact version: $use_version"
if ! bash scripts/package_artifact.sh --force; then
echo "❌ package_artifact.sh failed" >&2
# restore VERSION if changed
if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
popd >/dev/null
return 1
fi
local artifact_dir="$aio_root/artifact/$use_version"
local artifact_tar
artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
if [[ -z "$artifact_tar" ]]; then
echo " No argus-metric_*.tar.gz found; invoking publish_artifact.sh to assemble..."
local owner="$(id -u):$(id -g)"
if ! bash scripts/publish_artifact.sh "$use_version" --output-dir "$artifact_dir" --owner "$owner"; then
echo "❌ publish_artifact.sh failed" >&2
if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
popd >/dev/null
return 1
fi
artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
fi
if [[ -z "$artifact_tar" ]]; then
echo "❌ artifact tar not found under $artifact_dir" >&2
if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
popd >/dev/null
return 1
fi
# restore VERSION if changed (keep filesystem clean)
if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
popd >/dev/null
# 4) Stage docker build context
local bundle_ctx="$root/src/bundle/gpu-node-bundle/.build-$date_tag"
echo "\n🧰 Staging docker build context: $bundle_ctx"
rm -rf "$bundle_ctx"
mkdir -p "$bundle_ctx/bundle" "$bundle_ctx/private"
cp "$root/src/bundle/gpu-node-bundle/Dockerfile" "$bundle_ctx/"
cp "$root/src/bundle/gpu-node-bundle/node-bootstrap.sh" "$bundle_ctx/"
cp "$root/src/bundle/gpu-node-bundle/health-watcher.sh" "$bundle_ctx/"
# bundle tar
cp "$artifact_tar" "$bundle_ctx/bundle/"
# offline fluent-bit assets (optional but useful)
if [[ -d "$root/src/log/fluent-bit/build/etc" ]]; then
cp -r "$root/src/log/fluent-bit/build/etc" "$bundle_ctx/private/"
fi
if [[ -d "$root/src/log/fluent-bit/build/packages" ]]; then
cp -r "$root/src/log/fluent-bit/build/packages" "$bundle_ctx/private/"
fi
if [[ -f "$root/src/log/fluent-bit/build/start-fluent-bit.sh" ]]; then
cp "$root/src/log/fluent-bit/build/start-fluent-bit.sh" "$bundle_ctx/private/"
fi
# 5) Build the final bundle image (directly from NVIDIA base)
local image_tag="argus-sys-metric-test-node-bundle-gpu:${date_tag}"
echo "\n🔄 Building GPU Bundle image"
if build_image "GPU Bundle" "$bundle_ctx/Dockerfile" "$image_tag" "$bundle_ctx" \
--build-arg CUDA_VER="$(echo "$base_image" | sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.[0-9]+)-runtime-ubuntu22\.04$#\1#')" \
--build-arg CLIENT_VER="$use_version" \
--build-arg BUNDLE_DATE="$date_tag"; then
images_built+=("$image_tag")
# In non-pkg mode, also tag latest for convenience
if [[ "${ARGUS_PKG_BUILD:-0}" != "1" ]]; then
docker tag "$image_tag" argus-sys-metric-test-node-bundle-gpu:latest >/dev/null 2>&1 || true
fi
return 0
else
return 1
fi
}
# Tag helper: ensure :<date_tag> exists for a list of repos
ensure_version_tags() {
local date_tag="$1"; shift
local repos=("$@")
for repo in "${repos[@]}"; do
if docker image inspect "$repo:$date_tag" >/dev/null 2>&1; then
:
elif docker image inspect "$repo:latest" >/dev/null 2>&1; then
docker tag "$repo:latest" "$repo:$date_tag" || true
else
echo "❌ missing image for tagging: $repo (need :latest or :$date_tag)" >&2
return 1
fi
done
return 0
}
# Build server package after images are built
build_server_pkg_bundle() {
local date_tag="$1"
if [[ -z "$date_tag" ]]; then
echo "❌ server_pkg requires --version YYMMDD" >&2
return 1
fi
local repos=(
argus-master argus-elasticsearch argus-kibana \
argus-metric-prometheus argus-metric-grafana \
argus-alertmanager argus-web-frontend argus-web-proxy
)
echo "\n🔖 Verifying server images with :$date_tag and collecting digests (Bind/FTP excluded; relying on Docker DNS aliases)"
for repo in "${repos[@]}"; do
if ! docker image inspect "$repo:$date_tag" >/dev/null 2>&1; then
echo "❌ required image missing: $repo:$date_tag (build phase should have produced it)" >&2
return 1
fi
done
# Optional: show digests
for repo in "${repos[@]}"; do
local digest
digest=$(docker images --digests --format '{{.Repository}}:{{.Tag}} {{.Digest}}' | awk -v r="$repo:$date_tag" '$1==r{print $2}' | head -n1)
printf ' • %s@%s\n' "$repo:$date_tag" "${digest:-<none>}"
done
echo "\n📦 Building server package via deployment_new/build/make_server_package.sh --version $date_tag"
if ! "$root/deployment_new/build/make_server_package.sh" --version "$date_tag"; then
echo "❌ make_server_package.sh failed" >&2
return 1
fi
return 0
}
# Build client package: ensure gpu bundle image exists, then package client_gpu
build_client_pkg_bundle() {
local date_tag="$1"
local semver="$2"
local cuda="$3"
if [[ -z "$date_tag" ]]; then
echo "❌ client_pkg requires --version YYMMDD" >&2
return 1
fi
local bundle_tag="argus-sys-metric-test-node-bundle-gpu:${date_tag}"
if ! docker image inspect "$bundle_tag" >/dev/null 2>&1; then
echo "\n🧩 GPU bundle image $bundle_tag missing; building it first..."
ARGUS_PKG_BUILD=1
export ARGUS_PKG_BUILD
if ! build_gpu_bundle_image "$date_tag" "$cuda" "$semver"; then
return 1
fi
else
echo "\n✅ Using existing GPU bundle image: $bundle_tag"
fi
echo "\n📦 Building client GPU package via deployment_new/build/make_client_gpu_package.sh --version $date_tag --image $bundle_tag"
if ! "$root/deployment_new/build/make_client_gpu_package.sh" --version "$date_tag" --image "$bundle_tag"; then
echo "❌ make_client_gpu_package.sh failed" >&2
return 1
fi
return 0
}
# Build CPU bundle image directly FROM ubuntu:22.04 (no intermediate base)
build_cpu_bundle_image() {
local date_tag="$1" # e.g. 20251113
local client_ver_in="$2" # semver like 1.43.0 (optional)
local want_tag_latest="$3" # true/false
if [[ -z "$date_tag" ]]; then
echo "❌ cpu_bundle requires --version YYMMDD" >&2
return 1
fi
echo "\n🔧 Preparing one-click CPU bundle build"
echo " Base: ubuntu:22.04"
echo " Bundle tag: ${date_tag}"
# 1) Build latest argus-agent from source
echo "\n🛠 Building argus-agent from src/agent"
pushd "$root/src/agent" >/dev/null
if ! bash scripts/build_binary.sh; then
echo "❌ argus-agent build failed" >&2
popd >/dev/null
return 1
fi
if [[ ! -f "dist/argus-agent" ]]; then
echo "❌ argus-agent binary missing after build" >&2
popd >/dev/null
return 1
fi
popd >/dev/null
# 2) Inject agent into all-in-one-full plugin and package artifact
local aio_root="$root/src/metric/client-plugins/all-in-one-full"
local agent_bin_src="$root/src/agent/dist/argus-agent"
local agent_bin_dst="$aio_root/plugins/argus-agent/bin/argus-agent"
echo "\n📦 Updating all-in-one-full agent binary → $agent_bin_dst"
cp -f "$agent_bin_src" "$agent_bin_dst"
chmod +x "$agent_bin_dst" || true
pushd "$aio_root" >/dev/null
local prev_version use_version
prev_version="$(cat config/VERSION 2>/dev/null || echo "1.0.0")"
use_version="$prev_version"
if [[ -n "$client_ver_in" ]]; then
echo "$client_ver_in" > config/VERSION
use_version="$client_ver_in"
fi
echo " Packaging all-in-one-full artifact: version=$use_version"
if ! bash scripts/package_artifact.sh --force; then
echo "❌ package_artifact.sh failed" >&2
[[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
popd >/dev/null
return 1
fi
local artifact_dir="$aio_root/artifact/$use_version"
local artifact_tar
artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
if [[ -z "$artifact_tar" ]]; then
echo " No argus-metric_*.tar.gz found; invoking publish_artifact.sh ..."
local owner="$(id -u):$(id -g)"
if ! bash scripts/publish_artifact.sh "$use_version" --output-dir "$artifact_dir" --owner "$owner"; then
echo "❌ publish_artifact.sh failed" >&2
[[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
popd >/dev/null
return 1
fi
artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
fi
[[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
popd >/dev/null
# 3) Stage docker build context
local bundle_ctx="$root/src/bundle/cpu-node-bundle/.build-$date_tag"
echo "\n🧰 Staging docker build context: $bundle_ctx"
rm -rf "$bundle_ctx"
mkdir -p "$bundle_ctx/bundle" "$bundle_ctx/private"
cp "$root/src/bundle/cpu-node-bundle/Dockerfile" "$bundle_ctx/"
cp "$root/src/bundle/cpu-node-bundle/node-bootstrap.sh" "$bundle_ctx/"
cp "$root/src/bundle/cpu-node-bundle/health-watcher.sh" "$bundle_ctx/"
# bundle tar
cp "$artifact_tar" "$bundle_ctx/bundle/"
# offline fluent-bit assets
if [[ -d "$root/src/log/fluent-bit/build/etc" ]]; then
cp -r "$root/src/log/fluent-bit/build/etc" "$bundle_ctx/private/"
fi
if [[ -d "$root/src/log/fluent-bit/build/packages" ]]; then
cp -r "$root/src/log/fluent-bit/build/packages" "$bundle_ctx/private/"
fi
if [[ -f "$root/src/log/fluent-bit/build/start-fluent-bit.sh" ]]; then
cp "$root/src/log/fluent-bit/build/start-fluent-bit.sh" "$bundle_ctx/private/"
fi
# 4) Build final bundle image
local image_tag="argus-sys-metric-test-node-bundle:${date_tag}"
echo "\n🔄 Building CPU Bundle image"
if build_image "CPU Bundle" "$bundle_ctx/Dockerfile" "$image_tag" "$bundle_ctx"; then
images_built+=("$image_tag")
if [[ "$want_tag_latest" == "true" ]]; then
docker tag "$image_tag" argus-sys-metric-test-node-bundle:latest >/dev/null 2>&1 || true
fi
return 0
else
return 1
fi
}
if [[ "$build_core" == true ]]; then
if build_image "Elasticsearch" "src/log/elasticsearch/build/Dockerfile" "argus-elasticsearch:${DEFAULT_IMAGE_TAG}"; then
images_built+=("argus-elasticsearch:${DEFAULT_IMAGE_TAG}")
else
build_failed=true
fi
echo ""
if build_image "Kibana" "src/log/kibana/build/Dockerfile" "argus-kibana:${DEFAULT_IMAGE_TAG}"; then
images_built+=("argus-kibana:${DEFAULT_IMAGE_TAG}")
else
build_failed=true
fi
echo ""
if [[ "$need_bind_image" == true ]]; then
if build_image "BIND9" "src/bind/build/Dockerfile" "argus-bind9:${DEFAULT_IMAGE_TAG}"; then
images_built+=("argus-bind9:${DEFAULT_IMAGE_TAG}")
else
build_failed=true
fi
fi
if build_image "BIND9" "src/bind/build/Dockerfile" "argus-bind9:latest"; then
images_built+=("argus-bind9:latest")
else
build_failed=true
fi
echo ""
@ -660,21 +152,18 @@ if [[ "$build_master" == true ]]; then
echo ""
echo "🔄 Building Master image..."
pushd "$master_root" >/dev/null
master_args=("--tag" "argus-master:${DEFAULT_IMAGE_TAG}")
master_args=("--tag" "argus-master:latest")
if [[ "$use_intranet" == true ]]; then
master_args+=("--intranet")
fi
if [[ "$build_master_offline" == true ]]; then
master_args+=("--offline")
fi
if [[ "$no_cache" == true ]]; then
master_args+=("--no-cache")
fi
if ./scripts/build_images.sh "${master_args[@]}"; then
if [[ "$build_master_offline" == true ]]; then
images_built+=("argus-master:offline")
else
images_built+=("argus-master:${DEFAULT_IMAGE_TAG}")
images_built+=("argus-master:latest")
fi
else
build_failed=true
@ -682,176 +171,6 @@ if [[ "$build_master" == true ]]; then
popd >/dev/null
fi
if [[ "$build_metric" == true ]]; then
echo ""
echo "Building Metric module images..."
metric_base_images=(
"ubuntu/prometheus:3-24.04_stable"
"grafana/grafana:11.1.0"
)
if [[ "$need_metric_ftp" == true ]]; then
metric_base_images+=("ubuntu:22.04")
fi
for base_image in "${metric_base_images[@]}"; do
if ! pull_base_image "$base_image"; then
build_failed=true
fi
done
metric_builds=()
if [[ "$need_metric_ftp" == true ]]; then
metric_builds+=("Metric FTP|src/metric/ftp/build/Dockerfile|argus-metric-ftp:${DEFAULT_IMAGE_TAG}|src/metric/ftp/build")
fi
metric_builds+=(
"Metric Prometheus|src/metric/prometheus/build/Dockerfile|argus-metric-prometheus:${DEFAULT_IMAGE_TAG}|src/metric/prometheus/build"
"Metric Grafana|src/metric/grafana/build/Dockerfile|argus-metric-grafana:${DEFAULT_IMAGE_TAG}|src/metric/grafana/build"
)
for build_spec in "${metric_builds[@]}"; do
IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
images_built+=("$image_tag")
else
build_failed=true
fi
echo ""
done
fi
# =======================================
# Sys (system tests) node images
# =======================================
if [[ "$build_sys" == true ]]; then
echo ""
echo "Building Sys node images..."
sys_base_images=(
"ubuntu:22.04"
"nvidia/cuda:12.2.2-runtime-ubuntu22.04"
)
for base_image in "${sys_base_images[@]}"; do
if ! pull_base_image "$base_image"; then
build_failed=true
fi
done
sys_builds=(
"Sys Node|src/sys/build/node/Dockerfile|argus-sys-node:latest|."
"Sys Metric Test Node|src/sys/build/test-node/Dockerfile|argus-sys-metric-test-node:latest|."
"Sys Metric Test GPU Node|src/sys/build/test-gpu-node/Dockerfile|argus-sys-metric-test-gpu-node:latest|."
)
for build_spec in "${sys_builds[@]}"; do
IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
images_built+=("$image_tag")
else
build_failed=true
fi
echo ""
done
fi
# =======================================
# Web & Alert module images
# =======================================
if [[ "$build_web" == true || "$build_alert" == true ]]; then
echo ""
echo "Building Web and Alert module images..."
# Pre-pull commonly used base images for stability
web_alert_base_images=(
"node:20"
"ubuntu:24.04"
)
for base_image in "${web_alert_base_images[@]}"; do
if ! pull_base_image "$base_image"; then
build_failed=true
fi
done
if [[ "$build_web" == true ]]; then
web_builds=(
"Web Frontend|src/web/build_tools/frontend/Dockerfile|argus-web-frontend:${DEFAULT_IMAGE_TAG}|."
"Web Proxy|src/web/build_tools/proxy/Dockerfile|argus-web-proxy:${DEFAULT_IMAGE_TAG}|."
)
for build_spec in "${web_builds[@]}"; do
IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
images_built+=("$image_tag")
else
build_failed=true
fi
echo ""
done
fi
if [[ "$build_alert" == true ]]; then
alert_builds=(
"Alertmanager|src/alert/alertmanager/build/Dockerfile|argus-alertmanager:${DEFAULT_IMAGE_TAG}|."
)
for build_spec in "${alert_builds[@]}"; do
IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
images_built+=("$image_tag")
else
build_failed=true
fi
echo ""
done
fi
fi
# =======================================
# One-click GPU bundle (direct NVIDIA base)
# =======================================
if [[ "$build_gpu_bundle" == true ]]; then
echo ""
echo "Building one-click GPU bundle image..."
if ! build_gpu_bundle_image "$bundle_date" "$cuda_ver" "$client_semver"; then
build_failed=true
fi
fi
# =======================================
# One-click CPU bundle (from ubuntu:22.04)
# =======================================
if [[ "$build_cpu_bundle" == true ]]; then
echo ""
echo "Building one-click CPU bundle image..."
if ! build_cpu_bundle_image "${bundle_date}" "${client_semver}" "${tag_latest}"; then
build_failed=true
fi
fi
# =======================================
# One-click Server/Client packaging
# =======================================
if [[ "$build_server_pkg" == true ]]; then
echo ""
echo "🧳 Building one-click Server package..."
if ! build_server_pkg_bundle "${bundle_date}"; then
build_failed=true
fi
fi
if [[ "$build_client_pkg" == true ]]; then
echo ""
echo "🧳 Building one-click Client-GPU package..."
if ! build_client_pkg_bundle "${bundle_date}" "${client_semver}" "${cuda_ver}"; then
build_failed=true
fi
fi
echo "======================================="
echo "📦 Build Summary"
echo "======================================="
@ -878,6 +197,7 @@ if [[ "$build_master_offline" == true ]]; then
echo ""
echo "🧳 Master offline wheels 已解压到 $master_offline_dir"
fi
echo ""
echo "🚀 Next steps:"
echo " ./build/save_images.sh --compress # 导出镜像"

View File

@ -68,12 +68,6 @@ declare -A images=(
["argus-kibana:latest"]="argus-kibana-latest.tar"
["argus-bind9:latest"]="argus-bind9-latest.tar"
["argus-master:offline"]="argus-master-offline.tar"
["argus-metric-ftp:latest"]="argus-metric-ftp-latest.tar"
["argus-metric-prometheus:latest"]="argus-metric-prometheus-latest.tar"
["argus-metric-grafana:latest"]="argus-metric-grafana-latest.tar"
["argus-web-frontend:latest"]="argus-web-frontend-latest.tar"
["argus-web-proxy:latest"]="argus-web-proxy-latest.tar"
["argus-alertmanager:latest"]="argus-alertmanager-latest.tar"
)
# 函数:检查镜像是否存在
@ -226,4 +220,4 @@ fi
echo ""
echo "✅ Image export completed successfully!"
echo ""
echo ""

View File

@ -1,6 +0,0 @@
# Default build-time UID/GID for Argus images
# Override by creating configs/build_user.local.conf with the same format.
# Syntax: KEY=VALUE, supports UID/GID only. Whitespace and lines starting with # are ignored.
UID=2133
GID=2015

View File

@ -1 +0,0 @@
artifact/

View File

@ -1,14 +0,0 @@
# deployment_new
本目录用于新的部署打包与交付实现(不影响既有 `deployment/`)。
里程碑 M1当前实现
- `build/make_server_package.sh`:生成 Server 包(逐服务镜像 tar.gz、compose、.env.example、docs、private 骨架、manifest/checksums、打包 tar.gz
- `build/make_client_gpu_package.sh`:生成 ClientGPU 包GPU bundle 镜像 tar.gz、busybox.tar、compose、.env.example、docs、private 骨架、manifest/checksums、打包 tar.gz
模板
- `templates/server/compose/docker-compose.yml`:部署专用,镜像默认使用 `:${PKG_VERSION}` 版本 tag可通过 `.env` 覆盖。
- `templates/client_gpu/compose/docker-compose.yml`GPU 节点专用,使用 `:${PKG_VERSION}` 版本 tag。
注意M1 仅产出安装包,不包含安装脚本落地;安装/运维脚本将在 M2 落地并纳入包内。

View File

@ -1,33 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
log() { echo -e "\033[0;34m[INFO]\033[0m $*"; }
warn() { echo -e "\033[1;33m[WARN]\033[0m $*"; }
err() { echo -e "\033[0;31m[ERR ]\033[0m $*" >&2; }
require_cmd() {
local miss=0
for c in "$@"; do
if ! command -v "$c" >/dev/null 2>&1; then err "missing command: $c"; miss=1; fi
done
[[ $miss -eq 0 ]]
}
today_version() { date +%Y%m%d; }
checksum_dir() {
local dir="$1"; local out="$2"; : > "$out";
(cd "$dir" && find . -type f -print0 | sort -z | xargs -0 sha256sum) >> "$out"
}
make_dir() { mkdir -p "$1"; }
copy_tree() {
local src="$1" dst="$2"; rsync -a --delete "$src/" "$dst/" 2>/dev/null || cp -r "$src/." "$dst/";
}
gen_manifest() {
local root="$1"; local out="$2"; : > "$out";
(cd "$root" && find . -maxdepth 4 -type f -printf "%p\n" | sort) >> "$out"
}

View File

@ -1,131 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# Make client GPU package (versioned gpu bundle image, compose, env, docs, busybox)
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
TEMPL_DIR="$ROOT_DIR/deployment_new/templates/client_gpu"
ART_ROOT="$ROOT_DIR/deployment_new/artifact/client_gpu"
# Use deployment_new local common helpers
COMMON_SH="$ROOT_DIR/deployment_new/build/common.sh"
. "$COMMON_SH"
usage(){ cat <<EOF
Build Client-GPU Package (deployment_new)
Usage: $(basename "$0") --version YYYYMMDD [--image IMAGE[:TAG]]
Defaults:
image = argus-sys-metric-test-node-bundle-gpu:latest
Outputs: deployment_new/artifact/client_gpu/<YYYYMMDD>/ and client_gpu_YYYYMMDD.tar.gz
EOF
}
VERSION=""
IMAGE="argus-sys-metric-test-node-bundle-gpu:latest"
while [[ $# -gt 0 ]]; do
case "$1" in
--version) VERSION="$2"; shift 2;;
--image) IMAGE="$2"; shift 2;;
-h|--help) usage; exit 0;;
*) err "unknown arg: $1"; usage; exit 1;;
esac
done
if [[ -z "$VERSION" ]]; then VERSION="$(today_version)"; fi
require_cmd docker tar gzip
STAGE="$(mktemp -d)"; trap 'rm -rf "$STAGE"' EXIT
PKG_DIR="$ART_ROOT/$VERSION"
mkdir -p "$PKG_DIR" "$STAGE/images" "$STAGE/compose" "$STAGE/docs" "$STAGE/scripts" "$STAGE/private/argus"
# 1) Save GPU bundle image with version tag
if ! docker image inspect "$IMAGE" >/dev/null 2>&1; then
err "missing image: $IMAGE"; exit 1; fi
REPO="${IMAGE%%:*}"; TAG_VER="$REPO:$VERSION"
docker tag "$IMAGE" "$TAG_VER"
out_tar="$STAGE/images/${REPO//\//-}-$VERSION.tar"
docker save -o "$out_tar" "$TAG_VER"
gzip -f "$out_tar"
# 2) Busybox tar for connectivity/overlay warmup (prefer local template; fallback to docker save)
BB_SRC="$TEMPL_DIR/images/busybox.tar"
if [[ -f "$BB_SRC" ]]; then
cp "$BB_SRC" "$STAGE/images/busybox.tar"
else
if docker image inspect busybox:latest >/dev/null 2>&1 || docker pull busybox:latest >/dev/null 2>&1; then
docker save -o "$STAGE/images/busybox.tar" busybox:latest
log "Included busybox from local docker daemon"
else
warn "busybox image not found and cannot pull; skipping busybox.tar"
fi
fi
# 3) Compose + env template and docs/scripts from templates
cp "$TEMPL_DIR/compose/docker-compose.yml" "$STAGE/compose/docker-compose.yml"
ENV_EX="$STAGE/compose/.env.example"
cat >"$ENV_EX" <<EOF
# Generated by make_client_gpu_package.sh
PKG_VERSION=$VERSION
NODE_GPU_BUNDLE_IMAGE_TAG=${REPO}:${VERSION}
# Compose project name (isolation from server stack)
COMPOSE_PROJECT_NAME=argus-client
# Required (no defaults). Must be filled before install.
AGENT_ENV=
AGENT_USER=
AGENT_INSTANCE=
GPU_NODE_HOSTNAME=
# Overlay network (should match server包 overlay)
ARGUS_OVERLAY_NET=argus-sys-net
# From cluster-info.env (server package output)
SWARM_MANAGER_ADDR=
SWARM_JOIN_TOKEN_WORKER=
SWARM_JOIN_TOKEN_MANAGER=
EOF
# 4) Docs from deployment_new templates
CLIENT_DOC_SRC="$TEMPL_DIR/docs"
if [[ -d "$CLIENT_DOC_SRC" ]]; then
rsync -a "$CLIENT_DOC_SRC/" "$STAGE/docs/" >/dev/null 2>&1 || cp -r "$CLIENT_DOC_SRC/." "$STAGE/docs/"
fi
# Placeholder scripts (will be implemented in M2)
cat >"$STAGE/scripts/README.md" <<'EOF'
# Client-GPU Scripts (Placeholder)
本目录将在 M2 引入:
- config.sh / install.sh
当前为占位,便于包结构审阅。
EOF
# 5) Scripts (from deployment_new templates) and Private skeleton
SCRIPTS_SRC="$TEMPL_DIR/scripts"
if [[ -d "$SCRIPTS_SRC" ]]; then
rsync -a "$SCRIPTS_SRC/" "$STAGE/scripts/" >/dev/null 2>&1 || cp -r "$SCRIPTS_SRC/." "$STAGE/scripts/"
find "$STAGE/scripts" -type f -name '*.sh' -exec chmod +x {} + 2>/dev/null || true
fi
mkdir -p "$STAGE/private/argus/agent"
# 6) Manifest & checksums
gen_manifest "$STAGE" "$STAGE/manifest.txt"
checksum_dir "$STAGE" "$STAGE/checksums.txt"
# 7) Move to artifact dir and pack
mkdir -p "$PKG_DIR"
rsync -a "$STAGE/" "$PKG_DIR/" >/dev/null 2>&1 || cp -r "$STAGE/." "$PKG_DIR/"
OUT_TAR_DIR="$(dirname "$PKG_DIR")"
OUT_TAR="$OUT_TAR_DIR/client_gpu_${VERSION}.tar.gz"
log "Creating tarball: $OUT_TAR"
(cd "$PKG_DIR/.." && tar -czf "$OUT_TAR" "$(basename "$PKG_DIR")")
log "Client-GPU package ready: $PKG_DIR"
echo "$OUT_TAR"

View File

@ -1,160 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# Make server deployment package (versioned, per-image tars, full compose, docs, skeleton)
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
TEMPL_DIR="$ROOT_DIR/deployment_new/templates/server"
ART_ROOT="$ROOT_DIR/deployment_new/artifact/server"
# Use deployment_new local common helpers
COMMON_SH="$ROOT_DIR/deployment_new/build/common.sh"
. "$COMMON_SH"
usage(){ cat <<EOF
Build Server Deployment Package (deployment_new)
Usage: $(basename "$0") --version YYYYMMDD
Outputs: deployment_new/artifact/server/<YYYYMMDD>/ and server_YYYYMMDD.tar.gz
EOF
}
VERSION=""
while [[ $# -gt 0 ]]; do
case "$1" in
--version) VERSION="$2"; shift 2;;
-h|--help) usage; exit 0;;
*) err "unknown arg: $1"; usage; exit 1;;
esac
done
if [[ -z "$VERSION" ]]; then VERSION="$(today_version)"; fi
require_cmd docker tar gzip awk sed
IMAGES=(
argus-master
argus-elasticsearch
argus-kibana
argus-metric-prometheus
argus-metric-grafana
argus-alertmanager
argus-web-frontend
argus-web-proxy
)
STAGE="$(mktemp -d)"; trap 'rm -rf "$STAGE"' EXIT
PKG_DIR="$ART_ROOT/$VERSION"
mkdir -p "$PKG_DIR" "$STAGE/images" "$STAGE/compose" "$STAGE/docs" "$STAGE/scripts" "$STAGE/private/argus"
# 1) Save per-image tars with version tag
log "Tagging and saving images (version=$VERSION)"
for repo in "${IMAGES[@]}"; do
if ! docker image inspect "$repo:latest" >/dev/null 2>&1 && ! docker image inspect "$repo:$VERSION" >/dev/null 2>&1; then
err "missing image: $repo (need :latest or :$VERSION)"; exit 1; fi
if docker image inspect "$repo:$VERSION" >/dev/null 2>&1; then
tag="$repo:$VERSION"
else
docker tag "$repo:latest" "$repo:$VERSION"
tag="$repo:$VERSION"
fi
out_tar="$STAGE/images/${repo//\//-}-$VERSION.tar"
docker save -o "$out_tar" "$tag"
gzip -f "$out_tar"
done
# 2) Compose + env template
cp "$TEMPL_DIR/compose/docker-compose.yml" "$STAGE/compose/docker-compose.yml"
ENV_EX="$STAGE/compose/.env.example"
cat >"$ENV_EX" <<EOF
# Generated by make_server_package.sh
PKG_VERSION=$VERSION
# Image tags (can be overridden). Default to versioned tags
MASTER_IMAGE_TAG=argus-master:
ES_IMAGE_TAG=argus-elasticsearch:
KIBANA_IMAGE_TAG=argus-kibana:
PROM_IMAGE_TAG=argus-metric-prometheus:
GRAFANA_IMAGE_TAG=argus-metric-grafana:
ALERT_IMAGE_TAG=argus-alertmanager:
FRONT_IMAGE_TAG=argus-web-frontend:
WEB_PROXY_IMAGE_TAG=argus-web-proxy:
EOF
sed -i "s#:\$#:${VERSION}#g" "$ENV_EX"
# Ports and defaults (based on swarm_tests .env.example)
cat >>"$ENV_EX" <<'EOF'
# Host ports for server compose
MASTER_PORT=32300
ES_HTTP_PORT=9200
KIBANA_PORT=5601
PROMETHEUS_PORT=9090
GRAFANA_PORT=3000
ALERTMANAGER_PORT=9093
WEB_PROXY_PORT_8080=8080
WEB_PROXY_PORT_8081=8081
WEB_PROXY_PORT_8082=8082
WEB_PROXY_PORT_8083=8083
WEB_PROXY_PORT_8084=8084
WEB_PROXY_PORT_8085=8085
# Overlay network name
ARGUS_OVERLAY_NET=argus-sys-net
# UID/GID for volume ownership
ARGUS_BUILD_UID=2133
ARGUS_BUILD_GID=2015
# Compose project name (isolation from other stacks on same host)
COMPOSE_PROJECT_NAME=argus-server
EOF
# 3) Docs (from deployment_new templates)
DOCS_SRC="$TEMPL_DIR/docs"
if [[ -d "$DOCS_SRC" ]]; then
rsync -a "$DOCS_SRC/" "$STAGE/docs/" >/dev/null 2>&1 || cp -r "$DOCS_SRC/." "$STAGE/docs/"
fi
# 6) Scripts (from deployment_new templates)
SCRIPTS_SRC="$TEMPL_DIR/scripts"
if [[ -d "$SCRIPTS_SRC" ]]; then
rsync -a "$SCRIPTS_SRC/" "$STAGE/scripts/" >/dev/null 2>&1 || cp -r "$SCRIPTS_SRC/." "$STAGE/scripts/"
find "$STAGE/scripts" -type f -name '*.sh' -exec chmod +x {} + 2>/dev/null || true
fi
# 4) Private skeleton (minimum)
mkdir -p \
"$STAGE/private/argus/etc" \
"$STAGE/private/argus/master" \
"$STAGE/private/argus/metric/prometheus" \
"$STAGE/private/argus/metric/prometheus/data" \
"$STAGE/private/argus/metric/prometheus/rules" \
"$STAGE/private/argus/metric/prometheus/targets" \
"$STAGE/private/argus/metric/grafana" \
"$STAGE/private/argus/metric/grafana/data" \
"$STAGE/private/argus/metric/grafana/logs" \
"$STAGE/private/argus/metric/grafana/plugins" \
"$STAGE/private/argus/metric/grafana/provisioning/datasources" \
"$STAGE/private/argus/metric/grafana/provisioning/dashboards" \
"$STAGE/private/argus/metric/grafana/data/sessions" \
"$STAGE/private/argus/metric/grafana/data/dashboards" \
"$STAGE/private/argus/metric/grafana/config" \
"$STAGE/private/argus/alert/alertmanager" \
"$STAGE/private/argus/log/elasticsearch" \
"$STAGE/private/argus/log/kibana"
# 7) Manifest & checksums
gen_manifest "$STAGE" "$STAGE/manifest.txt"
checksum_dir "$STAGE" "$STAGE/checksums.txt"
# 8) Move to artifact dir and pack
mkdir -p "$PKG_DIR"
rsync -a "$STAGE/" "$PKG_DIR/" >/dev/null 2>&1 || cp -r "$STAGE/." "$PKG_DIR/"
OUT_TAR_DIR="$(dirname "$PKG_DIR")"
OUT_TAR="$OUT_TAR_DIR/server_${VERSION}.tar.gz"
log "Creating tarball: $OUT_TAR"
(cd "$PKG_DIR/.." && tar -czf "$OUT_TAR" "$(basename "$PKG_DIR")")
log "Server package ready: $PKG_DIR"
echo "$OUT_TAR"

View File

@ -1,38 +0,0 @@
version: "3.8"
networks:
argus-sys-net:
external: true
services:
metric-gpu-node:
image: ${NODE_GPU_BUNDLE_IMAGE_TAG:-argus-sys-metric-test-node-bundle-gpu:${PKG_VERSION}}
container_name: argus-metric-gpu-node-swarm
hostname: ${GPU_NODE_HOSTNAME}
restart: unless-stopped
privileged: true
runtime: nvidia
environment:
- TZ=Asia/Shanghai
- DEBIAN_FRONTEND=noninteractive
- MASTER_ENDPOINT=${MASTER_ENDPOINT:-http://master.argus.com:3000}
# Fluent Bit / 日志上报目标(固定域名)
- ES_HOST=es.log.argus.com
- ES_PORT=9200
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
- AGENT_ENV=${AGENT_ENV}
- AGENT_USER=${AGENT_USER}
- AGENT_INSTANCE=${AGENT_INSTANCE}
- NVIDIA_VISIBLE_DEVICES=all
- NVIDIA_DRIVER_CAPABILITIES=compute,utility
- GPU_MODE=gpu
networks:
argus-sys-net:
aliases:
- ${AGENT_INSTANCE}.node.argus.com
volumes:
- ../private/argus/agent:/private/argus/agent
- ../logs/infer:/logs/infer
- ../logs/train:/logs/train
command: ["sleep", "infinity"]

View File

@ -1,73 +0,0 @@
# Argus ClientGPU 安装指南deployment_new
## 一、准备条件(开始前确认)
- GPU 节点安装了 NVIDIA 驱动,`nvidia-smi` 正常;
- Docker & Docker Compose v2 已安装;
- 使用统一账户 `argus`UID=2133GID=2015执行安装并加入 `docker` 组(如已创建可跳过):
```bash
sudo groupadd --gid 2015 argus || true
sudo useradd --uid 2133 --gid 2015 --create-home --shell /bin/bash argus || true
sudo passwd argus
sudo usermod -aG docker argus
su - argus -c 'id; docker ps >/dev/null && echo OK || echo NO_DOCKER_PERMISSION'
```
后续解压与执行config/install/uninstall均使用 `argus` 账户进行。
- 从 Server 安装方拿到 `cluster-info.env`(包含 `SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_*`compose 架构下 BINDIP/FTPIP 不再使用)。
## 二、解包
- `tar -xzf client_gpu_YYYYMMDD.tar.gz`
- 进入目录:`cd client_gpu_YYYYMMDD/`
- 你应当看到:`images/`GPU bundle、busybox`compose/``scripts/``docs/`
## 三、配置 config预热 overlay + 生成 .env
命令:
```
cp /path/to/cluster-info.env ./ # 或 export CLUSTER_INFO=/abs/path/cluster-info.env
./scripts/config.sh
```
脚本做了什么:
- 读取 `cluster-info.env``docker swarm join`(幂等);
- 自动用 busybox 预热 external overlay `argus-sys-net`,等待最多 60s 直到本机可见;
- 生成/更新 `compose/.env`:填入 `SWARM_*`,并“保留你已填写的 AGENT_* 与 GPU_NODE_HOSTNAME”不会覆盖
看到什么才算成功:
- 终端输出类似:`已预热 overlay=argus-sys-net 并生成 compose/.env可执行 scripts/install.sh`
- `compose/.env` 至少包含:
- `AGENT_ENV/AGENT_USER/AGENT_INSTANCE/GPU_NODE_HOSTNAME`(需要你提前填写);
- `SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_*`
- `NODE_GPU_BUNDLE_IMAGE_TAG=...:YYYYMMDD`
### 日志映射(重要)
- 容器内 `/logs/infer``/logs/train` 已映射到包根 `./logs/infer``./logs/train`
- 你可以直接在宿主机查看推理/训练日志:`tail -f logs/infer/*.log``tail -f logs/train/*.log`
- install 脚本会自动创建这两个目录。
若提示缺少必填项:
- 打开 `compose/.env` 按提示补齐 `AGENT_*``GPU_NODE_HOSTNAME`,再次执行 `./scripts/config.sh`(脚本不会覆盖你已填的值)。
## 四、安装 install加载镜像 + 起容器 + 跟日志)
命令:
```
./scripts/install.sh
```
脚本做了什么:
- 如有必要,先自动预热 overlay
- 从 `images/` 导入 `argus-sys-metric-test-node-bundle-gpu-*.tar.gz` 到本地 Docker
- `docker compose up -d` 启动 GPU 节点容器,并自动执行 `docker logs -f argus-metric-gpu-node-swarm` 跟踪安装过程。
看到什么才算成功:
- 日志中出现:`[BOOT] local bundle install OK: version=...` / `dcgm-exporter ... listening` / `node state present: /private/argus/agent/<hostname>/node.json`
- `docker exec argus-metric-gpu-node-swarm nvidia-smi -L` 能列出 GPU
- 在 Server 侧 Prometheus `/api/v1/targets`GPU 节点 9100node-exporter与 9400dcgm-exporter至少其一 up。
## 五、卸载 uninstall
命令:
```
./scripts/uninstall.sh
```
行为Compose down如有 .env并删除 warmup 容器与节点容器。
## 六、常见问题
- `本机未看到 overlay`config/install 已自动预热;若仍失败,请检查与 manager 的网络连通性以及 manager 上是否已创建 `argus-sys-net`
- `busybox 缺失`:确保包根 `images/busybox.tar` 在,或主机已有 `busybox:latest`
- `加入 Swarm 失败`:确认 `cluster-info.env``SWARM_MANAGER_ADDR``SWARM_JOIN_TOKEN_WORKER` 正确,或在 manager 上重新 `docker swarm join-token -q worker` 后更新该文件。

View File

@ -1,90 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_EX="$PKG_ROOT/compose/.env.example"
ENV_OUT="$PKG_ROOT/compose/.env"
info(){ echo -e "\033[34m[CONFIG-GPU]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "缺少依赖: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
# Compose 检测:优先 docker composev2回退 docker-composev1
require_compose(){
if docker compose version >/dev/null 2>&1; then return 0; fi
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
err "未检测到 Docker Compose请安装 docker compose v2 或 docker-compose v1"; exit 1
}
require docker curl jq awk sed tar gzip
require_compose
# 磁盘空间检查MB
check_disk(){ local p="$1"; local need=10240; local free
free=$(df -Pm "$p" | awk 'NR==2{print $4+0}')
if [[ -z "$free" || "$free" -lt "$need" ]]; then err "磁盘空间不足: $p 剩余 ${free:-0}MB (<${need}MB)"; return 1; fi
}
check_disk "$PKG_ROOT"; check_disk "/var/lib/docker" || true
# 导入 cluster-info.env默认取当前包根也可用 CLUSTER_INFO 指定路径)
CI_IN="${CLUSTER_INFO:-$PKG_ROOT/cluster-info.env}"
info "读取 cluster-info.env: $CI_IN"
[[ -f "$CI_IN" ]] || { err "找不到 cluster-info.env默认当前包根或设置环境变量 CLUSTER_INFO 指定绝对路径)"; exit 1; }
set -a; source "$CI_IN"; set +a
[[ -n "${SWARM_MANAGER_ADDR:-}" && -n "${SWARM_JOIN_TOKEN_WORKER:-}" ]] || { err "cluster-info.env 缺少 SWARM 信息SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_WORKER"; exit 1; }
# 加入 Swarm幂等
info "加入 Swarm幂等$SWARM_MANAGER_ADDR"
docker swarm join --token "$SWARM_JOIN_TOKEN_WORKER" "$SWARM_MANAGER_ADDR":2377 >/dev/null 2>&1 || true
# 导入 busybox 并做 overlay 预热与连通性(总是执行)
NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
# 准备 busybox
if ! docker image inspect busybox:latest >/dev/null 2>&1; then
if [[ -f "$PKG_ROOT/images/busybox.tar" ]]; then
info "加载 busybox.tar 以预热 overlay"
docker load -i "$PKG_ROOT/images/busybox.tar" >/dev/null
else
err "缺少 busybox 镜像(包内 images/busybox.tar 或本地 busybox:latest无法预热 overlay $NET_NAME"; exit 1
fi
fi
# 预热容器worker 侧加入 overlay 以便本地可见)
docker rm -f argus-net-warmup >/dev/null 2>&1 || true
info "启动 warmup 容器加入 overlay: $NET_NAME"
docker run -d --rm --name argus-net-warmup --network "$NET_NAME" busybox:latest sleep 600 >/dev/null 2>&1 || true
for i in {1..60}; do docker network inspect "$NET_NAME" >/dev/null 2>&1 && { info "overlay 可见 (t=${i}s)"; break; }; sleep 1; done
docker network inspect "$NET_NAME" >/dev/null 2>&1 || { err "预热后仍未看到 overlay: $NET_NAME;请确认 manager 已创建并网络可达"; exit 1; }
# 通过 warmup 容器测试实际数据通路alias → master
if ! docker exec argus-net-warmup sh -lc "ping -c 1 -W 2 master.argus.com >/dev/null 2>&1"; then
err "warmup 容器内无法通过别名访问 master.argus.com请确认 server compose 已启动并加入 overlay $NET_NAME"
exit 1
fi
info "warmup 容器内可达 master.argus.comDocker DNS + alias 正常)"
# 生成/更新 .env保留人工填写项不覆盖已有键
if [[ ! -f "$ENV_OUT" ]]; then
cp "$ENV_EX" "$ENV_OUT"
fi
set_kv(){ local k="$1" v="$2"; if grep -q "^${k}=" "$ENV_OUT"; then sed -i -E "s#^${k}=.*#${k}=${v}#" "$ENV_OUT"; else echo "${k}=${v}" >> "$ENV_OUT"; fi }
set_kv SWARM_MANAGER_ADDR "${SWARM_MANAGER_ADDR:-}"
set_kv SWARM_JOIN_TOKEN_WORKER "${SWARM_JOIN_TOKEN_WORKER:-}"
set_kv SWARM_JOIN_TOKEN_MANAGER "${SWARM_JOIN_TOKEN_MANAGER:-}"
REQ_VARS=(AGENT_ENV AGENT_USER AGENT_INSTANCE GPU_NODE_HOSTNAME)
missing=()
for v in "${REQ_VARS[@]}"; do
val=$(grep -E "^$v=" "$ENV_OUT" | head -1 | cut -d= -f2-)
if [[ -z "$val" ]]; then missing+=("$v"); fi
done
if [[ ${#missing[@]} -gt 0 ]]; then
err "以下变量必须在 compose/.env 中填写:${missing[*]}(已保留你现有的内容,不会被覆盖)"; exit 1; fi
info "已生成 compose/.env可执行 scripts/install.sh"
# 准备并赋权宿主日志目录(幂等,便于安装前人工检查/预创建)
mkdir -p "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer"
chmod 1777 "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer" || true
info "日志目录权限(期待 1777含粘滞位:"
stat -c '%a %U:%G %n' "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer" 2>/dev/null || true

View File

@ -1,72 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
info(){ echo -e "\033[34m[INSTALL-GPU]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "缺少依赖: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
# Compose 检测:优先 docker composev2回退 docker-composev1
require_compose(){
if docker compose version >/dev/null 2>&1; then return 0; fi
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
err "未检测到 Docker Compose请安装 docker compose v2 或 docker-compose v1"; exit 1
}
require docker nvidia-smi
require_compose
[[ -f "$ENV_FILE" ]] || { err "缺少 compose/.env请先运行 scripts/config.sh"; exit 1; }
info "使用环境文件: $ENV_FILE"
# 预热 overlay当 config 执行很久之前或容器已被清理时warmup 可能不存在)
set -a; source "$ENV_FILE"; set +a
NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
info "检查 overlay 网络可见性: $NET_NAME"
if ! docker network inspect "$NET_NAME" >/dev/null 2>&1; then
# 如 Overlay 不可见,尝试用 busybox 预热(仅为确保 worker 节点已加入 overlay
if ! docker image inspect busybox:latest >/dev/null 2>&1; then
if [[ -f "$PKG_ROOT/images/busybox.tar" ]]; then docker load -i "$PKG_ROOT/images/busybox.tar"; else err "缺少 busybox 镜像images/busybox.tar 或本地 busybox:latest"; exit 1; fi
fi
docker rm -f argus-net-warmup >/dev/null 2>&1 || true
docker run -d --rm --name argus-net-warmup --network "$NET_NAME" busybox:latest sleep 600 >/dev/null 2>&1 || true
for i in {1..60}; do docker network inspect "$NET_NAME" >/dev/null 2>&1 && break; sleep 1; done
docker network inspect "$NET_NAME" >/dev/null 2>&1 || { err "预热后仍未看到 overlay: $NET_NAME;请确认 manager 已创建并网络可达"; exit 1; }
info "overlay 已可见warmup=argus-net-warmup"
fi
# 若本函数内重新创建了 warmup 容器,同样测试一次 alias 数据通路
if docker ps --format '{{.Names}}' | grep -q '^argus-net-warmup$'; then
if ! docker exec argus-net-warmup sh -lc "ping -c 1 -W 2 master.argus.com >/dev/null 2>&1"; then
err "GPU install 阶段warmup 容器内无法通过别名访问 master.argus.com请检查 overlay $NET_NAME 与 server 状态"
exit 1
fi
info "GPU install 阶段warmup 容器内可达 master.argus.com"
fi
# 导入 GPU bundle 镜像
IMG_TGZ=$(ls -1 "$PKG_ROOT"/images/argus-sys-metric-test-node-bundle-gpu-*.tar.gz 2>/dev/null | head -1 || true)
[[ -n "$IMG_TGZ" ]] || { err "找不到 GPU bundle 镜像 tar.gz"; exit 1; }
info "导入 GPU bundle 镜像: $(basename "$IMG_TGZ")"
tmp=$(mktemp); gunzip -c "$IMG_TGZ" > "$tmp"; docker load -i "$tmp" >/dev/null; rm -f "$tmp"
# 确保日志目录存在(宿主侧,用于映射 /logs/infer 与 /logs/train并赋权 1777粘滞位
mkdir -p "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train"
chmod 1777 "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" || true
info "日志目录已准备并赋权 1777: logs/infer logs/train"
stat -c '%a %U:%G %n' "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" 2>/dev/null || true
# 启动 compose 并跟踪日志
PROJECT="${COMPOSE_PROJECT_NAME:-argus-client}"
info "启动 GPU 节点 (docker compose -p $PROJECT up -d)"
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" up -d
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps
# 再次校准宿主日志目录权限,避免容器内脚本对 bind mount 权限回退
chmod 1777 "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" || true
stat -c '%a %U:%G %n' "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" 2>/dev/null || true
info "跟踪节点容器日志(按 Ctrl+C 退出)"
docker logs -f argus-metric-gpu-node-swarm || true

View File

@ -1,36 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
# load COMPOSE_PROJECT_NAME if provided in compose/.env
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
PROJECT="${COMPOSE_PROJECT_NAME:-argus-client}"
info(){ echo -e "\033[34m[UNINSTALL-GPU]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
# Compose 检测:优先 docker composev2回退 docker-composev1
require_compose(){
if docker compose version >/dev/null 2>&1; then return 0; fi
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
err "未检测到 Docker Compose请安装 docker compose v2 或 docker-compose v1"; exit 1
}
require_compose
if [[ -f "$ENV_FILE" ]]; then
info "stopping compose project (project=$PROJECT)"
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" down --remove-orphans || true
else
info "compose/.env not found; attempting to remove container by name"
fi
# remove warmup container if still running
docker rm -f argus-net-warmup >/dev/null 2>&1 || true
# remove node container if present
docker rm -f argus-metric-gpu-node-swarm >/dev/null 2>&1 || true
info "uninstall completed"

View File

@ -1,169 +0,0 @@
version: "3.8"
networks:
argus-sys-net:
external: true
services:
master:
image: ${MASTER_IMAGE_TAG:-argus-master:${PKG_VERSION}}
container_name: argus-master-sys
environment:
- OFFLINE_THRESHOLD_SECONDS=6
- ONLINE_THRESHOLD_SECONDS=2
- SCHEDULER_INTERVAL_SECONDS=1
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
ports:
- "${MASTER_PORT:-32300}:3000"
volumes:
- ../private/argus/master:/private/argus/master
- ../private/argus/metric/prometheus:/private/argus/metric/prometheus
- ../private/argus/etc:/private/argus/etc
networks:
argus-sys-net:
aliases:
- master.argus.com
restart: unless-stopped
es:
image: ${ES_IMAGE_TAG:-argus-elasticsearch:${PKG_VERSION}}
container_name: argus-es-sys
environment:
- discovery.type=single-node
- xpack.security.enabled=false
- ES_JAVA_OPTS=-Xms512m -Xmx512m
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
volumes:
- ../private/argus/log/elasticsearch:/private/argus/log/elasticsearch
- ../private/argus/etc:/private/argus/etc
ports:
- "${ES_HTTP_PORT:-9200}:9200"
restart: unless-stopped
networks:
argus-sys-net:
aliases:
- es.log.argus.com
kibana:
image: ${KIBANA_IMAGE_TAG:-argus-kibana:${PKG_VERSION}}
container_name: argus-kibana-sys
environment:
- ELASTICSEARCH_HOSTS=http://es.log.argus.com:9200
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
volumes:
- ../private/argus/log/kibana:/private/argus/log/kibana
- ../private/argus/etc:/private/argus/etc
depends_on: [es]
ports:
- "${KIBANA_PORT:-5601}:5601"
restart: unless-stopped
networks:
argus-sys-net:
aliases:
- kibana.log.argus.com
prometheus:
image: ${PROM_IMAGE_TAG:-argus-metric-prometheus:${PKG_VERSION}}
container_name: argus-prometheus
restart: unless-stopped
environment:
- TZ=Asia/Shanghai
- PROMETHEUS_BASE_PATH=/private/argus/metric/prometheus
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
ports:
- "${PROMETHEUS_PORT:-9090}:9090"
volumes:
- ../private/argus/metric/prometheus:/private/argus/metric/prometheus
- ../private/argus/etc:/private/argus/etc
networks:
argus-sys-net:
aliases:
- prom.metric.argus.com
grafana:
image: ${GRAFANA_IMAGE_TAG:-argus-metric-grafana:${PKG_VERSION}}
container_name: argus-grafana
restart: unless-stopped
environment:
- TZ=Asia/Shanghai
- GRAFANA_BASE_PATH=/private/argus/metric/grafana
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
- GF_SERVER_HTTP_PORT=3000
- GF_LOG_LEVEL=warn
- GF_LOG_MODE=console
- GF_PATHS_PROVISIONING=/private/argus/metric/grafana/provisioning
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
ports:
- "${GRAFANA_PORT:-3000}:3000"
volumes:
- ../private/argus/metric/grafana:/private/argus/metric/grafana
- ../private/argus/etc:/private/argus/etc
depends_on: [prometheus]
networks:
argus-sys-net:
aliases:
- grafana.metric.argus.com
alertmanager:
image: ${ALERT_IMAGE_TAG:-argus-alertmanager:${PKG_VERSION}}
container_name: argus-alertmanager
environment:
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
volumes:
- ../private/argus/etc:/private/argus/etc
- ../private/argus/alert/alertmanager:/private/argus/alert/alertmanager
networks:
argus-sys-net:
aliases:
- alertmanager.alert.argus.com
ports:
- "${ALERTMANAGER_PORT:-9093}:9093"
restart: unless-stopped
web-frontend:
image: ${FRONT_IMAGE_TAG:-argus-web-frontend:${PKG_VERSION}}
container_name: argus-web-frontend
environment:
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
- EXTERNAL_MASTER_PORT=${WEB_PROXY_PORT_8085:-8085}
- EXTERNAL_ALERTMANAGER_PORT=${WEB_PROXY_PORT_8084:-8084}
- EXTERNAL_GRAFANA_PORT=${WEB_PROXY_PORT_8081:-8081}
- EXTERNAL_PROMETHEUS_PORT=${WEB_PROXY_PORT_8082:-8082}
- EXTERNAL_KIBANA_PORT=${WEB_PROXY_PORT_8083:-8083}
volumes:
- ../private/argus/etc:/private/argus/etc
networks:
argus-sys-net:
aliases:
- web.argus.com
restart: unless-stopped
web-proxy:
image: ${WEB_PROXY_IMAGE_TAG:-argus-web-proxy:${PKG_VERSION}}
container_name: argus-web-proxy
depends_on: [master, grafana, prometheus, kibana, alertmanager]
environment:
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
volumes:
- ../private/argus/etc:/private/argus/etc
networks:
argus-sys-net:
aliases:
- proxy.argus.com
ports:
- "${WEB_PROXY_PORT_8080:-8080}:8080"
- "${WEB_PROXY_PORT_8081:-8081}:8081"
- "${WEB_PROXY_PORT_8082:-8082}:8082"
- "${WEB_PROXY_PORT_8083:-8083}:8083"
- "${WEB_PROXY_PORT_8084:-8084}:8084"
- "${WEB_PROXY_PORT_8085:-8085}:8085"
restart: unless-stopped

View File

@ -1,102 +0,0 @@
# Argus Server 安装指南deployment_new
适用:通过 Server 安装包在 Docker Swarm + external overlay 网络一体化部署 Argus 服务端组件。
—— 本文强调“怎么做、看什么、符合了才继续”。
## 一、准备条件(开始前确认)
- Docker 与 Docker Compose v2 已安装;`docker info` 正常;`docker compose version` 可执行。
- 具备 root/sudo 权限;磁盘可用空间 ≥ 10GB包根与 `/var/lib/docker`)。
- 你知道本机管理地址SWARM_MANAGER_ADDR该 IP 属于本机某网卡,可被其他节点访问。
- 很重要:以统一账户 `argus`UID=2133GID=2015执行后续安装与运维并将其加入 `docker` 组;示例命令如下(如需不同 UID/GID请替换为贵方标准
```bash
# 1) 创建主组GID=2015组名 argus若已存在可跳过
sudo groupadd --gid 2015 argus || true
# 2) 创建用户 argusUID=2133、主组 GID=2015创建家目录并用 bash 作为默认 shell若已存在可用 usermod 调整)
sudo useradd --uid 2133 --gid 2015 --create-home --shell /bin/bash argus || true
sudo passwd argus
# 3) 将 argus 加入 docker 组,使其能调用 Docker Daemon新登录后生效
sudo usermod -aG docker argus
# 4) 验证(重新登录或执行 newgrp docker 使组生效)
su - argus -c 'id; docker ps >/dev/null && echo OK || echo NO_DOCKER_PERMISSION'
```
后续的解压与执行config/install/selfcheck 等)均使用该 `argus` 账户进行。
## 二、解包与目录结构
- 解压:`tar -xzf server_YYYYMMDD.tar.gz`
- 进入:`cd server_YYYYMMDD/`
- 你应当能看到:
- `images/`(逐服务镜像 tar.gz`argus-master-YYYYMMDD.tar.gz`
- `compose/``docker-compose.yml``.env.example`
- `scripts/`(安装/运维脚本)
- `private/argus/`(数据与配置骨架)
- `docs/`(中文文档)
## 三、配置 config生成 .env 与 SWARM_MANAGER_ADDR
命令:
```
export SWARM_MANAGER_ADDR=<本机管理IP>
./scripts/config.sh
```
脚本做了什么:
- 检查依赖与磁盘空间;
- 自动从“端口 20000 起”分配所有服务端口,确保“系统未占用”且“彼此不冲突”;
- 写入 `compose/.env`(包含端口、镜像 tag、overlay 名称与 UID/GID 等);
- 将当前执行账户的 UID/GID 写入 `ARGUS_BUILD_UID/GID`(若主组名是 docker会改用“与用户名同名的组”的 GID避免拿到 docker 组 999
- 更新/追加 `cluster-info.env` 中的 `SWARM_MANAGER_ADDR`(不会覆盖其他键)。
看到什么才算成功:
- 终端输出:`已生成 compose/.env 并更新 cluster-info.env 的 SWARM_MANAGER_ADDR。`
- `compose/.env` 打开应当看到:
- 端口均 ≥20000 且没有重复;
- `ARGUS_BUILD_UID/GID``id -u/-g` 一致;
- `SWARM_MANAGER_ADDR=<你的IP>`
遇到问题:
- 端口被异常占用:可删去 `.env` 后再次执行 `config.sh`,或手工编辑端口再执行 `install.sh`
## 四、安装 install一次到位
命令:
```
./scripts/install.sh
```
脚本做了什么:
- 若 Swarm 未激活:执行 `docker swarm init --advertise-addr $SWARM_MANAGER_ADDR`
- 确保 external overlay `argus-sys-net` 存在;
- 导入 `images/*.tar.gz` 到本机 Docker
- `docker compose up -d` 启动服务;
- 等待“六项就绪”:
- Master `/readyz`=200、ES `/_cluster/health`=200、Prometheus TCP 可达、Grafana `/api/health`=200、Alertmanager `/api/v2/status`=200、Kibana `/api/status` level=available
- 校验 Docker DNS + overlay alias`argus-web-proxy` 内通过 `getent hosts``curl` 检查 `master.argus.com``grafana.metric.argus.com` 等域名连通性;
- 写出 `cluster-info.env`(含 `SWARM_JOIN_TOKEN_{WORKER,MANAGER}/SWARM_MANAGER_ADDR`compose 架构下不再依赖 BINDIP/FTPIP
- 生成 `安装报告_YYYYMMDD-HHMMSS.md`(端口、健康检查摘要与提示)。
看到什么才算成功:
- `docker compose ps` 全部是 Up
- `安装报告_…md` 中各项 HTTP 检查为 200/available
- `cluster-info.env` 包含五个关键键:
- `SWARM_MANAGER_ADDR=...`
- `SWARM_MANAGER_ADDR=...` `SWARM_JOIN_TOKEN_*=...`
- `SWARM_JOIN_TOKEN_WORKER=SWMTKN-...`
- `SWARM_JOIN_TOKEN_MANAGER=SWMTKN-...`
## 五、健康自检与常用操作
- 健康自检:`./scripts/selfcheck.sh`
- 期望输出:`selfcheck OK -> logs/selfcheck.json`
- 文件 `logs/selfcheck.json``overlay_net/es/kibana/master_readyz/prometheus/grafana/alertmanager/web_proxy_cors` 为 true。
- 状态:`./scripts/status.sh`(相当于 `docker compose ps`)。
- 诊断:`./scripts/diagnose.sh`(收集容器/HTTP/CORS/ES 细节,输出到 `logs/diagnose_*.log`)。
- 卸载:`./scripts/uninstall.sh`Compose down
- ES 磁盘水位临时放宽/还原:`./scripts/es-watermark-relax.sh` / `./scripts/es-watermark-restore.sh`
## 六、下一步:分发 cluster-info.env 给 Client
- 将 `cluster-info.env` 拷贝给安装 Client 的同事;
- 对方在 Client 机器的包根放置该文件(或设置 `CLUSTER_INFO=/绝对路径`)即可。
## 七、故障排查快览
- Proxy 502 或 8080 连接复位:通常是 overlay alias 未生效或 web-proxy 尚未解析到其它服务;重跑 `install.sh`(会重启栈并在容器内校验 DNS或查看 `logs/diagnose_error.log`
- Kibana 不 available等待 12 分钟、查看 `argus-kibana-sys` 日志;
- cluster-info.env 的 SWARM_MANAGER_ADDR 为空:重新 `export SWARM_MANAGER_ADDR=<IP>; ./scripts/config.sh``./scripts/install.sh`(会回读 `.env` 补写)。

View File

@ -1,7 +0,0 @@
# Docker Swarm 部署要点
- 初始化 Swarm`docker swarm init --advertise-addr <SWARM_MANAGER_ADDR>`
- 创建 overlay`docker network create --driver overlay --attachable argus-sys-net`
- Server 包 `install.sh` 自动完成上述操作;如需手动执行,确保 `argus-sys-net` 存在且 attachable。
- Worker 节点加入:`docker swarm join --token <worker_token> <SWARM_MANAGER_ADDR>:2377`

View File

@ -1,11 +0,0 @@
# 故障排查Server
- 端口占用:查看 `安装报告_*.md` 中端口表;如需修改,编辑 `compose/.env` 后执行 `docker compose ... up -d`
- 组件未就绪:
- Master: `curl http://127.0.0.1:${MASTER_PORT}/readyz -I`
- ES: `curl http://127.0.0.1:${ES_HTTP_PORT}/_cluster/health`
- Grafana: `curl http://127.0.0.1:${GRAFANA_PORT}/api/health`
- Prometheus TCP: `exec 3<>/dev/tcp/127.0.0.1/${PROMETHEUS_PORT}`
- 域名解析:进入 `argus-web-proxy``argus-master-sys` 容器:`getent hosts master.argus.com`
- Swarm/Overlay检查 `docker network ls | grep argus-sys-net`,或 `docker node ls`

View File

@ -1,108 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_EX="$PKG_ROOT/compose/.env.example"
ENV_OUT="$PKG_ROOT/compose/.env"
info(){ echo -e "\033[34m[CONFIG]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "缺少依赖: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
# Compose 检测:优先 docker composev2回退 docker-composev1
require_compose(){
if docker compose version >/dev/null 2>&1; then return 0; fi
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
err "未检测到 Docker Compose请安装 docker compose v2 或 docker-compose v1"; exit 1
}
require docker curl jq awk sed tar gzip
require_compose
# 磁盘空间检查MB
check_disk(){ local p="$1"; local need=10240; local free
free=$(df -Pm "$p" | awk 'NR==2{print $4+0}')
if [[ -z "$free" || "$free" -lt "$need" ]]; then err "磁盘空间不足: $p 剩余 ${free:-0}MB (<${need}MB)"; return 1; fi
}
check_disk "$PKG_ROOT"; check_disk "/var/lib/docker" || true
# 读取/生成 SWARM_MANAGER_ADDR
SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}
if [[ -z "${SWARM_MANAGER_ADDR}" ]]; then
read -rp "请输入本机管理地址 SWARM_MANAGER_ADDR: " SWARM_MANAGER_ADDR
fi
info "SWARM_MANAGER_ADDR=$SWARM_MANAGER_ADDR"
# 校验 IP 属于本机网卡
if ! ip -o addr | awk '{print $4}' | cut -d'/' -f1 | grep -qx "$SWARM_MANAGER_ADDR"; then
err "SWARM_MANAGER_ADDR 非本机地址: $SWARM_MANAGER_ADDR"; exit 1; fi
info "开始分配服务端口(起始=20000避免系统占用与相互冲突"
is_port_used(){ local p="$1"; ss -tulnH 2>/dev/null | awk '{print $5}' | sed 's/.*://g' | grep -qx "$p"; }
declare -A PRESENT=() CHOSEN=() USED=()
START_PORT="${START_PORT:-20000}"; cur=$START_PORT
ORDER=(MASTER_PORT ES_HTTP_PORT KIBANA_PORT PROMETHEUS_PORT GRAFANA_PORT ALERTMANAGER_PORT \
WEB_PROXY_PORT_8080 WEB_PROXY_PORT_8081 WEB_PROXY_PORT_8082 WEB_PROXY_PORT_8083 WEB_PROXY_PORT_8084 WEB_PROXY_PORT_8085 \
FTP_PORT FTP_DATA_PORT)
# 标记 .env.example 中实际存在的键
for key in "${ORDER[@]}"; do
if grep -q "^${key}=" "$ENV_EX"; then PRESENT[$key]=1; fi
done
next_free(){ local p="$1"; while :; do if [[ -n "${USED[$p]:-}" ]] || is_port_used "$p"; then p=$((p+1)); else echo "$p"; return; fi; done; }
for key in "${ORDER[@]}"; do
[[ -z "${PRESENT[$key]:-}" ]] && continue
p=$(next_free "$cur"); CHOSEN[$key]="$p"; USED[$p]=1; cur=$((p+1))
done
info "端口分配结果MASTER=${CHOSEN[MASTER_PORT]:-} ES=${CHOSEN[ES_HTTP_PORT]:-} KIBANA=${CHOSEN[KIBANA_PORT]:-} PROM=${CHOSEN[PROMETHEUS_PORT]:-} GRAFANA=${CHOSEN[GRAFANA_PORT]:-} ALERT=${CHOSEN[ALERTMANAGER_PORT]:-} WEB_PROXY(8080..8085)=${CHOSEN[WEB_PROXY_PORT_8080]:-}/${CHOSEN[WEB_PROXY_PORT_8081]:-}/${CHOSEN[WEB_PROXY_PORT_8082]:-}/${CHOSEN[WEB_PROXY_PORT_8083]:-}/${CHOSEN[WEB_PROXY_PORT_8084]:-}/${CHOSEN[WEB_PROXY_PORT_8085]:-}"
cp "$ENV_EX" "$ENV_OUT"
# 覆盖端口(按唯一化结果写回)
for key in "${ORDER[@]}"; do
val="${CHOSEN[$key]:-}"
[[ -z "$val" ]] && continue
sed -i -E "s#^$key=.*#$key=${val}#" "$ENV_OUT"
done
info "已写入 compose/.env 的端口配置"
# 覆盖/补充 Overlay 名称
grep -q '^ARGUS_OVERLAY_NET=' "$ENV_OUT" || echo 'ARGUS_OVERLAY_NET=argus-sys-net' >> "$ENV_OUT"
# 以当前执行账户 UID/GID 写入(避免误选 docker 组)
RUID=$(id -u)
PRIMARY_GID=$(id -g)
PRIMARY_GRP=$(id -gn)
USER_NAME=$(id -un)
# 若主组名被解析为 docker尝试用与用户名同名的组的 GID否则回退主 GID
if [[ "$PRIMARY_GRP" == "docker" ]]; then
RGID=$(getent group "$USER_NAME" | awk -F: '{print $3}' 2>/dev/null || true)
[[ -z "$RGID" ]] && RGID="$PRIMARY_GID"
else
RGID="$PRIMARY_GID"
fi
info "使用构建账户 UID:GID=${RUID}:${RGID} (user=$USER_NAME primary_group=$PRIMARY_GRP)"
if grep -q '^ARGUS_BUILD_UID=' "$ENV_OUT"; then
sed -i -E "s#^ARGUS_BUILD_UID=.*#ARGUS_BUILD_UID=${RUID}#" "$ENV_OUT"
else
echo "ARGUS_BUILD_UID=${RUID}" >> "$ENV_OUT"
fi
if grep -q '^ARGUS_BUILD_GID=' "$ENV_OUT"; then
sed -i -E "s#^ARGUS_BUILD_GID=.*#ARGUS_BUILD_GID=${RGID}#" "$ENV_OUT"
else
echo "ARGUS_BUILD_GID=${RGID}" >> "$ENV_OUT"
fi
CI="$PKG_ROOT/cluster-info.env"
if [[ -f "$CI" ]]; then
if grep -q '^SWARM_MANAGER_ADDR=' "$CI"; then
sed -i -E "s#^SWARM_MANAGER_ADDR=.*#SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}#" "$CI"
else
echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}" >> "$CI"
fi
else
echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}" > "$CI"
fi
info "已生成 compose/.env 并更新 cluster-info.env 的 SWARM_MANAGER_ADDR。"
info "下一步可执行: scripts/install.sh"

View File

@ -1,109 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
ENV_FILE="$ROOT/compose/.env"; [[ -f "$ENV_FILE" ]] && set -a && source "$ENV_FILE" && set +a
ts="$(date -u +%Y%m%d-%H%M%SZ)"
LOG_DIR="$ROOT/logs"; mkdir -p "$LOG_DIR" || true
if ! ( : > "$LOG_DIR/.w" 2>/dev/null ); then LOG_DIR="/tmp/argus-logs"; mkdir -p "$LOG_DIR" || true; fi
# load compose project for accurate ps output
ENV_FILE="$ROOT/compose/.env"
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
DETAILS="$LOG_DIR/diagnose_details_${ts}.log"; ERRORS="$LOG_DIR/diagnose_error_${ts}.log"; : > "$DETAILS"; : > "$ERRORS"
logd() { echo "$(date '+%F %T') $*" >> "$DETAILS"; }
append_err() { echo "$*" >> "$ERRORS"; }
http_code() { curl -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
http_body_head() { curl -s --max-time 3 "$1" 2>/dev/null | sed -n '1,5p' || true; }
header_val() { curl -s -D - -o /dev/null "$@" | awk -F': ' 'BEGIN{IGNORECASE=1}$1=="Access-Control-Allow-Origin"{gsub("\r","",$2);print $2}'; }
section() { local name="$1"; logd "===== [$name] ====="; }
svc() {
local svc_name="$1"; local cname="$2"; shift 2
section "$svc_name ($cname)"
logd "docker ps:"; docker ps -a --format '{{.Names}}\t{{.Status}}\t{{.Image}}' | awk -v n="$cname" '$1==n' >> "$DETAILS" || true
logd "docker inspect:"; docker inspect -f '{{.State.Status}} rc={{.RestartCount}} started={{.State.StartedAt}}' "$cname" >> "$DETAILS" 2>&1 || true
logd "last 200 container logs:"; docker logs --tail 200 "$cname" >> "$DETAILS" 2>&1 || true
docker logs --tail 200 "$cname" 2>&1 | grep -Ei '\\b(error|failed|fail|exception|panic|fatal|critical|unhealthy|permission denied|forbidden|refused|traceback|错误|失败)\\b' | sed "s/^/[${svc_name}][container] /" >> "$ERRORS" || true
if docker exec "$cname" sh -lc 'command -v supervisorctl >/dev/null 2>&1' >/dev/null 2>&1; then
logd "supervisorctl status:"; docker exec "$cname" sh -lc 'supervisorctl status' >> "$DETAILS" 2>&1 || true
local files; files=$(docker exec "$cname" sh -lc 'ls /var/log/supervisor/*.log 2>/dev/null' || true)
for f in $files; do
logd "tail -n 80 $f:"; docker exec "$cname" sh -lc "tail -n 80 $f 2>/dev/null || true" >> "$DETAILS" 2>&1 || true
docker exec "$cname" sh -lc "tail -n 200 $f 2>/dev/null" 2>/dev/null | grep -Ei '\\b(error|failed|fail|exception|panic|fatal|critical|unhealthy|permission denied|forbidden|refused|traceback|错误|失败)\\b' | sed "s/^/[${svc_name}][supervisor:$(basename $f)] /" >> "$ERRORS" || true
done
fi
}
svc master argus-master-sys
svc es argus-es-sys
svc kibana argus-kibana-sys
svc prometheus argus-prometheus
svc grafana argus-grafana
svc alertmanager argus-alertmanager
svc web-frontend argus-web-frontend
svc web-proxy argus-web-proxy
section HTTP
logd "ES: $(http_code \"http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health\")"; http_body_head "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" >> "$DETAILS" 2>&1 || true
logd "Kibana: $(http_code \"http://localhost:${KIBANA_PORT:-5601}/api/status\")"; http_body_head "http://localhost:${KIBANA_PORT:-5601}/api/status" >> "$DETAILS" 2>&1 || true
logd "Master readyz: $(http_code \"http://localhost:${MASTER_PORT:-32300}/readyz\")"
logd "Prometheus: $(http_code \"http://localhost:${PROMETHEUS_PORT:-9090}/-/ready\")"
logd "Grafana: $(http_code \"http://localhost:${GRAFANA_PORT:-3000}/api/health\")"; http_body_head "http://localhost:${GRAFANA_PORT:-3000}/api/health" >> "$DETAILS" 2>&1 || true
logd "Alertmanager: $(http_code \"http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status\")"
cors8084=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8084:-8084}/api/v2/status" || true)
cors8085=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8085:-8085}/api/v1/master/nodes" || true)
logd "Web-Proxy 8080: $(http_code \"http://localhost:${WEB_PROXY_PORT_8080:-8080}/\")"
logd "Web-Proxy 8083: $(http_code \"http://localhost:${WEB_PROXY_PORT_8083:-8083}/\")"
logd "Web-Proxy 8084 CORS: ${cors8084}"
logd "Web-Proxy 8085 CORS: ${cors8085}"
section ES-CHECKS
ch=$(curl -s --max-time 3 "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" || true)
status=$(printf '%s' "$ch" | awk -F'\"' '/"status"/{print $4; exit}')
if [[ -n "$status" ]]; then logd "cluster.status=$status"; fi
if [[ "$status" != "green" ]]; then append_err "[es][cluster] status=$status"; fi
if docker ps --format '{{.Names}}' | grep -q '^argus-es-sys$'; then
duse=$(docker exec argus-es-sys sh -lc 'df -P /usr/share/elasticsearch/data | awk "NR==2{print \$5}"' 2>/dev/null || true)
logd "es.data.df_use=$duse"; usep=${duse%%%}
if [[ -n "$usep" ]] && (( usep >= 90 )); then append_err "[es][disk] data path usage=${usep}%"; fi
fi
section DNS-IN-PROXY
for d in master.argus.com es.log.argus.com kibana.log.argus.com grafana.metric.argus.com prom.metric.argus.com alertmanager.alert.argus.com; do
docker exec argus-web-proxy sh -lc "getent hosts $d || nslookup $d 2>/dev/null | tail -n+1" >> "$DETAILS" 2>&1 || true
done
logd "HTTP (web-proxy): master.readyz=$(docker exec argus-web-proxy sh -lc \"curl -s -o /dev/null -w '%{http_code}' http://master.argus.com:3000/readyz\" 2>/dev/null || echo 000)"
logd "HTTP (web-proxy): es.health=$(docker exec argus-web-proxy sh -lc \"curl -s -o /dev/null -w '%{http_code}' http://es.log.argus.com:9200/_cluster/health\" 2>/dev/null || echo 000)"
logd "HTTP (web-proxy): kibana.status=$(docker exec argus-web-proxy sh -lc \"curl -s -o /dev/null -w '%{http_code}' http://kibana.log.argus.com:5601/api/status\" 2>/dev/null || echo 000)"
section SYSTEM
logd "uname -a:"; uname -a >> "$DETAILS"
logd "docker version:"; docker version --format '{{.Server.Version}}' >> "$DETAILS" 2>&1 || true
logd "compose ps (project=$PROJECT):"; (cd "$ROOT/compose" && docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f docker-compose.yml ps) >> "$DETAILS" 2>&1 || true
section SUMMARY
[[ $(http_code "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health") != 200 ]] && echo "[es][http] health not 200" >> "$ERRORS"
kbcode=$(http_code "http://localhost:${KIBANA_PORT:-5601}/api/status"); [[ "$kbcode" != 200 ]] && echo "[kibana][http] /api/status=$kbcode" >> "$ERRORS"
[[ $(http_code "http://localhost:${MASTER_PORT:-32300}/readyz") != 200 ]] && echo "[master][http] /readyz not 200" >> "$ERRORS"
[[ $(http_code "http://localhost:${PROMETHEUS_PORT:-9090}/-/ready") != 200 ]] && echo "[prometheus][http] /-/ready not 200" >> "$ERRORS"
gfcode=$(http_code "http://localhost:${GRAFANA_PORT:-3000}/api/health"); [[ "$gfcode" != 200 ]] && echo "[grafana][http] /api/health=$gfcode" >> "$ERRORS"
[[ $(http_code "http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status") != 200 ]] && echo "[alertmanager][http] /api/v2/status not 200" >> "$ERRORS"
[[ -z "$cors8084" ]] && echo "[web-proxy][cors] 8084 missing Access-Control-Allow-Origin" >> "$ERRORS"
[[ -z "$cors8085" ]] && echo "[web-proxy][cors] 8085 missing Access-Control-Allow-Origin" >> "$ERRORS"
sort -u -o "$ERRORS" "$ERRORS"
echo "Diagnostic details -> $DETAILS"
echo "Detected errors -> $ERRORS"
if [[ "$LOG_DIR" == "$ROOT/logs" ]]; then
ln -sfn "$(basename "$DETAILS")" "$ROOT/logs/diagnose_details.log" 2>/dev/null || cp "$DETAILS" "$ROOT/logs/diagnose_details.log" 2>/dev/null || true
ln -sfn "$(basename "$ERRORS")" "$ROOT/logs/diagnose_error.log" 2>/dev/null || cp "$ERRORS" "$ROOT/logs/diagnose_error.log" 2>/dev/null || true
fi
exit 0

View File

@ -1,11 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
HOST="${1:-http://127.0.0.1:9200}"
echo "设置 ES watermark 为 95%/96%/97%: $HOST"
curl -fsS -XPUT "$HOST/_cluster/settings" -H 'Content-Type: application/json' -d '{
"transient": {
"cluster.routing.allocation.disk.watermark.low": "95%",
"cluster.routing.allocation.disk.watermark.high": "96%",
"cluster.routing.allocation.disk.watermark.flood_stage": "97%"
}
}' && echo "\nOK"

View File

@ -1,11 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
HOST="${1:-http://127.0.0.1:9200}"
echo "恢复 ES watermark 为默认值: $HOST"
curl -fsS -XPUT "$HOST/_cluster/settings" -H 'Content-Type: application/json' -d '{
"transient": {
"cluster.routing.allocation.disk.watermark.low": null,
"cluster.routing.allocation.disk.watermark.high": null,
"cluster.routing.allocation.disk.watermark.flood_stage": null
}
}' && echo "\nOK"

View File

@ -1,137 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
info(){ echo -e "\033[34m[INSTALL]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "缺少依赖: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
# Compose 检测:优先 docker composev2回退 docker-composev1
require_compose(){
if docker compose version >/dev/null 2>&1; then return 0; fi
if command -v docker-compose >/devnull 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
err "未检测到 Docker Compose请安装 docker compose v2 或 docker-compose v1"; exit 1
}
require docker curl jq awk sed tar gzip
require_compose
[[ -f "$ENV_FILE" ]] || { err "缺少 compose/.env请先运行 scripts/config.sh"; exit 1; }
info "使用环境文件: $ENV_FILE"
set -a; source "$ENV_FILE"; set +a
# 兼容:若 .env 未包含 SWARM_MANAGER_ADDR则从已存在的 cluster-info.env 读取以避免写空
SMADDR="${SWARM_MANAGER_ADDR:-}"
CI_FILE="$PKG_ROOT/cluster-info.env"
if [[ -z "$SMADDR" && -f "$CI_FILE" ]]; then
SMADDR=$(sed -n 's/^SWARM_MANAGER_ADDR=\(.*\)$/\1/p' "$CI_FILE" | head -n1)
fi
SWARM_MANAGER_ADDR="$SMADDR"
# Swarm init & overlay
if ! docker info 2>/dev/null | grep -q "Swarm: active"; then
[[ -n "${SWARM_MANAGER_ADDR:-}" ]] || { err "SWARM_MANAGER_ADDR 未设置,请在 scripts/config.sh 中配置"; exit 1; }
info "初始化 Swarm (--advertise-addr $SWARM_MANAGER_ADDR)"
docker swarm init --advertise-addr "$SWARM_MANAGER_ADDR" >/dev/null 2>&1 || true
else
info "Swarm 已激活"
fi
NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
if ! docker network inspect "$NET_NAME" >/dev/null 2>&1; then
info "创建 overlay 网络: $NET_NAME"
docker network create -d overlay --attachable "$NET_NAME" >/dev/null
else
info "overlay 网络已存在: $NET_NAME"
fi
# Load images
IMAGES_DIR="$PKG_ROOT/images"
shopt -s nullglob
tars=("$IMAGES_DIR"/*.tar.gz)
if [[ ${#tars[@]} -eq 0 ]]; then err "images 目录为空,缺少镜像 tar.gz"; exit 1; fi
total=${#tars[@]}; idx=0
for tgz in "${tars[@]}"; do
idx=$((idx+1))
info "导入镜像 ($idx/$total): $(basename "$tgz")"
tmp=$(mktemp); gunzip -c "$tgz" > "$tmp"; docker load -i "$tmp" >/dev/null; rm -f "$tmp"
done
shopt -u nullglob
# Compose up
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
info "启动服务栈 (docker compose -p $PROJECT up -d)"
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" up -d
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps
# Wait readiness (best-effort)
code(){ curl -4 -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
prom_ok(){ (exec 3<>/dev/tcp/127.0.0.1/${PROMETHEUS_PORT:-9090}) >/dev/null 2>&1 && return 0 || return 1; }
kb_ok(){ local body; body=$(curl -s "http://127.0.0.1:${KIBANA_PORT:-5601}/api/status" || true); echo "$body" | grep -q '"level"\s*:\s*"available"'; }
RETRIES=${RETRIES:-60}; SLEEP=${SLEEP:-5}; ok=0
info "等待基础服务就绪 (<= $((RETRIES*SLEEP))s)"
for i in $(seq 1 "$RETRIES"); do
e1=$(code "http://127.0.0.1:${MASTER_PORT:-32300}/readyz")
e2=$(code "http://127.0.0.1:${ES_HTTP_PORT:-9200}/_cluster/health")
e3=000; prom_ok && e3=200
e4=$(code "http://127.0.0.1:${GRAFANA_PORT:-3000}/api/health")
e5=$(code "http://127.0.0.1:${ALERTMANAGER_PORT:-9093}/api/v2/status")
e6=$(kb_ok && echo 200 || echo 000)
info "[ready] t=$((i*SLEEP))s master=$e1 es=$e2 prom=$e3 graf=$e4 alert=$e5 kibana=$e6"
[[ "$e1" == 200 ]] && ok=$((ok+1))
[[ "$e2" == 200 ]] && ok=$((ok+1))
[[ "$e3" == 200 ]] && ok=$((ok+1))
[[ "$e4" == 200 ]] && ok=$((ok+1))
[[ "$e5" == 200 ]] && ok=$((ok+1))
[[ "$e6" == 200 ]] && ok=$((ok+1))
if [[ $ok -ge 6 ]]; then break; fi; ok=0; sleep "$SLEEP"
done
[[ $ok -ge 6 ]] || err "部分服务未就绪(可稍后重试 selfcheck"
# Swarm join tokens
TOKEN_WORKER=$(docker swarm join-token -q worker 2>/dev/null || echo "")
TOKEN_MANAGER=$(docker swarm join-token -q manager 2>/dev/null || echo "")
# cluster-info.envcompose 场景下不再依赖 BINDIP/FTPIP
CI="$PKG_ROOT/cluster-info.env"
info "写入 cluster-info.env (manager/token)"
{
echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}"
echo "SWARM_JOIN_TOKEN_WORKER=${TOKEN_WORKER:-}"
echo "SWARM_JOIN_TOKEN_MANAGER=${TOKEN_MANAGER:-}"
} > "$CI"
info "已输出 $CI"
# 安装报告
ts=$(date +%Y%m%d-%H%M%S)
RPT="$PKG_ROOT/安装报告_${ts}.md"
{
echo "# Argus Server 安装报告 (${ts})"
echo
echo "## 端口映射"
echo "- MASTER_PORT=${MASTER_PORT}"
echo "- ES_HTTP_PORT=${ES_HTTP_PORT}"
echo "- KIBANA_PORT=${KIBANA_PORT}"
echo "- PROMETHEUS_PORT=${PROMETHEUS_PORT}"
echo "- GRAFANA_PORT=${GRAFANA_PORT}"
echo "- ALERTMANAGER_PORT=${ALERTMANAGER_PORT}"
echo "- WEB_PROXY_PORT_8080=${WEB_PROXY_PORT_8080} ... 8085=${WEB_PROXY_PORT_8085}"
echo
echo "## Swarm/Overlay"
echo "- SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}"
echo "- NET=${NET_NAME}"
echo "- JOIN_TOKEN_WORKER=${TOKEN_WORKER:-}"
echo "- JOIN_TOKEN_MANAGER=${TOKEN_MANAGER:-}"
echo
echo "## 健康检查(简要)"
echo "- master/readyz=$(code http://127.0.0.1:${MASTER_PORT:-32300}/readyz)"
echo "- es/_cluster/health=$(code http://127.0.0.1:${ES_HTTP_PORT:-9200}/_cluster/health)"
echo "- grafana/api/health=$(code http://127.0.0.1:${GRAFANA_PORT:-3000}/api/health)"
echo "- prometheus/tcp=$([[ $(prom_ok; echo $?) == 0 ]] && echo 200 || echo 000)"
echo "- alertmanager/api/v2/status=$(code http://127.0.0.1:${ALERTMANAGER_PORT:-9093}/api/v2/status)"
echo "- kibana/api/status=$([[ $(kb_ok; echo $?) == 0 ]] && echo available || echo not-ready)"
} > "$RPT"
info "已生成报告: $RPT"
info "安装完成。可将 cluster-info.env 分发给 Client-GPU 安装方。"
docker exec argus-web-proxy nginx -t >/dev/null 2>&1 && docker exec argus-web-proxy nginx -s reload >/dev/null 2>&1 || true

View File

@ -1,83 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
log() { echo -e "\033[0;34m[CHECK]\033[0m $*"; }
err() { echo -e "\033[0;31m[ERROR]\033[0m $*" >&2; }
ENV_FILE="$ROOT/compose/.env"; [[ -f "$ENV_FILE" ]] && set -a && source "$ENV_FILE" && set +a
wait_http() { local url="$1"; local attempts=${2:-120}; local i=1; while ((i<=attempts)); do curl -fsS "$url" >/dev/null 2>&1 && return 0; echo "[..] waiting $url ($i/$attempts)"; sleep 5; ((i++)); done; return 1; }
code_for() { curl -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
header_val() { curl -s -D - -o /dev/null "$@" | awk -F': ' 'BEGIN{IGNORECASE=1}$1=="Access-Control-Allow-Origin"{gsub("\r","",$2);print $2}'; }
LOG_DIR="$ROOT/logs"; mkdir -p "$LOG_DIR" || true
OUT_JSON="$LOG_DIR/selfcheck.json"; tmp=$(mktemp)
ok=1
log "checking overlay network"
net_ok=false
if docker network inspect "${ARGUS_OVERLAY_NET:-argus-sys-net}" >/dev/null 2>&1; then
if docker network inspect "${ARGUS_OVERLAY_NET:-argus-sys-net}" | grep -q '"Driver": "overlay"'; then net_ok=true; fi
fi
[[ "$net_ok" == true ]] || ok=0
log "checking Elasticsearch"
wait_http "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" 60 || ok=0
log "checking Kibana"
kb_code=$(code_for "http://localhost:${KIBANA_PORT:-5601}/api/status")
kb_ok=false
if [[ "$kb_code" == 200 ]]; then
body=$(curl -sS "http://localhost:${KIBANA_PORT:-5601}/api/status" || true)
echo "$body" | grep -q '"level"\s*:\s*"available"' && kb_ok=true
fi
[[ "$kb_ok" == true ]] || ok=0
log "checking Master"
[[ $(code_for "http://localhost:${MASTER_PORT:-32300}/readyz") == 200 ]] || ok=0
log "checking Prometheus"
wait_http "http://localhost:${PROMETHEUS_PORT:-9090}/-/ready" 60 || ok=0
log "checking Grafana"
gf_code=$(code_for "http://localhost:${GRAFANA_PORT:-3000}/api/health")
gf_ok=false; if [[ "$gf_code" == 200 ]]; then body=$(curl -sS "http://localhost:${GRAFANA_PORT:-3000}/api/health" || true); echo "$body" | grep -q '"database"\s*:\s*"ok"' && gf_ok=true; fi
[[ "$gf_ok" == true ]] || ok=0
log "checking Alertmanager"
wait_http "http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status" 60 || ok=0
log "checking Web-Proxy (CORS)"
cors8084=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8084:-8084}/api/v2/status" || true)
cors8085=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8085:-8085}/api/v1/master/nodes" || true)
wp_ok=true
[[ -n "$cors8084" && -n "$cors8085" ]] || wp_ok=false
[[ "$wp_ok" == true ]] || ok=0
cat > "$tmp" <<JSON
{
"overlay_net": $net_ok,
"es": true,
"kibana": $kb_ok,
"master_readyz": true,
"prometheus": true,
"grafana": $gf_ok,
"alertmanager": true,
"web_proxy_cors": $wp_ok,
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
JSON
mv "$tmp" "$OUT_JSON" 2>/dev/null || cp "$tmp" "$OUT_JSON"
if [[ "$ok" == 1 ]]; then
log "selfcheck OK -> $OUT_JSON"
exit 0
else
err "selfcheck FAILED -> $OUT_JSON"
exit 1
fi

View File

@ -1,9 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps

View File

@ -1,23 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
# load COMPOSE_PROJECT_NAME from env file if present
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
# Compose 检测:优先 docker composev2回退 docker-composev1
require_compose(){
if docker compose version >/dev/null 2>&1; then return 0; fi
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
err "未检测到 Docker Compose请安装 docker compose v2 或 docker-compose v1"; exit 1
}
require_compose
echo "[UNINSTALL] stopping compose (project=$PROJECT)"
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" down --remove-orphans || true
echo "[UNINSTALL] done"

Binary file not shown.

View File

@ -37,11 +37,22 @@ _argus_is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
_argus_read_user_from_files() {
local uid_out_var="$1" gid_out_var="$2"; shift 2
local uid_val="$ARGUS_BUILD_UID_DEFAULT" gid_val="$ARGUS_BUILD_GID_DEFAULT"
local config
for config in "$@"; do
load_build_user() {
if [[ "$_ARGUS_BUILD_USER_LOADED" == "1" ]]; then
return 0
fi
local project_root config_files config uid gid
project_root="$(argus_project_root)"
config_files=(
"$project_root/configs/build_user.local.conf"
"$project_root/configs/build_user.conf"
)
uid="$ARGUS_BUILD_UID_DEFAULT"
gid="$ARGUS_BUILD_GID_DEFAULT"
for config in "${config_files[@]}"; do
if [[ -f "$config" ]]; then
while IFS= read -r raw_line || [[ -n "$raw_line" ]]; do
local line key value
@ -57,58 +68,42 @@ _argus_read_user_from_files() {
key="$(_argus_trim "$key")"
value="$(_argus_trim "$value")"
case "$key" in
UID) uid_val="$value" ;;
GID) gid_val="$value" ;;
*) echo "[ARGUS build_user] Unknown key '$key' in $config" >&2 ;;
UID)
uid="$value"
;;
GID)
gid="$value"
;;
*)
echo "[ARGUS build_user] Unknown key '$key' in $config" >&2
;;
esac
done < "$config"
break
fi
done
printf -v "$uid_out_var" '%s' "$uid_val"
printf -v "$gid_out_var" '%s' "$gid_val"
}
load_build_user_profile() {
local profile="${1:-default}"
if [[ "$_ARGUS_BUILD_USER_LOADED" == "1" ]]; then
return 0
if [[ -n "${ARGUS_BUILD_UID:-}" ]]; then
uid="$ARGUS_BUILD_UID"
fi
if [[ -n "${ARGUS_BUILD_GID:-}" ]]; then
gid="$ARGUS_BUILD_GID"
fi
local project_root uid gid
project_root="$(argus_project_root)"
case "$profile" in
pkg)
_argus_read_user_from_files uid gid \
"$project_root/configs/build_user.pkg.conf" \
"$project_root/configs/build_user.local.conf" \
"$project_root/configs/build_user.conf"
;;
default|*)
_argus_read_user_from_files uid gid \
"$project_root/configs/build_user.local.conf" \
"$project_root/configs/build_user.conf"
;;
esac
if [[ -n "${ARGUS_BUILD_UID:-}" ]]; then uid="$ARGUS_BUILD_UID"; fi
if [[ -n "${ARGUS_BUILD_GID:-}" ]]; then gid="$ARGUS_BUILD_GID"; fi
if ! _argus_is_number "$uid"; then
echo "[ARGUS build_user] Invalid UID '$uid'" >&2; return 1
echo "[ARGUS build_user] Invalid UID '$uid'" >&2
return 1
fi
if ! _argus_is_number "$gid"; then
echo "[ARGUS build_user] Invalid GID '$gid'" >&2; return 1
echo "[ARGUS build_user] Invalid GID '$gid'" >&2
return 1
fi
export ARGUS_BUILD_UID="$uid"
export ARGUS_BUILD_GID="$gid"
_ARGUS_BUILD_USER_LOADED=1
}
load_build_user() {
local profile="${ARGUS_BUILD_PROFILE:-default}"
load_build_user_profile "$profile"
}
argus_build_user_args() {
load_build_user
printf '%s' "--build-arg ARGUS_BUILD_UID=${ARGUS_BUILD_UID} --build-arg ARGUS_BUILD_GID=${ARGUS_BUILD_GID}"

View File

@ -3,4 +3,3 @@ build/
__pycache__/
.env
dist/

View File

@ -34,18 +34,6 @@ Agent 不再依赖配置文件;所有参数均由环境变量与主机名推
| `MASTER_ENDPOINT` | 是 | N/A | Master 基础地址,可写 `http://host:3000``host:3000`(自动补全 `http://`)。 |
| `REPORT_INTERVAL_SECONDS` | 否 | `60` | 状态上报间隔(秒)。必须为正整数。 |
| `AGENT_HOSTNAME` | 否 | `$(hostname)` | 覆盖容器内主机名,便于测试或特殊命名需求。 |
| `AGENT_ENV` | 否 | 来源于主机名 | 运行环境标识(如 `dev``prod`)。与 `AGENT_USER``AGENT_INSTANCE` 必须同时设置。 |
| `AGENT_USER` | 否 | 来源于主机名 | 归属用户或团队标识。与 `AGENT_ENV``AGENT_INSTANCE` 必须同时设置。 |
| `AGENT_INSTANCE` | 否 | 来源于主机名 | 实例编号或别名。与 `AGENT_ENV``AGENT_USER` 必须同时设置。 |
主机名与元数据的解析优先级:
1. 若设置 `AGENT_ENV` / `AGENT_USER` / `AGENT_INSTANCE` 且全部存在,则直接使用这些值。
2. 否则检查历史 `node.json`(注册成功后由 Master 返回的信息),若包含 `env` / `user` / `instance` 则沿用。
3. 若以上均不可用,则按历史约定从主机名解析 `env-user-instance` 前缀。
4. 如果仍无法得到完整结果Agent 启动会失败并提示需要提供上述环境变量。
> 提示在首次部署时需确保环境变量或主机名能够提供完整信息。完成注册后Agent 会把 Master 返回的元数据写入 `node.json`,后续重启无需再次提供环境变量就能保持一致性。
派生路径:

View File

@ -4,7 +4,6 @@ import os
import re
import socket
import subprocess
import ipaddress
from pathlib import Path
from typing import Any, Dict
@ -17,50 +16,15 @@ _HOSTNAME_PATTERN = re.compile(r"^([^-]+)-([^-]+)-([^-]+)-.*$")
def collect_metadata(config: AgentConfig) -> Dict[str, Any]:
"""汇总节点注册需要的静态信息,带有更智能的 IP 选择。
规则从高到低
1) AGENT_PUBLISH_IP 指定
2) Hostname A 记录若命中优先网段
3) 网卡扫描排除 AGENT_EXCLUDE_IFACES优先 AGENT_PREFER_NET_CIDRS
4) 默认路由回退UDP socket 技巧
额外发布overlay_ip / gwbridge_ip / interfaces便于 Master 与诊断使用
"""
"""汇总节点注册需要的静态信息。"""
hostname = config.hostname
prefer_cidrs = _read_cidrs_env(
os.environ.get("AGENT_PREFER_NET_CIDRS", "10.0.0.0/8,172.31.0.0/16")
)
exclude_ifaces = _read_csv_env(
os.environ.get("AGENT_EXCLUDE_IFACES", "docker_gwbridge,lo")
)
# interface inventory
interfaces = _list_global_ipv4_addrs()
if exclude_ifaces:
interfaces = [it for it in interfaces if it[0] not in set(exclude_ifaces)]
# resolve hostname candidates
host_ips = _resolve_hostname_ips(hostname)
selected_ip, overlay_ip, gwbridge_ip = _select_publish_ips(
interfaces=interfaces,
host_ips=host_ips,
prefer_cidrs=prefer_cidrs,
)
meta: Dict[str, Any] = {
env, user, instance = _parse_hostname(hostname)
meta = {
"hostname": hostname,
"ip": os.environ.get("AGENT_PUBLISH_IP", selected_ip), # keep required field
"overlay_ip": overlay_ip or selected_ip,
"gwbridge_ip": gwbridge_ip,
"interfaces": [
{"iface": name, "ip": ip} for name, ip in interfaces
],
"env": config.environment,
"user": config.user,
"instance": config.instance,
"ip": _detect_ip_address(),
"env": env,
"user": user,
"instance": instance,
"cpu_number": _detect_cpu_count(),
"memory_in_bytes": _detect_memory_bytes(),
"gpu_number": _detect_gpu_count(),
@ -133,7 +97,7 @@ def _detect_gpu_count() -> int:
def _detect_ip_address() -> str:
"""保留旧接口,作为最终回退:默认路由源地址 → 主机名解析 → 127.0.0.1"""
"""尝试通过 UDP socket 获得容器出口 IP失败则回退解析主机名"""
try:
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
sock.connect(("8.8.8.8", 80))
@ -145,118 +109,3 @@ def _detect_ip_address() -> str:
except OSError:
LOGGER.warning("Unable to resolve hostname to IP; defaulting to 127.0.0.1")
return "127.0.0.1"
def _read_csv_env(raw: str | None) -> list[str]:
if not raw:
return []
return [x.strip() for x in raw.split(",") if x.strip()]
def _read_cidrs_env(raw: str | None) -> list[ipaddress.IPv4Network]:
cidrs: list[ipaddress.IPv4Network] = []
for item in _read_csv_env(raw):
try:
net = ipaddress.ip_network(item, strict=False)
if isinstance(net, (ipaddress.IPv4Network,)):
cidrs.append(net)
except ValueError:
LOGGER.warning("Ignoring invalid CIDR in AGENT_PREFER_NET_CIDRS", extra={"cidr": item})
return cidrs
def _list_global_ipv4_addrs() -> list[tuple[str, str]]:
"""列出 (iface, ip) 形式的全局 IPv4 地址。
依赖 iproute2ip -4 -o addr show scope global
"""
results: list[tuple[str, str]] = []
try:
proc = subprocess.run(
["sh", "-lc", "ip -4 -o addr show scope global | awk '{print $2, $4}'"],
check=False,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
timeout=3,
)
if proc.returncode == 0:
for line in proc.stdout.splitlines():
line = line.strip()
if not line:
continue
parts = line.split()
if len(parts) != 2:
continue
iface, cidr = parts
ip = cidr.split("/")[0]
try:
ipaddress.IPv4Address(ip)
except ValueError:
continue
results.append((iface, ip))
except Exception as exc: # pragma: no cover - defensive
LOGGER.debug("Failed to list interfaces", extra={"error": str(exc)})
return results
def _resolve_hostname_ips(name: str) -> list[str]:
ips: list[str] = []
try:
infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
for info in infos:
ip = info[4][0]
if ip not in ips:
ips.append(ip)
except OSError:
pass
return ips
def _pick_by_cidrs(candidates: list[str], prefer_cidrs: list[ipaddress.IPv4Network]) -> str | None:
for net in prefer_cidrs:
for ip in candidates:
try:
if ipaddress.ip_address(ip) in net:
return ip
except ValueError:
continue
return None
def _select_publish_ips(
*,
interfaces: list[tuple[str, str]],
host_ips: list[str],
prefer_cidrs: list[ipaddress.IPv4Network],
) -> tuple[str, str | None, str | None]:
"""返回 (selected_ip, overlay_ip, gwbridge_ip)。
- overlay_ip优先命中 prefer_cidrs10.0/8 先于 172.31/16
- gwbridge_ip若存在 172.22/16 则记录
- selected_ip优先 AGENT_PUBLISH_IP否则 overlay_ip否则 hostname A 记录中的 prefer否则默认路由回退
"""
# detect gwbridge (172.22/16)
gwbridge_net = ipaddress.ip_network("172.22.0.0/16")
gwbridge_ip = None
for _, ip in interfaces:
try:
if ipaddress.ip_address(ip) in gwbridge_net:
gwbridge_ip = ip
break
except ValueError:
continue
# overlay candidate from interfaces by prefer cidrs
iface_ips = [ip for _, ip in interfaces]
overlay_ip = _pick_by_cidrs(iface_ips, prefer_cidrs)
# hostname A records filtered by prefer cidrs
host_pref = _pick_by_cidrs(host_ips, prefer_cidrs)
env_ip = os.environ.get("AGENT_PUBLISH_IP")
if env_ip:
selected = env_ip
else:
selected = overlay_ip or host_pref or _detect_ip_address()
return selected, overlay_ip, gwbridge_ip

View File

@ -6,21 +6,14 @@ from dataclasses import dataclass
from pathlib import Path
from typing import Final
from .state import load_node_state
from .version import VERSION
from .log import get_logger
DEFAULT_REPORT_INTERVAL_SECONDS: Final[int] = 60
LOGGER = get_logger("argus.agent.config")
@dataclass(frozen=True)
class AgentConfig:
hostname: str
environment: str
user: str
instance: str
node_file: str
version: str
master_endpoint: str
@ -54,68 +47,11 @@ def _resolve_hostname() -> str:
return os.environ.get("AGENT_HOSTNAME") or socket.gethostname()
def _load_metadata_from_state(node_file: str) -> tuple[str, str, str] | None:
state = load_node_state(node_file)
if not state:
return None
meta = state.get("meta_data") or {}
env = meta.get("env") or state.get("env")
user = meta.get("user") or state.get("user")
instance = meta.get("instance") or state.get("instance")
if env and user and instance:
LOGGER.debug("Metadata resolved from node state", extra={"node_file": node_file})
return env, user, instance
LOGGER.warning(
"node.json missing metadata fields; ignoring",
extra={"node_file": node_file, "meta_data": meta},
)
return None
def _resolve_metadata_fields(hostname: str, node_file: str) -> tuple[str, str, str]:
env = os.environ.get("AGENT_ENV")
user = os.environ.get("AGENT_USER")
instance = os.environ.get("AGENT_INSTANCE")
if env and user and instance:
return env, user, instance
if any([env, user, instance]):
LOGGER.warning(
"Incomplete metadata environment variables; falling back to persisted metadata",
extra={
"has_env": bool(env),
"has_user": bool(user),
"has_instance": bool(instance),
},
)
state_metadata = _load_metadata_from_state(node_file)
if state_metadata is not None:
return state_metadata
from .collector import _parse_hostname # Local import to avoid circular dependency
env, user, instance = _parse_hostname(hostname)
if not all([env, user, instance]):
raise ValueError(
"Failed to determine metadata fields; set AGENT_ENV/USER/INSTANCE or use supported hostname pattern"
)
return env, user, instance
def load_config() -> AgentConfig:
"""从环境变量推导配置,移除了外部配置文件依赖。"""
hostname = _resolve_hostname()
node_file = f"/private/argus/agent/{hostname}/node.json"
environment, user, instance = _resolve_metadata_fields(hostname, node_file)
health_dir = f"/private/argus/agent/{hostname}/health/"
master_endpoint_env = os.environ.get("MASTER_ENDPOINT")
@ -130,9 +66,6 @@ def load_config() -> AgentConfig:
return AgentConfig(
hostname=hostname,
environment=environment,
user=user,
instance=instance,
node_file=node_file,
version=VERSION,
master_endpoint=master_endpoint,

BIN
src/agent/dist/argus-agent vendored Executable file

Binary file not shown.

View File

@ -12,8 +12,6 @@ VENV_DIR="$BUILD_ROOT/venv"
AGENT_BUILD_IMAGE="${AGENT_BUILD_IMAGE:-python:3.11-slim-bullseye}"
AGENT_BUILD_USE_DOCKER="${AGENT_BUILD_USE_DOCKER:-1}"
# 默认在容器内忽略代理以避免公司内网代理在 Docker 网络不可达导致 pip 失败(可用 0 关闭)
AGENT_BUILD_IGNORE_PROXY="${AGENT_BUILD_IGNORE_PROXY:-1}"
USED_DOCKER=0
run_host_build() {
@ -73,7 +71,6 @@ run_docker_build() {
pass_env_if_set http_proxy
pass_env_if_set https_proxy
pass_env_if_set no_proxy
pass_env_if_set AGENT_BUILD_IGNORE_PROXY
build_script=$(cat <<'INNER'
set -euo pipefail
@ -85,10 +82,6 @@ rm -rf build dist
mkdir -p build/pyinstaller dist
python3 -m venv --copies build/venv
source build/venv/bin/activate
# 若指定忽略代理,则清空常见代理与 pip 镜像环境变量,避免容器内代理不可达
if [ "${AGENT_BUILD_IGNORE_PROXY:-1}" = "1" ]; then
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY PIP_INDEX_URL PIP_EXTRA_INDEX_URL PIP_TRUSTED_HOST
fi
pip install --upgrade pip
pip install .
pip install pyinstaller==6.6.0

View File

@ -60,36 +60,6 @@ services:
ipv4_address: 172.28.0.20
restart: always
agent_env:
image: ubuntu:22.04
container_name: argus-agent-env-e2e
hostname: host_abc
depends_on:
- master
- bind
environment:
- MASTER_ENDPOINT=http://master.argus.com:3000
- REPORT_INTERVAL_SECONDS=2
- AGENT_ENV=prod
- AGENT_USER=ml
- AGENT_INSTANCE=node-3
- AGENT_HOSTNAME=host_abc
- "ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}"
- "ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}"
volumes:
- ./private/argus/agent/host_abc:/private/argus/agent/host_abc
- ./private/argus/agent/host_abc/health:/private/argus/agent/host_abc/health
- ./private/argus/etc:/private/argus/etc
- ../dist/argus-agent:/usr/local/bin/argus-agent:ro
- ./scripts/agent_entrypoint.sh:/usr/local/bin/agent-entrypoint.sh:ro
- ../scripts/agent_deployment_verify.sh:/usr/local/bin/agent_deployment_verify.sh:ro
entrypoint:
- /usr/local/bin/agent-entrypoint.sh
networks:
default:
ipv4_address: 172.28.0.21
restart: always
networks:
default:
driver: bridge

View File

@ -7,10 +7,10 @@ SCRIPTS=(
"02_up.sh"
"03_wait_and_assert_registration.sh"
"04_write_health_files.sh"
"05_verify_agent.sh"
"06_assert_status_on_master.sh"
"07_restart_agent_and_reregister.sh"
"08_down.sh"
"08_verify_agent.sh"
"05_assert_status_on_master.sh"
"06_restart_agent_and_reregister.sh"
"07_down.sh"
)
for script in "${SCRIPTS[@]}"; do

View File

@ -41,7 +41,7 @@ compose() {
fi
}
docker container rm -f argus-agent-e2e argus-agent-env-e2e argus-master-agent-e2e argus-bind-agent-e2e >/dev/null 2>&1 || true
docker container rm -f argus-agent-e2e argus-master-agent-e2e argus-bind-agent-e2e >/dev/null 2>&1 || true
docker network rm tests_default >/dev/null 2>&1 || true

View File

@ -6,14 +6,11 @@ TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
TMP_ROOT="$TEST_ROOT/tmp"
API_BASE="http://localhost:32300/api/v1/master"
AGENT_HOSTNAME="dev-e2euser-e2einst-pod-0"
ENV_AGENT_HOSTNAME="host_abc"
NODE_FILE="$TEST_ROOT/private/argus/agent/$AGENT_HOSTNAME/node.json"
ENV_NODE_FILE="$TEST_ROOT/private/argus/agent/$ENV_AGENT_HOSTNAME/node.json"
mkdir -p "$TMP_ROOT"
primary_node_id=""
env_node_id=""
node_id=""
for _ in {1..30}; do
sleep 2
response=$(curl -sS "$API_BASE/nodes" || true)
@ -22,49 +19,24 @@ for _ in {1..30}; do
fi
list_file="$TMP_ROOT/nodes_list.json"
echo "$response" > "$list_file"
readarray -t node_ids < <(python3 - "$list_file" "$AGENT_HOSTNAME" "$ENV_AGENT_HOSTNAME" <<'PY'
node_id=$(python3 - "$list_file" <<'PY'
import json, sys
with open(sys.argv[1]) as handle:
nodes = json.load(handle)
target_primary = sys.argv[2]
target_env = sys.argv[3]
primary_id = ""
env_id = ""
for node in nodes:
if node.get("name") == target_primary:
primary_id = node.get("id", "")
if node.get("name") == target_env:
env_id = node.get("id", "")
print(primary_id)
print(env_id)
print(nodes[0]["id"] if nodes else "")
PY
)
primary_node_id="${node_ids[0]}"
env_node_id="${node_ids[1]}"
if [[ -n "$primary_node_id" && -n "$env_node_id" ]]; then
)
if [[ -n "$node_id" ]]; then
break
fi
done
if [[ -z "$primary_node_id" ]]; then
echo "[ERROR] Primary agent did not register within timeout" >&2
if [[ -z "$node_id" ]]; then
echo "[ERROR] Agent did not register within timeout" >&2
exit 1
fi
if [[ -z "$env_node_id" ]]; then
echo "[ERROR] Env-variable agent did not register within timeout" >&2
exit 1
fi
echo "$primary_node_id" > "$TMP_ROOT/node_id"
echo "$env_node_id" > "$TMP_ROOT/node_id_host_abc"
echo "$node_id" > "$TMP_ROOT/node_id"
if [[ ! -f "$NODE_FILE" ]]; then
echo "[ERROR] node.json not created at $NODE_FILE" >&2
@ -78,20 +50,8 @@ with open(sys.argv[1]) as handle:
assert "id" in node and node["id"], "node.json missing id"
PY
if [[ ! -f "$ENV_NODE_FILE" ]]; then
echo "[ERROR] node.json not created at $ENV_NODE_FILE" >&2
exit 1
fi
python3 - "$ENV_NODE_FILE" <<'PY'
import json, sys
with open(sys.argv[1]) as handle:
node = json.load(handle)
assert "id" in node and node["id"], "env agent node.json missing id"
PY
detail_file="$TMP_ROOT/initial_detail.json"
curl -sS "$API_BASE/nodes/$primary_node_id" -o "$detail_file"
curl -sS "$API_BASE/nodes/$node_id" -o "$detail_file"
python3 - "$detail_file" "$TMP_ROOT/initial_ip" <<'PY'
import json, sys, pathlib
with open(sys.argv[1]) as handle:
@ -102,5 +62,4 @@ if not ip:
pathlib.Path(sys.argv[2]).write_text(ip)
PY
echo "[INFO] Agent registered with node id $primary_node_id"
echo "[INFO] Env-variable agent registered with node id $env_node_id"
echo "[INFO] Agent registered with node id $node_id"

View File

@ -6,8 +6,6 @@ TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
TMP_ROOT="$TEST_ROOT/tmp"
API_BASE="http://localhost:32300/api/v1/master"
NODE_ID="$(cat "$TMP_ROOT/node_id")"
ENV_NODE_ID="$(cat "$TMP_ROOT/node_id_host_abc")"
ENV_HOSTNAME="host_abc"
NODES_JSON="$TEST_ROOT/private/argus/metric/prometheus/nodes.json"
success=false
@ -43,36 +41,13 @@ if [[ ! -f "$NODES_JSON" ]]; then
exit 1
fi
python3 - "$NODES_JSON" "$NODE_ID" "$ENV_NODE_ID" <<'PY'
python3 - "$NODES_JSON" <<'PY'
import json, sys
with open(sys.argv[1]) as handle:
nodes = json.load(handle)
expected_primary = sys.argv[2]
expected_env = sys.argv[3]
ids = {entry.get("node_id") for entry in nodes}
assert expected_primary in ids, nodes
assert expected_env in ids, nodes
assert len(nodes) >= 2, nodes
assert len(nodes) == 1, nodes
entry = nodes[0]
assert entry["node_id"], entry
PY
echo "[INFO] Master reflects agent health and nodes.json entries"
env_detail_file="$TMP_ROOT/env_agent_detail.json"
curl -sS "$API_BASE/nodes/$ENV_NODE_ID" -o "$env_detail_file"
python3 - "$env_detail_file" "$ENV_HOSTNAME" <<'PY'
import json, sys
with open(sys.argv[1]) as handle:
node = json.load(handle)
expected_name = sys.argv[2]
assert node.get("name") == expected_name, node
meta = node.get("meta_data", {})
assert meta.get("env") == "prod", meta
assert meta.get("user") == "ml", meta
assert meta.get("instance") == "node-3", meta
PY
echo "[INFO] Env-variable agent reports expected metadata"

View File

@ -1,60 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
REPO_ROOT="$(cd "$TEST_ROOT/.." && pwd)"
VERIFY_SCRIPT="$REPO_ROOT/scripts/agent_deployment_verify.sh"
ENV_NODE_ID_FILE="$TEST_ROOT/tmp/node_id_host_abc"
PRIMARY_CONTAINER="argus-agent-e2e"
ENV_CONTAINER="argus-agent-env-e2e"
PRIMARY_HOSTNAME="dev-e2euser-e2einst-pod-0"
ENV_HOSTNAME="host_abc"
if ! docker ps --format '{{.Names}}' | grep -q "^${PRIMARY_CONTAINER}$"; then
echo "[WARN] agent container not running; skip verification"
exit 0
fi
if docker exec -i "$PRIMARY_CONTAINER" bash -lc 'command -v curl >/dev/null && command -v jq >/dev/null'; then
echo "[INFO] curl/jq already installed in agent container"
else
echo "[INFO] Installing curl/jq in agent container"
docker exec -i "$PRIMARY_CONTAINER" bash -lc 'apt-get update >/dev/null 2>&1 && apt-get install -y curl jq >/dev/null 2>&1' || true
fi
if [[ ! -f "$VERIFY_SCRIPT" ]]; then
echo "[ERROR] Verification script missing at $VERIFY_SCRIPT" >&2
exit 1
fi
run_verifier() {
local container="$1" hostname="$2"
if ! docker ps --format '{{.Names}}' | grep -q "^${container}$"; then
echo "[WARN] container $container not running; skip"
return
fi
if ! docker exec -i "$container" bash -lc 'command -v /usr/local/bin/agent_deployment_verify.sh >/dev/null'; then
echo "[ERROR] /usr/local/bin/agent_deployment_verify.sh missing in $container" >&2
exit 1
fi
echo "[INFO] Running verification for $hostname in $container"
docker exec -i "$container" env VERIFY_HOSTNAME="$hostname" /usr/local/bin/agent_deployment_verify.sh
}
run_verifier "$PRIMARY_CONTAINER" "$PRIMARY_HOSTNAME"
if docker ps --format '{{.Names}}' | grep -q "^${ENV_CONTAINER}$"; then
if docker exec -i "$ENV_CONTAINER" bash -lc 'command -v curl >/dev/null && command -v jq >/dev/null'; then
echo "[INFO] curl/jq already installed in env agent container"
else
echo "[INFO] Installing curl/jq in env agent container"
docker exec -i "$ENV_CONTAINER" bash -lc 'apt-get update >/dev/null 2>&1 && apt-get install -y curl jq >/dev/null 2>&1' || true
fi
run_verifier "$ENV_CONTAINER" "$ENV_HOSTNAME"
else
echo "[WARN] env-driven agent container not running; skip secondary verification"
fi

View File

@ -6,20 +6,10 @@ TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
TMP_ROOT="$TEST_ROOT/tmp"
API_BASE="http://localhost:32300/api/v1/master"
NODE_ID="$(cat "$TMP_ROOT/node_id")"
ENV_NODE_ID_FILE="$TMP_ROOT/node_id_host_abc"
if [[ ! -f "$ENV_NODE_ID_FILE" ]]; then
echo "[ERROR] Env agent node id file missing at $ENV_NODE_ID_FILE" >&2
exit 1
fi
ENV_NODE_ID="$(cat "$ENV_NODE_ID_FILE")"
AGENT_HOSTNAME="dev-e2euser-e2einst-pod-0"
ENV_AGENT_HOSTNAME="host_abc"
NETWORK_NAME="tests_default"
NEW_AGENT_IP="172.28.0.200"
NEW_ENV_AGENT_IP="172.28.0.210"
ENTRYPOINT_SCRIPT="$SCRIPT_DIR/agent_entrypoint.sh"
VERIFY_SCRIPT="$TEST_ROOT/../scripts/agent_deployment_verify.sh"
ENV_FILE="$TEST_ROOT/.env"
# 中文提示:重启场景也需要同样的入口脚本,确保 DNS 注册逻辑一致
@ -28,11 +18,6 @@ if [[ ! -f "$ENTRYPOINT_SCRIPT" ]]; then
exit 1
fi
if [[ ! -f "$VERIFY_SCRIPT" ]]; then
echo "[ERROR] agent verification script missing at $VERIFY_SCRIPT" >&2
exit 1
fi
if [[ ! -f "$TMP_ROOT/agent_binary_path" ]]; then
echo "[ERROR] Agent binary path missing; rerun bootstrap" >&2
exit 1
@ -89,37 +74,15 @@ if [[ "$prev_ip" != "$initial_ip" ]]; then
exit 1
fi
env_before_file="$TMP_ROOT/env_before_restart.json"
curl -sS "$API_BASE/nodes/$ENV_NODE_ID" -o "$env_before_file"
env_prev_last_updated=$(python3 - "$env_before_file" <<'PY'
import json, sys
with open(sys.argv[1]) as handle:
node = json.load(handle)
print(node.get("last_updated", ""))
PY
)
env_prev_ip=$(python3 - "$env_before_file" <<'PY'
import json, sys
with open(sys.argv[1]) as handle:
node = json.load(handle)
print(node["meta_data"].get("ip", ""))
PY
)
pushd "$TEST_ROOT" >/dev/null
compose rm -sf agent
compose rm -sf agent_env
popd >/dev/null
docker container rm -f argus-agent-e2e >/dev/null 2>&1 || true
docker container rm -f argus-agent-env-e2e >/dev/null 2>&1 || true
AGENT_DIR="$TEST_ROOT/private/argus/agent/$AGENT_HOSTNAME"
HEALTH_DIR="$TEST_ROOT/private/argus/agent/$AGENT_HOSTNAME/health"
ENV_AGENT_DIR="$TEST_ROOT/private/argus/agent/$ENV_AGENT_HOSTNAME"
ENV_HEALTH_DIR="$TEST_ROOT/private/argus/agent/$ENV_AGENT_HOSTNAME/health"
# 先以 sleep 方式启动容器,确保我们掌握注册时的网络状态
if ! docker run -d \
--name argus-agent-e2e \
@ -131,7 +94,6 @@ if ! docker run -d \
-v "$TEST_ROOT/private/argus/etc:/private/argus/etc" \
-v "$AGENT_BINARY:/usr/local/bin/argus-agent:ro" \
-v "$ENTRYPOINT_SCRIPT:/usr/local/bin/agent-entrypoint.sh:ro" \
-v "$VERIFY_SCRIPT:/usr/local/bin/agent_deployment_verify.sh:ro" \
-e MASTER_ENDPOINT=http://master.argus.com:3000 \
-e REPORT_INTERVAL_SECONDS=2 \
-e ARGUS_BUILD_UID="$AGENT_UID" \
@ -179,76 +141,3 @@ if [[ "$success" != true ]]; then
fi
echo "[INFO] Agent restart produced successful re-registration with IP change"
# ---- Restart env-driven agent without metadata environment variables ----
if [[ ! -d "$ENV_AGENT_DIR" ]]; then
echo "[ERROR] Env agent data dir missing at $ENV_AGENT_DIR" >&2
exit 1
fi
if [[ ! -d "$ENV_HEALTH_DIR" ]]; then
mkdir -p "$ENV_HEALTH_DIR"
fi
if ! docker run -d \
--name argus-agent-env-e2e \
--hostname "$ENV_AGENT_HOSTNAME" \
--network "$NETWORK_NAME" \
--ip "$NEW_ENV_AGENT_IP" \
-v "$ENV_AGENT_DIR:/private/argus/agent/$ENV_AGENT_HOSTNAME" \
-v "$ENV_HEALTH_DIR:/private/argus/agent/$ENV_AGENT_HOSTNAME/health" \
-v "$TEST_ROOT/private/argus/etc:/private/argus/etc" \
-v "$AGENT_BINARY:/usr/local/bin/argus-agent:ro" \
-v "$ENTRYPOINT_SCRIPT:/usr/local/bin/agent-entrypoint.sh:ro" \
-v "$VERIFY_SCRIPT:/usr/local/bin/agent_deployment_verify.sh:ro" \
-e MASTER_ENDPOINT=http://master.argus.com:3000 \
-e REPORT_INTERVAL_SECONDS=2 \
-e ARGUS_BUILD_UID="$AGENT_UID" \
-e ARGUS_BUILD_GID="$AGENT_GID" \
--entrypoint /usr/local/bin/agent-entrypoint.sh \
ubuntu:22.04 >/dev/null; then
echo "[ERROR] Failed to start env-driven agent container without metadata env" >&2
exit 1
fi
env_success=false
env_detail_file="$TMP_ROOT/env_post_restart.json"
for _ in {1..20}; do
sleep 3
if ! curl -sS "$API_BASE/nodes/$ENV_NODE_ID" -o "$env_detail_file"; then
continue
fi
if python3 - "$env_detail_file" "$env_prev_last_updated" "$ENV_NODE_ID" "$env_prev_ip" "$NEW_ENV_AGENT_IP" <<'PY'
import json, sys
with open(sys.argv[1]) as handle:
node = json.load(handle)
prev_last_updated = sys.argv[2]
expected_id = sys.argv[3]
old_ip = sys.argv[4]
expected_ip = sys.argv[5]
last_updated = node.get("last_updated")
current_ip = node["meta_data"].get("ip")
meta = node.get("meta_data", {})
assert node["id"] == expected_id
if current_ip != expected_ip:
raise SystemExit(1)
if current_ip == old_ip:
raise SystemExit(1)
if not last_updated or last_updated == prev_last_updated:
raise SystemExit(1)
if meta.get("env") != "prod" or meta.get("user") != "ml" or meta.get("instance") != "node-3":
raise SystemExit(1)
PY
then
env_success=true
break
fi
done
if [[ "$env_success" != true ]]; then
echo "[ERROR] Env-driven agent did not reuse persisted metadata after restart" >&2
exit 1
fi
echo "[INFO] Env-driven agent restart succeeded with persisted metadata"

View File

@ -13,7 +13,7 @@ compose() {
fi
}
docker container rm -f argus-agent-e2e argus-agent-env-e2e >/dev/null 2>&1 || true
docker container rm -f argus-agent-e2e >/dev/null 2>&1 || true
pushd "$TEST_ROOT" >/dev/null
compose down --remove-orphans

View File

@ -0,0 +1,26 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
VERIFY_SCRIPT="$(cd "$TEST_ROOT/.." && pwd)/scripts/agent_deployment_verify.sh"
if ! docker ps --format '{{.Names}}' | grep -q '^argus-agent-e2e$'; then
echo "[WARN] agent container not running; skip verification"
exit 0
fi
if docker exec -i argus-agent-e2e bash -lc 'command -v curl >/dev/null && command -v jq >/dev/null'; then
echo "[INFO] curl/jq already installed in agent container"
else
echo "[INFO] Installing curl/jq in agent container"
docker exec -i argus-agent-e2e bash -lc 'apt-get update >/dev/null 2>&1 && apt-get install -y curl jq >/dev/null 2>&1' || true
fi
if docker exec -i argus-agent-e2e bash -lc 'command -v /usr/local/bin/agent_deployment_verify.sh >/dev/null'; then
docker exec -i argus-agent-e2e /usr/local/bin/agent_deployment_verify.sh
elif [[ -x "$VERIFY_SCRIPT" ]]; then
docker exec -i argus-agent-e2e "$VERIFY_SCRIPT"
else
echo "[WARN] agent_deployment_verify.sh not found"
fi

View File

@ -1,151 +0,0 @@
from __future__ import annotations
import os
import unittest
from contextlib import contextmanager
from unittest.mock import patch
from app.config import AgentConfig, load_config
@contextmanager
def temp_env(**overrides: str | None):
originals: dict[str, str | None] = {}
try:
for key, value in overrides.items():
originals[key] = os.environ.get(key)
if value is None:
os.environ.pop(key, None)
else:
os.environ[key] = value
yield
finally:
for key, original in originals.items():
if original is None:
os.environ.pop(key, None)
else:
os.environ[key] = original
class LoadConfigMetadataTests(unittest.TestCase):
@patch("app.config.Path.mkdir")
def test_metadata_from_environment_variables(self, mock_mkdir):
with temp_env(
MASTER_ENDPOINT="http://master.local",
AGENT_HOSTNAME="dev-user-one-pod",
AGENT_ENV="prod",
AGENT_USER="ops",
AGENT_INSTANCE="node-1",
):
config = load_config()
self.assertEqual(config.environment, "prod")
self.assertEqual(config.user, "ops")
self.assertEqual(config.instance, "node-1")
mock_mkdir.assert_called()
@patch("app.config.Path.mkdir")
def test_metadata_falls_back_to_hostname(self, mock_mkdir):
with temp_env(
MASTER_ENDPOINT="http://master.local",
AGENT_HOSTNAME="qa-team-abc-pod-2",
AGENT_ENV=None,
AGENT_USER=None,
AGENT_INSTANCE=None,
):
config = load_config()
self.assertEqual(config.environment, "qa")
self.assertEqual(config.user, "team")
self.assertEqual(config.instance, "abc")
mock_mkdir.assert_called()
@patch("app.config._load_metadata_from_state", return_value=("prod", "ops", "node-1"))
@patch("app.config.Path.mkdir")
def test_metadata_from_node_state(self, mock_mkdir, mock_state):
with temp_env(
MASTER_ENDPOINT="http://master.local",
AGENT_HOSTNAME="host_abc",
AGENT_ENV=None,
AGENT_USER=None,
AGENT_INSTANCE=None,
):
config = load_config()
self.assertEqual(config.environment, "prod")
self.assertEqual(config.user, "ops")
self.assertEqual(config.instance, "node-1")
mock_state.assert_called_once()
mock_mkdir.assert_called()
@patch("app.config.Path.mkdir")
def test_partial_environment_variables_fallback(self, mock_mkdir):
with temp_env(
MASTER_ENDPOINT="http://master.local",
AGENT_HOSTNAME="stage-ml-001-node",
AGENT_ENV="prod",
AGENT_USER=None,
AGENT_INSTANCE=None,
):
config = load_config()
self.assertEqual(config.environment, "stage")
self.assertEqual(config.user, "ml")
self.assertEqual(config.instance, "001")
mock_mkdir.assert_called()
@patch("app.config.Path.mkdir")
def test_invalid_hostname_raises_error(self, mock_mkdir):
with temp_env(
MASTER_ENDPOINT="http://master.local",
AGENT_HOSTNAME="invalidhostname",
AGENT_ENV=None,
AGENT_USER=None,
AGENT_INSTANCE=None,
):
with self.assertRaises(ValueError):
load_config()
mock_mkdir.assert_not_called()
class CollectMetadataTests(unittest.TestCase):
@patch("app.collector._detect_ip_address", return_value="127.0.0.1")
@patch("app.collector._detect_gpu_count", return_value=0)
@patch("app.collector._detect_memory_bytes", return_value=1024)
@patch("app.collector._detect_cpu_count", return_value=8)
def test_collect_metadata_uses_config_fields(
self,
mock_cpu,
mock_memory,
mock_gpu,
mock_ip,
):
config = AgentConfig(
hostname="dev-user-001-pod",
environment="prod",
user="ops",
instance="node-1",
node_file="/tmp/node.json",
version="1.0.0",
master_endpoint="http://master.local",
report_interval_seconds=60,
health_dir="/tmp/health",
)
from app.collector import collect_metadata
metadata = collect_metadata(config)
self.assertEqual(metadata["env"], "prod")
self.assertEqual(metadata["user"], "ops")
self.assertEqual(metadata["instance"], "node-1")
self.assertEqual(metadata["hostname"], "dev-user-001-pod")
self.assertEqual(metadata["ip"], "127.0.0.1")
self.assertEqual(metadata["cpu_number"], 8)
self.assertEqual(metadata["memory_in_bytes"], 1024)
self.assertEqual(metadata["gpu_number"], 0)
if __name__ == "__main__":
unittest.main()

View File

@ -1,31 +0,0 @@
# Alertmanager
## 构建
1. 首先设置构建和部署的环境变量, 在项目根目录下执行:
```bash
cp src/alert/tests/.env.example src/alert/tests/.env
```
然后找到复制出来的.env文件修改环境变量。
2. 使用脚本构建,在项目根目录下执行:
```bash
bash src/alert/alertmanager/build/build.sh
```
构建成功后会在项目根目录下生成argus-alertmanager-latest.tar
## 部署
提供docker-compose部署。在src/alert/tests目录下
```bash
docker-compose up -d
```
## 动态配置
配置文件放在`/private/argus/alert/alertmanager/alertmanager.yml`修改alertmanager.yml后调用`http://alertmanager.alert.argus.com:9093/-/reload`接口(POST)可以重新加载配置.
```bash
curl -X POST http://localhost:9093/-/reload
```

View File

@ -1,101 +0,0 @@
# 基于 Ubuntu 24.04
FROM ubuntu:24.04
# 切换到 root 用户
USER root
# 安装必要依赖
RUN apt-get update && \
apt-get install -y wget supervisor net-tools inetutils-ping vim ca-certificates passwd && \
apt-get clean && rm -rf /var/lib/apt/lists/*
# 设置 Alertmanager 版本(与本地离线包保持一致)
ARG ALERTMANAGER_VERSION=0.28.1
# 使用仓库内预置的离线包构建(无需联网)
COPY src/alert/alertmanager/build/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz /tmp/
RUN tar xvf /tmp/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz -C /tmp && \
mv /tmp/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64 /usr/local/alertmanager && \
rm -f /tmp/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
ENV ALERTMANAGER_BASE_PATH=/private/argus/alert/alertmanager
ARG ARGUS_BUILD_UID=2133
ARG ARGUS_BUILD_GID=2015
ENV ARGUS_BUILD_UID=${ARGUS_BUILD_UID}
ENV ARGUS_BUILD_GID=${ARGUS_BUILD_GID}
RUN mkdir -p /usr/share/alertmanager && \
mkdir -p ${ALERTMANAGER_BASE_PATH} && \
mkdir -p /private/argus/etc && \
rm -rf /alertmanager && \
ln -s ${ALERTMANAGER_BASE_PATH} /alertmanager
# 确保 ubuntu 账户存在并使用 ARGUS_BUILD_UID/GID
RUN set -eux; \
# 确保存在目标 GID 的组;若不存在则优先尝试将 ubuntu 组改为该 GID否则创建新组
if getent group "${ARGUS_BUILD_GID}" >/dev/null; then \
:; \
else \
if getent group ubuntu >/dev/null; then \
groupmod -g "${ARGUS_BUILD_GID}" ubuntu || true; \
else \
groupadd -g "${ARGUS_BUILD_GID}" ubuntu || groupadd -g "${ARGUS_BUILD_GID}" argus || true; \
fi; \
fi; \
# 创建或调整 ubuntu 用户
if id ubuntu >/dev/null 2>&1; then \
# 设置主组为目标 GID可用 GID 数字指定)
usermod -g "${ARGUS_BUILD_GID}" ubuntu || true; \
# 若目标 UID 未被占用,则更新 ubuntu 的 UID
if [ "$(id -u ubuntu)" != "${ARGUS_BUILD_UID}" ] && ! getent passwd "${ARGUS_BUILD_UID}" >/dev/null; then \
usermod -u "${ARGUS_BUILD_UID}" ubuntu || true; \
fi; \
else \
useradd -m -s /bin/bash -u "${ARGUS_BUILD_UID}" -g "${ARGUS_BUILD_GID}" ubuntu || true; \
fi; \
# 调整关键目录属主为 ubuntu UID/GID
chown -R "${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}" /usr/share/alertmanager /alertmanager ${ALERTMANAGER_BASE_PATH} /private/argus/etc /usr/local/bin || true
# 配置内网 apt 源 (如果指定了内网选项)
RUN if [ "$USE_INTRANET" = "true" ]; then \
echo "Configuring intranet apt sources..." && \
cp /etc/apt/sources.list /etc/apt/sources.list.bak && \
echo "deb [trusted=yes] http://10.68.64.1/ubuntu2204/ jammy main" > /etc/apt/sources.list && \
echo 'Acquire::https::Verify-Peer "false";' > /etc/apt/apt.conf.d/99disable-ssl-check && \
echo 'Acquire::https::Verify-Host "false";' >> /etc/apt/apt.conf.d/99disable-ssl-check; \
fi
# 配置部署时使用的 apt 源
RUN if [ "$USE_INTRANET" = "true" ]; then \
echo "deb [trusted=yes] https://10.92.132.52/mirrors/ubuntu2204/ jammy main" > /etc/apt/sources.list; \
fi
# 创建 supervisor 日志目录
RUN mkdir -p /var/log/supervisor
# 复制 supervisor 配置文件
COPY src/alert/alertmanager/build/supervisord.conf /etc/supervisor/conf.d/supervisord.conf
# 复制启动脚本
COPY src/alert/alertmanager/build/start-am-supervised.sh /usr/local/bin/start-am-supervised.sh
RUN chmod +x /usr/local/bin/start-am-supervised.sh
# 复制 Alertmanager 配置文件
COPY src/alert/alertmanager/build/alertmanager.yml /etc/alertmanager/alertmanager.yml
RUN chmod +x /etc/alertmanager/alertmanager.yml
# COPY src/alert/alertmanager/build/alertmanager.yml ${ALERTMANAGER_BASE_PATH}/alertmanager.yml
# 复制 DNS 监控脚本
COPY src/alert/alertmanager/build/dns-monitor.sh /usr/local/bin/dns-monitor.sh
RUN chmod +x /usr/local/bin/dns-monitor.sh
# 保持 root 用户,由 supervisor 控制 user 切换
USER root
# 暴露端口Alertmanager 默认端口 9093
EXPOSE 9093
# 使用 supervisor 作为入口点
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]

View File

@ -1,19 +0,0 @@
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'instance'] # 分组:相同 alertname + instance 的告警合并
group_wait: 30s # 第一个告警后,等 30s 看是否有同组告警一起发
group_interval: 5m # 同组告警变化后,至少 5 分钟再发一次
repeat_interval: 3h # 相同告警3 小时重复提醒一次
receiver: 'null'
receivers:
- name: 'null'
inhibit_rules:
- source_match:
severity: 'critical' # critical 告警存在时
target_match:
severity: 'warning' # 抑制相同 instance 的 warning 告警
equal: ['instance']

View File

@ -1,13 +0,0 @@
#!/bin/bash
set -euo pipefail
docker pull ubuntu:24.04
source src/alert/tests/.env
docker build \
--build-arg ARGUS_BUILD_UID=${ARGUS_BUILD_UID} \
--build-arg ARGUS_BUILD_GID=${ARGUS_BUILD_GID} \
-f src/alert/alertmanager/build/Dockerfile \
-t argus-alertmanager:latest .
docker save -o argus-alertmanager-latest.tar argus-alertmanager:latest

View File

@ -1,68 +0,0 @@
#!/bin/bash
# DNS监控脚本 - 每10秒检查dns.conf是否有变化
# 如果有变化则执行update-dns.sh脚本
DNS_CONF="/private/argus/etc/dns.conf"
DNS_BACKUP="/tmp/dns.conf.backup"
UPDATE_SCRIPT="/private/argus/etc/update-dns.sh"
LOG_FILE="/var/log/supervisor/dns-monitor.log"
# 确保日志文件存在
touch "$LOG_FILE"
log_message() {
echo "$(date '+%Y-%m-%d %H:%M:%S') [DNS-Monitor] $1" >> "$LOG_FILE"
}
log_message "DNS监控脚本启动"
while true; do
if [ -f "$DNS_CONF" ]; then
if [ -f "$DNS_BACKUP" ]; then
# 比较文件内容
if ! cmp -s "$DNS_CONF" "$DNS_BACKUP"; then
log_message "检测到DNS配置变化"
# 更新备份文件
cp "$DNS_CONF" "$DNS_BACKUP"
# 执行更新脚本
if [ -x "$UPDATE_SCRIPT" ]; then
log_message "执行DNS更新脚本: $UPDATE_SCRIPT"
"$UPDATE_SCRIPT" >> "$LOG_FILE" 2>&1
if [ $? -eq 0 ]; then
log_message "DNS更新脚本执行成功"
else
log_message "DNS更新脚本执行失败"
fi
else
log_message "警告: 更新脚本不存在或不可执行: $UPDATE_SCRIPT"
fi
fi
else
# 第一次检测到配置文件,执行更新脚本
if [ -x "$UPDATE_SCRIPT" ]; then
log_message "执行DNS更新脚本: $UPDATE_SCRIPT"
"$UPDATE_SCRIPT" >> "$LOG_FILE" 2>&1
if [ $? -eq 0 ]; then
log_message "DNS更新脚本执行成功"
# 第一次运行,创建备份并执行更新
cp "$DNS_CONF" "$DNS_BACKUP"
log_message "创建DNS配置备份文件"
else
log_message "DNS更新脚本执行失败"
fi
else
log_message "警告: 更新脚本不存在或不可执行: $UPDATE_SCRIPT"
fi
fi
else
log_message "警告: DNS配置文件不存在: $DNS_CONF"
fi
sleep 10
done

View File

@ -1,22 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# 下载 Alertmanager 离线安装包到本目录,用于 Docker 构建时 COPY
# 用法:
# ./fetch-dist.sh [version]
# 示例:
# ./fetch-dist.sh 0.28.1
VER="${1:-0.28.1}"
OUT="alertmanager-${VER}.linux-amd64.tar.gz"
URL="https://github.com/prometheus/alertmanager/releases/download/v${VER}/${OUT}"
if [[ -f "$OUT" ]]; then
echo "[INFO] $OUT already exists, skip download"
exit 0
fi
echo "[INFO] Downloading $URL"
curl -fL --retry 3 --connect-timeout 10 -o "$OUT" "$URL"
echo "[OK] Saved to $(pwd)/$OUT"

View File

@ -1,25 +0,0 @@
#!/bin/bash
set -euo pipefail
echo "[INFO] Starting Alertmanager under supervisor..."
ALERTMANAGER_BASE_PATH=${ALERTMANAGER_BASE_PATH:-/private/argus/alert/alertmanager}
echo "[INFO] Alertmanager base path: ${ALERTMANAGER_BASE_PATH}"
# 使用容器内的 /etc/alertmanager/alertmanager.yml 作为配置文件,避免写入挂载卷导致的权限问题
echo "[INFO] Using /etc/alertmanager/alertmanager.yml as configuration"
# 记录容器 IP 地址
DOMAIN=alertmanager.alert.argus.com
IP=$(ifconfig | grep -A 1 eth0 | grep inet | awk '{print $2}')
echo "current IP: ${IP}"
echo "${IP}" > /private/argus/etc/${DOMAIN}
chmod 755 /private/argus/etc/${DOMAIN}
echo "[INFO] Starting Alertmanager process..."
# 启动 Alertmanager 主进程
exec /usr/local/alertmanager/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/alertmanager --cluster.listen-address=""

View File

@ -1,39 +0,0 @@
[supervisord]
nodaemon=true
logfile=/var/log/supervisor/supervisord.log
pidfile=/var/run/supervisord.pid
user=root
[program:alertmanager]
command=/usr/local/bin/start-am-supervised.sh
user=ubuntu
stdout_logfile=/var/log/supervisor/alertmanager.log
stderr_logfile=/var/log/supervisor/alertmanager_error.log
autorestart=true
startretries=3
startsecs=10
stopwaitsecs=20
killasgroup=true
stopasgroup=true
[program:dns-monitor]
command=/usr/local/bin/dns-monitor.sh
user=root
stdout_logfile=/var/log/supervisor/dns-monitor.log
stderr_logfile=/var/log/supervisor/dns-monitor_error.log
autorestart=true
startretries=3
startsecs=5
stopwaitsecs=10
killasgroup=true
stopasgroup=true
[unix_http_server]
file=/var/run/supervisor.sock
chmod=0700
[supervisorctl]
serverurl=unix:///var/run/supervisor.sock
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

View File

@ -1,60 +0,0 @@
# 告警配置
> 参考:[自定义Prometheus告警规则](https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/alert/prometheus-alert-rule)
在Prometheus中配置告警的有两个步骤
1. 写告警规则文件rules文件
2. 在promethues.yml里加载规则并配置Alertmanager
## 1. 编写告警规则文件
告警规则如下:
```yml
groups:
- name: example-rules
interval: 30s # 每30秒评估一次
rules:
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "实例 {{ $labels.instance }} 已宕机"
description: "{{ $labels.instance }} 在 {{ $labels.job }} 中无响应超过 1 分钟。"
- alert: HighCpuUsage
expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "CPU 使用率过高"
description: "实例 {{ $labels.instance }} CPU 使用率超过 80% 持续 5 分钟。"
```
其中:
- `alert`:告警规则的名称。
- `expr`基于PromQL表达式告警触发条件用于计算是否有时间序列满足该条件。
- `for`评估等待时间可选参数。用于表示只有当触发条件持续一段时间后才发送告警。在等待期间新产生告警的状态为pending。
- `labels`自定义标签允许用户指定要附加到告警上的一组附加标签可以在Alertmanager中做路由和分组。
- `annotations`用于指定一组附加信息比如用于描述告警详细信息的文字等annotations的内容在告警产生时会一同作为参数发送到Alertmanager。可以提供告警摘要和详细信息。
## 2. promothues.yml里引用
在prometheus.yml中加上`rule_files``alerting`:
```yml
global:
[ evaluation_interval: <duration> | default = 1m ]
rule_files:
[ - <filepath_glob> ... ]
alerting:
alertmanagers:
- static_configs:
- targets:
- "alertmanager.alert.argus.com:9093" # Alertmanager 地址
```

View File

@ -1,37 +0,0 @@
groups:
- name: example-rules
interval: 30s # 每30秒评估一次
rules:
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "实例 {{ $labels.instance }} 已宕机"
description: "{{ $labels.instance }} 在 {{ $labels.job }} 中无响应超过 1 分钟。"
- alert: HighCpuUsage
expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "CPU 使用率过高"
description: "实例 {{ $labels.instance }} CPU 使用率超过 80% 持续 5 分钟。"
- alert: HighMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "内存使用率过高"
description: "实例 {{ $labels.instance }} 内存使用率超过 80% 持续 5 分钟。"
- alert: DiskSpaceLow
expr: (node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} - node_filesystem_free_bytes{fstype!~"tmpfs|overlay"}) / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100 > 90
for: 10m
labels:
severity: warning
annotations:
summary: "磁盘空间不足"
description: "实例 {{ $labels.instance }} 磁盘空间不足超过 90% 持续 10 分钟。"

View File

@ -1,5 +0,0 @@
DATA_ROOT=/home/argus/tmp/private/argus
ARGUS_BUILD_UID=1048
ARGUS_BUILD_GID=1048
USE_INTRANET=false

View File

@ -1,5 +0,0 @@
DATA_ROOT=/home/argus/tmp/private/argus
ARGUS_BUILD_UID=1048
ARGUS_BUILD_GID=1048
USE_INTRANET=false

View File

@ -1,37 +0,0 @@
services:
alertmanager:
build:
context: ../../../
dockerfile: src/alert/alertmanager/build/Dockerfile
args:
ARGUS_BUILD_UID: ${ARGUS_BUILD_UID:-2133}
ARGUS_BUILD_GID: ${ARGUS_BUILD_GID:-2015}
USE_INTRANET: ${USE_INTRANET:-false}
image: argus-alertmanager:latest
container_name: argus-alertmanager
environment:
- ALERTMANAGER_BASE_PATH=/private/argus/alert/alertmanager
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
ports:
- "${ARGUS_PORT:-9093}:9093"
volumes:
- ${DATA_ROOT:-./data}/alert/alertmanager:/private/argus/alert/alertmanager
- ${DATA_ROOT:-./data}/etc:/private/argus/etc
networks:
- argus-debug-net
restart: unless-stopped
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "3"
networks:
argus-debug-net:
driver: bridge
name: argus-debug-net
volumes:
alertmanager_data:
driver: local

View File

@ -1,113 +0,0 @@
#!/bin/bash
# verify_alertmanager.sh
# 用于部署后验证 Prometheus 与 Alertmanager 通信链路是否正常
set -euo pipefail
#=============================
# 基础配置
#=============================
PROM_URL="${PROM_URL:-http://prom.metric.argus.com:9090}"
ALERT_URL="${ALERT_URL:-http://alertmanager.alert.argus.com:9093}"
# TODO: 根据实际部署环境调整规则目录
DATA_ROOT="${DATA_ROOT:-/private/argus}"
RULE_DIR = "$DATA_ROOT/metric/prometheus/rules"
TMP_RULE="/tmp/test_rule.yml"
#=============================
# 辅助函数
#=============================
GREEN="\033[32m"; RED="\033[31m"; YELLOW="\033[33m"; RESET="\033[0m"
log_info() { echo -e "${YELLOW}[INFO]${RESET} $1"; }
log_success() { echo -e "${GREEN}[OK]${RESET} $1"; }
log_error() { echo -e "${RED}[ERROR]${RESET} $1"; }
fail_exit() { log_error "$1"; exit 1; }
#=============================
# Step 1: 检查 Alertmanager 是否可访问
#=============================
log_info "检查 Alertmanager 状态..."
if curl -sSf "${ALERT_URL}/api/v2/status" >/dev/null 2>&1; then
log_success "Alertmanager 服务正常 (${ALERT_URL})"
else
fail_exit "无法访问 Alertmanager请检查端口映射与容器状态。"
fi
#=============================
# Step 2: 手动发送测试告警
#=============================
log_info "发送手动测试告警..."
curl -s -XPOST "${ALERT_URL}/api/v2/alerts" -H "Content-Type: application/json" -d '[
{
"labels": {
"alertname": "ManualTestAlert",
"severity": "info"
},
"annotations": {
"summary": "This is a test alert from deploy verification"
},
"startsAt": "'$(date -Iseconds)'"
}
]' >/dev/null && log_success "测试告警已成功发送到 Alertmanager"
#=============================
# Step 3: 检查 Prometheus 配置中是否包含 Alertmanager
#=============================
log_info "检查 Prometheus 是否配置了 Alertmanager..."
if curl -s "${PROM_URL}/api/v1/status/config" | grep -q "alertmanagers"; then
log_success "Prometheus 已配置 Alertmanager 目标"
else
fail_exit "Prometheus 未配置 Alertmanager请检查 prometheus.yml"
fi
#=============================
# Step 4: 创建并加载测试告警规则
#=============================
log_info "创建临时测试规则 ${TMP_RULE} ..."
cat <<EOF > "${TMP_RULE}"
groups:
- name: deploy-verify-group
rules:
- alert: DeployVerifyAlert
expr: vector(1)
labels:
severity: warning
annotations:
summary: "Deployment verification alert"
EOF
mkdir -p "${RULE_DIR}"
cp "${TMP_RULE}" "${RULE_DIR}/test_rule.yml"
log_info "重载 Prometheus 以加载新规则..."
if curl -s -X POST "${PROM_URL}/-/reload" >/dev/null; then
log_success "Prometheus 已重载规则"
else
fail_exit "Prometheus reload 失败,请检查 API 可访问性。"
fi
#=============================
# Step 5: 等待并验证 Alertmanager 是否收到告警
#=============================
log_info "等待告警触发 (约5秒)..."
sleep 5
if curl -s "${ALERT_URL}/api/v2/alerts" | grep -q "DeployVerifyAlert"; then
log_success "Prometheus → Alertmanager 告警链路验证成功"
else
fail_exit "未在 Alertmanager 中检测到 DeployVerifyAlert请检查网络或配置。"
fi
#=============================
# Step 6: 清理测试规则
#=============================
log_info "清理临时测试规则..."
rm -f "${RULE_DIR}/test_rule.yml" "${TMP_RULE}"
curl -s -X POST "${PROM_URL}/-/reload" >/dev/null \
&& log_success "Prometheus 已清理验证规则" \
|| log_error "Prometheus reload 清理失败,请手动确认。"
log_success "部署验证全部通过Prometheus ↔ Alertmanager 通信正常。"

View File

@ -26,7 +26,6 @@ RUN apt-get update && \
apt-get install -y \
bind9 \
bind9utils \
dnsutils \
bind9-doc \
supervisor \
net-tools \

View File

@ -17,9 +17,6 @@ log_message() {
log_message "DNS监控脚本启动"
log_message "删除DNS备份文件如果存在"
rm -f $DNS_BACKUP
while true; do
if [ -f "$DNS_CONF" ]; then
if [ -f "$DNS_BACKUP" ]; then

View File

@ -1 +0,0 @@
.build*/

View File

@ -1,33 +0,0 @@
FROM ubuntu:22.04
ARG ARGUS_BUILD_UID=2133
ARG ARGUS_BUILD_GID=2015
ENV DEBIAN_FRONTEND=noninteractive \
TZ=Asia/Shanghai \
ARGUS_LOGS_WORLD_WRITABLE=1
RUN set -eux; \
apt-get update; \
apt-get install -y --no-install-recommends \
ca-certificates curl wget iproute2 iputils-ping net-tools jq tzdata \
cron procps supervisor vim less tar gzip python3; \
rm -rf /var/lib/apt/lists/*; \
ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
WORKDIR /
# Offline fluent-bit assets and bundle tarball are staged by the build script
COPY node-bootstrap.sh /usr/local/bin/node-bootstrap.sh
COPY health-watcher.sh /usr/local/bin/health-watcher.sh
COPY private/start-fluent-bit.sh /private/start-fluent-bit.sh
COPY private/etc /private/etc
COPY private/packages /private/packages
COPY bundle/ /bundle/
RUN chmod +x /usr/local/bin/node-bootstrap.sh /usr/local/bin/health-watcher.sh /private/start-fluent-bit.sh || true; \
mkdir -p /logs/train /logs/infer /buffers /opt/argus-metric; \
if [ "${ARGUS_LOGS_WORLD_WRITABLE}" = "1" ]; then chmod 1777 /logs/train /logs/infer || true; else chmod 755 /logs/train /logs/infer || true; fi; \
chmod 770 /buffers || true
ENTRYPOINT ["/usr/local/bin/node-bootstrap.sh"]

View File

@ -1,59 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# health-watcher.sh (CPU node bundle)
# 周期执行 check_health.sh 与 restart_unhealthy.sh用于节点容器内自愈。
INSTALL_ROOT="/opt/argus-metric"
INTERVAL="${HEALTH_WATCH_INTERVAL:-60}"
VER_DIR="${1:-}"
log(){ echo "[HEALTH-WATCHER] $*"; }
resolve_ver_dir() {
local dir=""
if [[ -n "${VER_DIR:-}" && -d "$VER_DIR" ]]; then
dir="$VER_DIR"
elif [[ -L "$INSTALL_ROOT/current" ]]; then
dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
fi
if [[ -z "$dir" ]]; then
dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
fi
echo "$dir"
}
main() {
log "starting with interval=${INTERVAL}s"
local dir
dir="$(resolve_ver_dir)"
if [[ -z "$dir" || ! -d "$dir" ]]; then
log "no valid install dir found under $INSTALL_ROOT; exiting"
exit 0
fi
local chk="$dir/check_health.sh"
local rst="$dir/restart_unhealthy.sh"
if [[ ! -x "$chk" && ! -x "$rst" ]]; then
log "neither check_health.sh nor restart_unhealthy.sh is executable under $dir; exiting"
exit 0
fi
log "watching install dir: $dir"
while :; do
if [[ -x "$chk" ]]; then
log "running check_health.sh"
"$chk" >> "$dir/.health_check.watch.log" 2>&1 || log "check_health.sh reported issues (see .health_check.watch.log)"
fi
if [[ -x "$rst" ]]; then
log "running restart_unhealthy.sh"
"$rst" >> "$dir/.restart.watch.log" 2>&1 || log "restart_unhealthy.sh reported issues (see .restart.watch.log)"
fi
sleep "$INTERVAL"
done
}
main "$@"

View File

@ -1,131 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
echo "[BOOT] CPU node bundle starting"
INSTALL_ROOT="/opt/argus-metric"
BUNDLE_DIR="/bundle"
STATE_DIR_BASE="/private/argus/agent"
mkdir -p "$INSTALL_ROOT" "$STATE_DIR_BASE" /logs/train /logs/infer /buffers || true
# Ensure world-writable logs dir with sticky bit (align with deployment_new policy)
if [[ "${ARGUS_LOGS_WORLD_WRITABLE:-1}" == "1" ]]; then
chmod 1777 /logs/train /logs/infer || true
else
chmod 755 /logs/train /logs/infer || true
fi
chmod 770 /buffers || true
installed_ok=0
# 1) already installed?
if [[ -L "$INSTALL_ROOT/current" && -d "$INSTALL_ROOT/current" ]]; then
echo "[BOOT] client already installed at $INSTALL_ROOT/current"
else
# 2) try local bundle first (argus-metric_*.tar.gz)
tarball=$(ls -1 "$BUNDLE_DIR"/argus-metric_*.tar.gz 2>/dev/null | head -1 || true)
if [[ -n "${tarball:-}" ]]; then
echo "[BOOT] installing from local bundle: $(basename "$tarball")"
tmp=$(mktemp -d)
tar -xzf "$tarball" -C "$tmp"
# locate root containing version.json
root="$tmp"
if [[ ! -f "$root/version.json" ]]; then
sub=$(find "$tmp" -mindepth 1 -maxdepth 1 -type d | head -n1 || true)
[[ -n "$sub" && -f "$sub/version.json" ]] && root="$sub"
fi
if [[ ! -f "$root/version.json" ]]; then
echo "[BOOT][WARN] version.json not found in bundle; fallback to FTP"
else
ver=$(sed -n 's/.*"version"\s*:\s*"\([^"]\+\)".*/\1/p' "$root/version.json" | head -n1)
if [[ -z "$ver" ]]; then
echo "[BOOT][WARN] failed to parse version from version.json; fallback to FTP"
else
target_root="$INSTALL_ROOT"
version_dir="$target_root/versions/$ver"
mkdir -p "$version_dir"
shopt -s dotglob
mv "$root"/* "$version_dir/" 2>/dev/null || true
shopt -u dotglob
if [[ -f "$version_dir/install.sh" ]]; then
chmod +x "$version_dir/install.sh" 2>/dev/null || true
(
export AUTO_START_DCGM="0" # N/A on CPU
cd "$version_dir" && ./install.sh "$version_dir"
)
echo "$ver" > "$target_root/LATEST_VERSION" 2>/dev/null || true
ln -sfn "$version_dir" "$target_root/current" 2>/dev/null || true
if [[ -L "$target_root/current" && -d "$target_root/current" ]]; then
installed_ok=1
echo "[BOOT] local bundle install OK: version=$ver"
else
echo "[BOOT][WARN] current symlink not present after install; will rely on healthcheck to confirm"
fi
else
echo "[BOOT][WARN] install.sh missing under $version_dir; fallback to FTP"
fi
fi
fi
fi
# 3) fallback: use FTP setup if not installed
if [[ ! -L "$INSTALL_ROOT/current" && "$installed_ok" -eq 0 ]]; then
echo "[BOOT] fallback to FTP setup"
if [[ -z "${FTPIP:-}" || -z "${FTP_USER:-}" || -z "${FTP_PASSWORD:-}" ]]; then
echo "[BOOT][ERROR] FTP variables not set (FTPIP/FTP_USER/FTP_PASSWORD)" >&2
exit 1
fi
curl -u "$FTP_USER:$FTP_PASSWORD" -fsSL "ftp://$FTPIP:21/setup.sh" -o /tmp/setup.sh
chmod +x /tmp/setup.sh
/tmp/setup.sh --server "$FTPIP" --user "$FTP_USER" --password "$FTP_PASSWORD" --port 21
fi
fi
# 4) ensure argus-agent is running (best-effort)
if ! pgrep -x argus-agent >/dev/null 2>&1; then
echo "[BOOT] starting argus-agent (not detected)"
setsid /usr/local/bin/argus-agent >/var/log/argus-agent.log 2>&1 < /dev/null &
fi
# 5) post-install selfcheck and state
ver_dir=""
if [[ -L "$INSTALL_ROOT/current" ]]; then
ver_dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
fi
if [[ -z "$ver_dir" ]]; then
ver_dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
fi
if [[ -n "$ver_dir" && -x "$ver_dir/check_health.sh" ]]; then
echo "[BOOT] running initial health check: $ver_dir/check_health.sh"
if "$ver_dir/check_health.sh" >> "$ver_dir/.health_check.init.log" 2>&1; then
echo "[BOOT] initial health check completed (see $ver_dir/.health_check.init.log)"
else
echo "[BOOT][WARN] initial health check reported issues (see $ver_dir/.health_check.init.log)"
fi
else
echo "[BOOT][WARN] initial health check skipped (script missing: $ver_dir/check_health.sh)"
fi
host="$(hostname)"
state_dir="$STATE_DIR_BASE/${host}"
mkdir -p "$state_dir" 2>/dev/null || true
for i in {1..60}; do
if [[ -s "$state_dir/node.json" ]]; then
echo "[BOOT] node state present: $state_dir/node.json"
break
fi
sleep 2
done
# 6) spawn health watcher (best-effort, non-blocking)
if command -v /usr/local/bin/health-watcher.sh >/dev/null 2>&1; then
echo "[BOOT] starting health watcher for $ver_dir"
setsid /usr/local/bin/health-watcher.sh "${ver_dir:-}" >/var/log/health-watcher.log 2>&1 < /dev/null || true &
else
echo "[BOOT][WARN] health-watcher.sh not found; skip health watcher"
fi
echo "[BOOT] ready; entering sleep"
exec sleep infinity

View File

@ -1 +0,0 @@
.build*/

View File

@ -1,44 +0,0 @@
ARG CUDA_VER=12.2.2
FROM nvidia/cuda:${CUDA_VER}-runtime-ubuntu22.04
ARG CLIENT_VER=0.0.0
ARG BUNDLE_DATE=00000000
LABEL org.opencontainers.image.title="argus-sys-metric-test-node-bundle-gpu" \
org.opencontainers.image.description="GPU node bundle with embedded Argus client artifact" \
org.opencontainers.image.version="${CLIENT_VER}" \
org.opencontainers.image.revision_date="${BUNDLE_DATE}" \
maintainer="Argus"
ENV DEBIAN_FRONTEND=noninteractive \
TZ=Asia/Shanghai \
ARGUS_LOGS_WORLD_WRITABLE=1 \
ES_HOST=es.log.argus.com \
ES_PORT=9200 \
CLUSTER=local \
RACK=dev
RUN set -eux; \
apt-get update; \
apt-get install -y --no-install-recommends \
ca-certificates curl wget iproute2 iputils-ping net-tools jq tzdata cron procps vim less \
tar gzip; \
rm -rf /var/lib/apt/lists/*; \
ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
WORKDIR /
# Expect staged build context to provide these directories/files
COPY bundle/ /bundle/
COPY node-bootstrap.sh /usr/local/bin/node-bootstrap.sh
COPY health-watcher.sh /usr/local/bin/health-watcher.sh
COPY private/start-fluent-bit.sh /private/start-fluent-bit.sh
COPY private/etc /private/etc
COPY private/packages /private/packages
RUN chmod +x /usr/local/bin/node-bootstrap.sh /usr/local/bin/health-watcher.sh /private/start-fluent-bit.sh || true; \
mkdir -p /logs/train /logs/infer /buffers /opt/argus-metric; \
chmod 1777 /logs/train /logs/infer || true; \
chmod 770 /buffers || true
ENTRYPOINT ["/usr/local/bin/node-bootstrap.sh"]

View File

@ -1,59 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# health-watcher.sh (GPU bundle)
# 周期执行 check_health.sh 与 restart_unhealthy.sh用于 GPU 节点容器内自愈。
INSTALL_ROOT="/opt/argus-metric"
INTERVAL="${HEALTH_WATCH_INTERVAL:-60}"
VER_DIR="${1:-}"
log(){ echo "[HEALTH-WATCHER] $*"; }
resolve_ver_dir() {
local dir=""
if [[ -n "${VER_DIR:-}" && -d "$VER_DIR" ]]; then
dir="$VER_DIR"
elif [[ -L "$INSTALL_ROOT/current" ]]; then
dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
fi
if [[ -z "$dir" ]]; then
dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
fi
echo "$dir"
}
main() {
log "starting with interval=${INTERVAL}s"
local dir
dir="$(resolve_ver_dir)"
if [[ -z "$dir" || ! -d "$dir" ]]; then
log "no valid install dir found under $INSTALL_ROOT; exiting"
exit 0
fi
local chk="$dir/check_health.sh"
local rst="$dir/restart_unhealthy.sh"
if [[ ! -x "$chk" && ! -x "$rst" ]]; then
log "neither check_health.sh nor restart_unhealthy.sh is executable under $dir; exiting"
exit 0
fi
log "watching install dir: $dir"
while :; do
if [[ -x "$chk" ]]; then
log "running check_health.sh"
"$chk" >> "$dir/.health_check.watch.log" 2>&1 || log "check_health.sh reported issues (see .health_check.watch.log)"
fi
if [[ -x "$rst" ]]; then
log "running restart_unhealthy.sh"
"$rst" >> "$dir/.restart.watch.log" 2>&1 || log "restart_unhealthy.sh reported issues (see .restart.watch.log)"
fi
sleep "$INTERVAL"
done
}
main "$@"

View File

@ -1,135 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
echo "[BOOT] GPU node bundle starting"
INSTALL_ROOT="/opt/argus-metric"
BUNDLE_DIR="/bundle"
STATE_DIR_BASE="/private/argus/agent"
mkdir -p "$INSTALL_ROOT" "$STATE_DIR_BASE" /logs/train /logs/infer /buffers || true
# Ensure world-writable logs dir with sticky bit (align with deployment_new policy)
if [[ "${ARGUS_LOGS_WORLD_WRITABLE:-1}" == "1" ]]; then
chmod 1777 /logs/train /logs/infer || true
else
chmod 755 /logs/train /logs/infer || true
fi
chmod 770 /buffers || true
installed_ok=0
# 1) already installed?
if [[ -L "$INSTALL_ROOT/current" && -d "$INSTALL_ROOT/current" ]]; then
echo "[BOOT] client already installed at $INSTALL_ROOT/current"
else
# 2) try local bundle first (argus-metric_*.tar.gz)
tarball=$(ls -1 "$BUNDLE_DIR"/argus-metric_*.tar.gz 2>/dev/null | head -1 || true)
if [[ -n "${tarball:-}" ]]; then
echo "[BOOT] installing from local bundle: $(basename "$tarball")"
tmp=$(mktemp -d)
tar -xzf "$tarball" -C "$tmp"
# locate root containing version.json
root="$tmp"
if [[ ! -f "$root/version.json" ]]; then
sub=$(find "$tmp" -mindepth 1 -maxdepth 1 -type d | head -n1 || true)
[[ -n "$sub" && -f "$sub/version.json" ]] && root="$sub"
fi
if [[ ! -f "$root/version.json" ]]; then
echo "[BOOT][WARN] version.json not found in bundle; fallback to FTP"
else
ver=$(sed -n 's/.*"version"\s*:\s*"\([^"]\+\)".*/\1/p' "$root/version.json" | head -n1)
if [[ -z "$ver" ]]; then
echo "[BOOT][WARN] failed to parse version from version.json; fallback to FTP"
else
target_root="$INSTALL_ROOT"
version_dir="$target_root/versions/$ver"
mkdir -p "$version_dir"
shopt -s dotglob
mv "$root"/* "$version_dir/" 2>/dev/null || true
shopt -u dotglob
if [[ -f "$version_dir/install.sh" ]]; then
chmod +x "$version_dir/install.sh" 2>/dev/null || true
(
export AUTO_START_DCGM="${AUTO_START_DCGM:-1}"
export DCGM_EXPORTER_DISABLE_PROFILING="${DCGM_EXPORTER_DISABLE_PROFILING:-1}"
export DCGM_EXPORTER_LISTEN="${DCGM_EXPORTER_LISTEN:-:9400}"
cd "$version_dir" && ./install.sh "$version_dir"
)
echo "$ver" > "$target_root/LATEST_VERSION" 2>/dev/null || true
ln -sfn "$version_dir" "$target_root/current" 2>/dev/null || true
if [[ -L "$target_root/current" && -d "$target_root/current" ]]; then
installed_ok=1
echo "[BOOT] local bundle install OK: version=$ver"
else
echo "[BOOT][WARN] current symlink not present after install; will rely on healthcheck to confirm"
fi
else
echo "[BOOT][WARN] install.sh missing under $version_dir; fallback to FTP"
fi
fi
fi
fi
# 3) fallback: use FTP setup if not installed
if [[ ! -L "$INSTALL_ROOT/current" && "$installed_ok" -eq 0 ]]; then
echo "[BOOT] fallback to FTP setup"
if [[ -z "${FTPIP:-}" || -z "${FTP_USER:-}" || -z "${FTP_PASSWORD:-}" ]]; then
echo "[BOOT][ERROR] FTP variables not set (FTPIP/FTP_USER/FTP_PASSWORD)" >&2
exit 1
fi
curl -u "$FTP_USER:$FTP_PASSWORD" -fsSL "ftp://$FTPIP:21/setup.sh" -o /tmp/setup.sh
chmod +x /tmp/setup.sh
/tmp/setup.sh --server "$FTPIP" --user "$FTP_USER" --password "$FTP_PASSWORD" --port 21
fi
fi
# 4) ensure argus-agent is running (best-effort)
if ! pgrep -x argus-agent >/dev/null 2>&1; then
echo "[BOOT] starting argus-agent (not detected)"
setsid /usr/local/bin/argus-agent >/var/log/argus-agent.log 2>&1 < /dev/null &
fi
# 5) post-install selfcheck (run once) and state
# prefer current version dir; fallback to first version under /opt/argus-metric/versions
ver_dir=""
if [[ -L "$INSTALL_ROOT/current" ]]; then
ver_dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
fi
if [[ -z "$ver_dir" ]]; then
# pick the latest by name (semver-like); best-effort
ver_dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
fi
if [[ -n "$ver_dir" && -x "$ver_dir/check_health.sh" ]]; then
echo "[BOOT] running initial health check: $ver_dir/check_health.sh"
if "$ver_dir/check_health.sh" >> "$ver_dir/.health_check.init.log" 2>&1; then
echo "[BOOT] initial health check completed (see $ver_dir/.health_check.init.log)"
else
echo "[BOOT][WARN] initial health check reported issues (see $ver_dir/.health_check.init.log)"
fi
else
echo "[BOOT][WARN] initial health check skipped (script missing: $ver_dir/check_health.sh)"
fi
host="$(hostname)"
state_dir="$STATE_DIR_BASE/${host}"
mkdir -p "$state_dir" 2>/dev/null || true
for i in {1..60}; do
if [[ -s "$state_dir/node.json" ]]; then
echo "[BOOT] node state present: $state_dir/node.json"
break
fi
sleep 2
done
# 6) spawn health watcher (best-effort, non-blocking)
if command -v /usr/local/bin/health-watcher.sh >/dev/null 2>&1; then
echo "[BOOT] starting health watcher for $ver_dir"
setsid /usr/local/bin/health-watcher.sh "${ver_dir:-}" >/var/log/health-watcher.log 2>&1 < /dev/null || true &
else
echo "[BOOT][WARN] health-watcher.sh not found; skip health watcher"
fi
echo "[BOOT] ready; entering sleep"
exec sleep infinity

View File

@ -22,7 +22,8 @@
[PARSER]
Name timestamp_parser
Format regex
Regex ^(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:Z|[+-]\d{2}:?\d{2}))\s+(?<level>\w+)\s+(?<message>.*)$
Regex ^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?<level>\w+)\s+(?<message>.*)$
Time_Key timestamp
Time_Format %Y-%m-%dT%H:%M:%S%z
Time_Format %Y-%m-%d %H:%M:%S
Time_Offset +0800
Time_Keep On

View File

@ -1,109 +1,47 @@
#!/bin/bash
set -euo pipefail
echo "[INFO] Starting Fluent Bit setup in Ubuntu container (offline-first)..."
echo "[INFO] Starting Fluent Bit setup in Ubuntu container..."
# 安装必要的工具
echo "[INFO] Installing required packages..."
export DEBIAN_FRONTEND=noninteractive
apt-get update -qq
apt-get install -y -qq curl
# Stage bundle to /tmp (read-only mount under /private)
echo "[INFO] Staging fluent-bit bundle..."
rm -rf /tmp/flb && mkdir -p /tmp/flb
cp -r /private/etc /tmp/flb/
mkdir -p /tmp/flb/packages
cp -r /private/packages/* /tmp/flb/packages/ 2>/dev/null || true
# 解压bundle到/tmp
echo "[INFO] Extracting fluent-bit bundle..."
cp -r /private/etc /tmp
cp -r /private/packages /tmp
cd /tmp
# Helper: check and install a local deb if not already satisfied
ensure_lib() {
local soname="$1"; shift
local pattern="$1"; shift
if ldconfig -p 2>/dev/null | grep -q "$soname"; then
echo "[OK] $soname already present"
return 0
fi
local deb="$(ls /tmp/flb/packages/$pattern 2>/dev/null | head -n1 || true)"
if [[ -n "$deb" ]]; then
echo "[INFO] Installing local dependency: $(basename "$deb")"
dpkg -i "$deb" >/dev/null 2>&1 || true
else
echo "[WARN] Local deb for $soname not found (pattern=$pattern)"
fi
if ! ldconfig -p 2>/dev/null | grep -q "$soname"; then
echo "[WARN] $soname still missing after local install; attempting apt fallback"
apt-get update -qq || true
case "$soname" in
libpq.so.5) apt-get install -y -qq libpq5 || true ;;
libyaml-0.so.2) apt-get install -y -qq libyaml-0-2 || true ;;
esac
fi
ldconfig 2>/dev/null || true
}
# Offline-first: satisfy runtime deps from local debs, fallback to apt only if necessary
ensure_lib "libpq.so.5" "libpq5_*_amd64.deb"
ensure_lib "libyaml-0.so.2" "libyaml-0-2_*_amd64.deb"
ensure_lib "libsasl2.so.2" "libsasl2-2_*_amd64.deb"
ensure_lib "libldap-2.5.so.0" "libldap-2.5-0_*_amd64.deb"
# Install fluent-bit main package from local bundle
FLB_DEB="$(ls /tmp/flb/packages/fluent-bit_*_amd64.deb 2>/dev/null | head -n1 || true)"
if [[ -z "$FLB_DEB" ]]; then
echo "[ERROR] fluent-bit deb not found under /private/packages" >&2
exit 1
fi
echo "[INFO] Installing Fluent Bit: $(basename "$FLB_DEB")"
dpkg -i "$FLB_DEB" >/dev/null 2>&1 || true
# If dpkg reported unresolved dependencies, try apt -f only as last resort
if ! command -v /opt/fluent-bit/bin/fluent-bit >/dev/null 2>&1; then
echo "[WARN] fluent-bit binary missing after dpkg; attempting apt --fix-broken"
apt-get install -f -y -qq || true
fi
# Ensure runtime library dependencies are satisfied (libsasl2, libldap are required via libpq/curl)
MISSING=$(ldd /opt/fluent-bit/bin/fluent-bit 2>/dev/null | awk '/not found/{print $1}' | xargs -r echo || true)
if [[ -n "$MISSING" ]]; then
echo "[WARN] missing shared libs: $MISSING"
apt-get update -qq || true
apt-get install -y -qq libsasl2-2 libldap-2.5-0 || true
apt-get install -f -y -qq || true
fi
# 安装 Fluent Bit 从 deb 包
echo "[INFO] Installing Fluent Bit from deb package..."
dpkg -i /tmp/packages/fluent-bit_3.1.9_amd64.deb || true
apt-get install -f -y -qq # 解决依赖问题
# 验证 Fluent Bit 可以运行
echo "[INFO] Fluent Bit version:"
/opt/fluent-bit/bin/fluent-bit --version || { echo "[ERROR] fluent-bit not installed or libraries missing" >&2; exit 1; }
/opt/fluent-bit/bin/fluent-bit --version
# Place configuration
# 创建配置目录
mkdir -p /etc/fluent-bit
cp -r /tmp/flb/etc/* /etc/fluent-bit/
cp -r /tmp/etc/* /etc/fluent-bit/
# Create logs/buffers dirs
# 创建日志和缓冲区目录
mkdir -p /logs/train /logs/infer /buffers
chmod 755 /logs/train /logs/infer /buffers
# 控制日志目录权限:默认对宿主 bind mount 目录采用 1777可由环境变量关闭
: "${ARGUS_LOGS_WORLD_WRITABLE:=1}"
if [[ "${ARGUS_LOGS_WORLD_WRITABLE}" == "1" ]]; then
chmod 1777 /logs/train /logs/infer || true
else
chmod 755 /logs/train /logs/infer || true
fi
# 缓冲目录仅供进程使用,不对外开放写入
chmod 770 /buffers || true
# 目录属主设置为 fluent-bit不影响 1777 粘滞位)
chown -R fluent-bit:fluent-bit /logs /buffers 2>/dev/null || true
# Wait for Elasticsearch via bash /dev/tcp to avoid curl dependency
echo "[INFO] Waiting for Elasticsearch to be ready (tcp ${ES_HOST}:${ES_PORT})..."
for i in $(seq 1 120); do
if exec 3<>/dev/tcp/${ES_HOST}/${ES_PORT}; then
exec 3<&- 3>&-
echo "[INFO] Elasticsearch is ready"
break
fi
[[ $i -eq 120 ]] && { echo "[ERROR] ES not reachable" >&2; exit 1; }
sleep 1
# 等待 Elasticsearch 就绪
echo "[INFO] Waiting for Elasticsearch to be ready..."
while ! curl -fs http://${ES_HOST}:${ES_PORT}/_cluster/health >/dev/null 2>&1; do
echo " Waiting for ES at ${ES_HOST}:${ES_PORT}..."
sleep 5
done
echo "[INFO] Elasticsearch is ready"
# 启动 Fluent Bit
echo "[INFO] Starting Fluent Bit with configuration from /etc/fluent-bit/"
echo "[INFO] Command: /opt/fluent-bit/bin/fluent-bit --config=/etc/fluent-bit/fluent-bit.conf"
exec /opt/fluent-bit/bin/fluent-bit --config=/etc/fluent-bit/fluent-bit.conf
exec /opt/fluent-bit/bin/fluent-bit \
--config=/etc/fluent-bit/fluent-bit.conf

View File

@ -2,7 +2,7 @@
set -euo pipefail
ES_HOST="${ELASTICSEARCH_HOSTS:-http://es:9200}"
KB_HOST="${KB_HOST:-http://127.0.0.1:5601}"
KB_HOST="http://localhost:5601"
echo "[INFO] Starting Kibana post-start configuration..."
@ -83,37 +83,50 @@ fix_replicas_idempotent() {
}
# 幂等创建数据视图
create_or_ensure_data_view() {
local name="$1"
local title="$2"
local list_response
list_response=$(curl -fsS "$KB_HOST/api/data_views" -H 'kbn-xsrf: true' 2>/dev/null || echo "")
if [ -z "$list_response" ]; then
echo "[WARN] Failed to list data views, skipping creation check for $title"
return
fi
if echo "$list_response" | grep -Fq "\"title\":\"$title\""; then
echo "[INFO] Data view $title already exists, skipping"
return
fi
echo "[INFO] Creating data view for $title indices (allowNoIndex)"
curl -fsS -X POST "$KB_HOST/api/data_views/data_view?allowNoIndex=true" \
-H 'kbn-xsrf: true' \
-H 'Content-Type: application/json' \
-d "{\"data_view\":{\"name\":\"$name\",\"title\":\"$title\",\"timeFieldName\":\"@timestamp\",\"allowNoIndex\":true}}" \
>/dev/null && echo "[OK] Created $name data view" || echo "[WARN] Failed to create $name data view"
}
create_data_views_idempotent() {
echo "[INFO] Checking and creating data views..."
create_or_ensure_data_view "train" "train-*"
create_or_ensure_data_view "infer" "infer-*"
# 检查是否存在匹配的索引
local train_indices=$(curl -s "$ES_HOST/_cat/indices/train-*?h=index" 2>/dev/null | wc -l || echo "0")
local infer_indices=$(curl -s "$ES_HOST/_cat/indices/infer-*?h=index" 2>/dev/null | wc -l || echo "0")
# 创建 train 数据视图
if [ "$train_indices" -gt 0 ]; then
# 检查数据视图是否已存在
local train_exists=$(curl -s "$KB_HOST/api/data_views" -H 'kbn-xsrf: true' 2>/dev/null | grep '"title":"train-\*"' | wc -l )
if [ "$train_exists" -eq 0 ]; then
echo "[INFO] Creating data view for train-* indices"
curl -fsS -X POST "$KB_HOST/api/data_views/data_view" \
-H 'kbn-xsrf: true' \
-H 'Content-Type: application/json' \
-d '{"data_view":{"name":"train","title":"train-*","timeFieldName":"@timestamp"}}' \
>/dev/null && echo "[OK] Created train data view" || echo "[WARN] Failed to create train data view"
else
echo "[INFO] Train data view already exists, skipping"
fi
else
echo "[INFO] No train-* indices found, skipping train data view creation"
fi
# 创建 infer 数据视图
if [ "$infer_indices" -gt 0 ]; then
# 检查数据视图是否已存在
local infer_exists=$(curl -s "$KB_HOST/api/data_views" -H 'kbn-xsrf: true' 2>/dev/null | grep '"title":"infer-\*"' | wc -l )
if [ "$infer_exists" -eq 0 ]; then
echo "[INFO] Creating data view for infer-* indices"
curl -fsS -X POST "$KB_HOST/api/data_views/data_view" \
-H 'kbn-xsrf: true' \
-H 'Content-Type: application/json' \
-d '{"data_view":{"name":"infer","title":"infer-*","timeFieldName":"@timestamp"}}' \
>/dev/null && echo "[OK] Created infer data view" || echo "[WARN] Failed to create infer data view"
else
echo "[INFO] Infer data view already exists, skipping"
fi
else
echo "[INFO] No infer-* indices found, skipping infer data view creation"
fi
}
# 主逻辑

View File

@ -32,42 +32,3 @@ fi
echo "[OK] 初始化完成: private/argus/log/{elasticsearch,kibana}"
echo "[INFO] Fluent-bit files should be in fluent-bit/ directory"
# 准备 Fluent Bit 离线依赖(从 metric all-in-one-full 复制 deb 到 ../fluent-bit/build/packages
FLB_BUILD_PACKAGES_DIR="$root/../fluent-bit/build/packages"
mkdir -p "$FLB_BUILD_PACKAGES_DIR"
for deb in \
"$project_root/src/metric/client-plugins/all-in-one-full/plugins/fluent-bit/bin/libyaml-0-2_"*_amd64.deb \
"$project_root/src/metric/client-plugins/all-in-one-full/plugins/fluent-bit/bin/libpq5_"*_amd64.deb ; do
if ls $deb >/dev/null 2>&1; then
for f in $deb; do
base="$(basename "$f")"
if [[ ! -f "$FLB_BUILD_PACKAGES_DIR/$base" ]]; then
cp "$f" "$FLB_BUILD_PACKAGES_DIR/"
echo " [+] copied $base"
fi
done
fi
done
# 额外:从 all-in-one-full 的 ubuntu22/curl.tar.gz 解包必要依赖libsasl2/ldap便于离线安装
CURLOPT_TAR="$project_root/src/metric/client-plugins/all-in-one-full/deps/ubuntu22/curl.tar.gz"
if [[ -f "$CURLOPT_TAR" ]]; then
tmpdir=$(mktemp -d)
if tar -xzf "$CURLOPT_TAR" -C "$tmpdir" 2>/dev/null; then
for p in \
libsasl2-2_*_amd64.deb \
libsasl2-modules-db_*_amd64.deb \
libldap-2.5-0_*_amd64.deb \
libidn2-0_*_amd64.deb \
libbrotli1_*_amd64.deb \
libssl3_*_amd64.deb ; do
src=$(ls "$tmpdir"/curl/$p 2>/dev/null | head -n1 || true)
if [[ -n "$src" ]]; then
base="$(basename "$src")"
[[ -f "$FLB_BUILD_PACKAGES_DIR/$base" ]] || cp "$src" "$FLB_BUILD_PACKAGES_DIR/" && echo " [+] staged $base"
fi
done
fi
rm -rf "$tmpdir"
fi

View File

@ -28,11 +28,11 @@ fi
docker exec "$container_name" mkdir -p /logs/train /logs/infer
# 写入训练日志 (host01)
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=1 loss=1.23 model=bert\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=2 loss=1.15 model=bert\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=1 loss=1.23 model=bert\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=2 loss=1.15 model=bert\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
# 写入推理日志 (host01)
docker exec "$container_name" sh -c "printf '%s ERROR [host01] inference failed on batch=1\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "printf '%s ERROR [host01] inference failed on batch=1\n' \"\$(date '+%F %T')\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "cat <<'STACK' >> /logs/infer/infer-demo.log
Traceback (most recent call last):
File \"inference.py\", line 15, in <module>

View File

@ -28,13 +28,13 @@ fi
docker exec "$container_name" mkdir -p /logs/train /logs/infer
# 写入训练日志 (host02)
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=1 loss=1.45 model=gpt\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=2 loss=1.38 model=gpt\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=3 loss=1.32 model=gpt\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=1 loss=1.45 model=gpt\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=2 loss=1.38 model=gpt\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=3 loss=1.32 model=gpt\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
# 写入推理日志 (host02)
docker exec "$container_name" sh -c "printf '%s WARN [host02] inference slow on batch=5 latency=2.3s\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] inference completed batch=6 latency=0.8s\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "printf '%s WARN [host02] inference slow on batch=5 latency=2.3s\n' \"\$(date '+%F %T')\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] inference completed batch=6 latency=0.8s\n' \"\$(date '+%F %T')\" >> /logs/infer/infer-demo.log"
echo "[OK] 已通过docker exec写入测试日志到 host02 容器内:"
echo " - /logs/train/train-demo.log"

View File

@ -115,32 +115,20 @@ show_step "Health" "Check service health"
echo "[INFO] Checking service health..."
# 检查 Elasticsearch 健康状态
health_check_ok=1
es_health=$(curl -s "http://localhost:9200/_cluster/health" | grep -o '"status":"[^"]*"' | cut -d'"' -f4)
if [ "$es_health" = "green" ] || [ "$es_health" = "yellow" ]; then
echo "✅ Elasticsearch health: $es_health"
else
echo "❌ Elasticsearch health: $es_health"
health_check_ok=0
fi
# 检查 Kibana 状态
if curl -fs "http://localhost:5601/api/status" >/dev/null 2>&1; then
kb_status="available"
echo "✅ Kibana status: $kb_status"
data_views_json=$(curl -fs "http://localhost:5601/api/data_views" -H 'kbn-xsrf: true' 2>/dev/null || true)
if echo "$data_views_json" | grep -F '"title":"train-*"' >/dev/null 2>&1 && \
echo "$data_views_json" | grep -F '"title":"infer-*"' >/dev/null 2>&1; then
echo "✅ Kibana data views: train-* and infer-* present"
else
echo "❌ Kibana data views missing: train-* or infer-*"
health_check_ok=0
fi
else
kb_status="unavailable"
echo "⚠️ Kibana status: $kb_status"
health_check_ok=0
fi
# 检查 Fluent-Bit 指标
@ -151,13 +139,6 @@ if [ "$fb_host01_uptime" -gt 0 ] && [ "$fb_host02_uptime" -gt 0 ]; then
echo "✅ Fluent-Bit services: host01 uptime=${fb_host01_uptime}s, host02 uptime=${fb_host02_uptime}s"
else
echo "⚠️ Fluent-Bit services: host01 uptime=${fb_host01_uptime}s, host02 uptime=${fb_host02_uptime}s"
health_check_ok=0
fi
if [ "$health_check_ok" -eq 1 ]; then
true
else
false
fi
verify_step "Service health check"

View File

@ -13,8 +13,6 @@ class AppConfig:
scheduler_interval_seconds: int
node_id_prefix: str
auth_mode: str
target_prefer_net_cidrs: str
target_reachability_check: bool
def _get_int_env(name: str, default: int) -> int:
@ -29,12 +27,6 @@ def _get_int_env(name: str, default: int) -> int:
def load_config() -> AppConfig:
"""读取环境变量生成配置对象,方便统一管理运行参数。"""
def _bool_env(name: str, default: bool) -> bool:
raw = os.environ.get(name)
if raw is None or raw.strip() == "":
return default
return raw.strip().lower() in ("1", "true", "yes", "on")
return AppConfig(
db_path=os.environ.get("DB_PATH", "/private/argus/master/db.sqlite3"),
metric_nodes_json_path=os.environ.get(
@ -45,6 +37,4 @@ def load_config() -> AppConfig:
scheduler_interval_seconds=_get_int_env("SCHEDULER_INTERVAL_SECONDS", 30),
node_id_prefix=os.environ.get("NODE_ID_PREFIX", "A"),
auth_mode=os.environ.get("AUTH_MODE", "disabled"),
target_prefer_net_cidrs=os.environ.get("TARGET_PREFER_NET_CIDRS", "10.0.0.0/8,172.31.0.0/16"),
target_reachability_check=_bool_env("TARGET_REACHABILITY_CHECK", False),
)

View File

@ -1,10 +1,8 @@
from __future__ import annotations
import ipaddress
import logging
import socket
import threading
from typing import Optional, Iterable, Dict, Any, List
from typing import Optional
from .config import AppConfig
from .storage import Storage
@ -36,117 +34,10 @@ class StatusScheduler:
self._pending_nodes_json.set()
def generate_nodes_json(self) -> None:
"""根据在线节点生成 Prometheus 抓取目标,优先 overlay IP。
候选顺序meta.overlay_ip > hostname A 记录命中偏好网段> meta.ip
可选 reachability 检查TARGET_REACHABILITY_CHECK=true 9100/9400 做一次 1s TCP 连接测试
选择首个可达的候选全部失败则按顺序取第一个并记录日志
"""
with self._nodes_json_lock:
rows = self._storage.get_online_nodes_meta()
prefer_cidrs = self._parse_cidrs(self._config.target_prefer_net_cidrs)
reachability = self._config.target_reachability_check
result: List[Dict[str, Any]] = []
for row in rows:
meta = row.get("meta", {})
hostname = meta.get("hostname") or row.get("name")
labels = row.get("labels") or []
overlay_ip = meta.get("overlay_ip")
legacy_ip = meta.get("ip")
host_candidates = self._resolve_host_ips(hostname)
host_pref = self._pick_by_cidrs(host_candidates, prefer_cidrs)
candidates: List[str] = []
for ip in [overlay_ip, host_pref, legacy_ip]:
if ip and ip not in candidates:
candidates.append(ip)
chosen = None
if reachability:
ports = [9100]
try:
if int(meta.get("gpu_number", 0)) > 0:
ports.append(9400)
except Exception:
pass
for ip in candidates:
if any(self._reachable(ip, p, 1.0) for p in ports):
chosen = ip
break
if not chosen:
chosen = candidates[0] if candidates else legacy_ip
if not chosen:
# ultimate fallback: 127.0.0.1 (should not happen)
chosen = "127.0.0.1"
self._logger.warning("No candidate IPs for node; falling back", extra={"node": row.get("node_id")})
if chosen and ipaddress.ip_address(chosen) in ipaddress.ip_network("172.22.0.0/16"):
self._logger.warning(
"Prometheus target uses docker_gwbridge address; prefer overlay",
extra={"node": row.get("node_id"), "ip": chosen},
)
result.append(
{
"node_id": row.get("node_id"),
"user_id": meta.get("user"),
"ip": chosen,
"hostname": hostname,
"labels": labels if isinstance(labels, list) else [],
}
)
atomic_write_json(self._config.metric_nodes_json_path, result)
self._logger.info("nodes.json updated", extra={"count": len(result)})
# ---------------------------- helpers ----------------------------
@staticmethod
def _parse_cidrs(raw: str) -> List[ipaddress.IPv4Network]:
nets: List[ipaddress.IPv4Network] = []
for item in (x.strip() for x in (raw or "").split(",")):
if not item:
continue
try:
net = ipaddress.ip_network(item, strict=False)
if isinstance(net, ipaddress.IPv4Network):
nets.append(net)
except ValueError:
continue
return nets
@staticmethod
def _resolve_host_ips(hostname: str) -> List[str]:
ips: List[str] = []
try:
infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
for info in infos:
ip = info[4][0]
if ip not in ips:
ips.append(ip)
except OSError:
pass
return ips
@staticmethod
def _pick_by_cidrs(candidates: Iterable[str], prefer: List[ipaddress.IPv4Network]) -> str | None:
for net in prefer:
for ip in candidates:
try:
if ipaddress.ip_address(ip) in net:
return ip
except ValueError:
continue
return None
@staticmethod
def _reachable(ip: str, port: int, timeout: float) -> bool:
try:
with socket.create_connection((ip, port), timeout=timeout):
return True
except OSError:
return False
online_nodes = self._storage.get_online_nodes()
atomic_write_json(self._config.metric_nodes_json_path, online_nodes)
self._logger.info("nodes.json updated", extra={"count": len(online_nodes)})
# ------------------------------------------------------------------
# internal loop

View File

@ -324,35 +324,9 @@ class Storage:
{
"node_id": row["id"],
"user_id": meta.get("user"),
"ip": meta.get("ip"), # kept for backward-compat; preferred IP selection handled in scheduler
"ip": meta.get("ip"),
"hostname": meta.get("hostname", row["name"]),
"labels": labels if isinstance(labels, list) else [],
}
)
return result
def get_online_nodes_meta(self) -> List[Dict[str, Any]]:
"""返回在线节点的原始 meta 与名称、标签,交由上层选择目标 IP。
每项包含{ node_id, name, meta, labels }
"""
with self._lock:
cur = self._conn.execute(
"SELECT id, name, meta_json, labels_json FROM nodes WHERE status = ? ORDER BY id ASC",
("online",),
)
rows = cur.fetchall()
result: List[Dict[str, Any]] = []
for row in rows:
meta = json.loads(row["meta_json"]) if row["meta_json"] else {}
labels = json.loads(row["labels_json"]) if row["labels_json"] else []
result.append(
{
"node_id": row["id"],
"name": row["name"],
"meta": meta if isinstance(meta, dict) else {},
"labels": labels if isinstance(labels, list) else [],
}
)
return result

View File

@ -3,13 +3,12 @@ set -euo pipefail
usage() {
cat >&2 <<'USAGE'
Usage: $0 [--intranet] [--offline] [--tag <image_tag>] [--no-cache]
Usage: $0 [--intranet] [--offline] [--tag <image_tag>]
Options:
--intranet 使用指定的 PyPI 镜像源(默认清华镜像)。
--offline 完全离线构建,依赖 offline_wheels/ 目录中的离线依赖包。
--tag <image_tag> 自定义镜像标签,默认 argus-master:latest。
--no-cache 不使用 Docker 构建缓存。
USAGE
}
@ -20,7 +19,6 @@ IMAGE_TAG="${IMAGE_TAG:-argus-master:latest}"
DOCKERFILE="src/master/Dockerfile"
BUILD_ARGS=()
OFFLINE_MODE=0
NO_CACHE=0
source "$PROJECT_ROOT/scripts/common/build_user.sh"
load_build_user
@ -47,11 +45,6 @@ while [[ "$#" -gt 0 ]]; do
IMAGE_TAG="$2"
shift 2
;;
--no-cache)
NO_CACHE=1
BUILD_ARGS+=("--no-cache")
shift
;;
-h|--help)
usage
exit 0

View File

@ -4,4 +4,4 @@
/client-plugins/demo-all-in-one/publish/
/client-plugins/demo-all-in-one/checklist
/client-plugins/demo-all-in-one/VERSION
/client-plugins/all-in-one-full/artifact/
/client-plugins/all-in-one-full/

View File

@ -4,15 +4,6 @@
set -e
# PID 文件检测,防止重复执行
PIDFILE="/var/run/check_health.pid"
if [ -f "$PIDFILE" ] && kill -0 $(cat "$PIDFILE") 2>/dev/null; then
echo "健康检查脚本已在运行中,跳过本次执行" >&2
exit 0
fi
echo $$ > "$PIDFILE"
trap "rm -f $PIDFILE" EXIT
# 获取脚本所在目录
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
HEALTH_LOG_FILE="$SCRIPT_DIR/.health_log"

View File

@ -200,22 +200,22 @@ parse_version_info() {
VERSION=$(grep '"version"' "$VERSION_FILE_PATH" | sed 's/.*"version": *"\([^"]*\)".*/\1/')
BUILD_TIME=$(grep '"build_time"' "$VERSION_FILE_PATH" | sed 's/.*"build_time": *"\([^"]*\)".*/\1/')
# 解析 artifact_list(跳过字段名本身)
grep -A 100 '"artifact_list"' "$VERSION_FILE_PATH" | grep -v '"artifact_list"' | grep -E '^\s*"[^"]+":\s*"[^"]+"' | while read line; do
# 解析 artifact_list
grep -A 100 '"artifact_list"' "$VERSION_FILE_PATH" | grep -E '^\s*"[^"]+":\s*"[^"]+"' | while read line; do
component=$(echo "$line" | sed 's/.*"\([^"]*\)":\s*"[^"]*".*/\1/')
version=$(echo "$line" | sed 's/.*"[^"]*":\s*"\([^"]*\)".*/\1/')
echo "$component:$version" >> "$TEMP_DIR/components.txt"
done
# 解析 checksums(跳过字段名本身)
grep -A 100 '"checksums"' "$VERSION_FILE_PATH" | grep -v '"checksums"' | grep -E '^\s*"[^"]+":\s*"[^"]+"' | while read line; do
# 解析 checksums
grep -A 100 '"checksums"' "$VERSION_FILE_PATH" | grep -E '^\s*"[^"]+":\s*"[^"]+"' | while read line; do
component=$(echo "$line" | sed 's/.*"\([^"]*\)":\s*"[^"]*".*/\1/')
checksum=$(echo "$line" | sed 's/.*"[^"]*":\s*"\([^"]*\)".*/\1/')
echo "$component:$checksum" >> "$TEMP_DIR/checksums.txt"
done
# 解析 install_order(跳过字段名本身,只取数组元素)
grep -A 100 '"install_order"' "$VERSION_FILE_PATH" | grep -v '"install_order"' | grep -E '^\s*"[^"]+"' | while read line; do
# 解析 install_order
grep -A 100 '"install_order"' "$VERSION_FILE_PATH" | grep -E '^\s*"[^"]+"' | while read line; do
component=$(echo "$line" | sed 's/.*"\([^"]*\)".*/\1/')
echo "$component" >> "$TEMP_DIR/install_order.txt"
done
@ -317,152 +317,85 @@ create_install_dirs() {
log_success "安装目录创建完成: $INSTALL_DIR"
}
# 获取系统版本
get_system_version() {
if [[ ! -f /etc/os-release ]]; then
log_error "无法检测操作系统版本"
return 1
fi
source /etc/os-release
# 提取主版本号
case "$VERSION_ID" in
"20.04")
echo "ubuntu20"
;;
"22.04")
echo "ubuntu22"
;;
*)
log_warning "未识别的Ubuntu版本: $VERSION_ID尝试使用ubuntu22"
echo "ubuntu22"
;;
esac
}
# 安装系统依赖包
install_system_deps() {
log_info "检查系统依赖包..."
local script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
local deps_dir="$script_dir/deps"
# 检查deps目录是否存在
if [[ ! -d "$deps_dir" ]]; then
log_info "deps 目录不存在,跳过系统依赖包安装"
return 0
fi
# 获取系统版本对应的依赖目录
local system_version=$(get_system_version)
local version_deps_dir="$deps_dir/$system_version"
log_info "检测到系统版本: $system_version"
# 检查版本特定的依赖目录是否存在
if [[ ! -d "$version_deps_dir" ]]; then
log_warning "未找到 $system_version 版本的依赖目录: $version_deps_dir"
# 回退到旧的逻辑检查根deps目录
local deps_count=$(find "$deps_dir" -name "*.tar.gz" | wc -l)
if [[ $deps_count -eq 0 ]]; then
log_info "deps 目录中没有 tar.gz 文件,跳过系统依赖包安装"
return 0
fi
version_deps_dir="$deps_dir"
else
# 检查版本目录中是否有tar.gz文件
local deps_count=$(find "$version_deps_dir" -name "*.tar.gz" | wc -l)
if [[ $deps_count -eq 0 ]]; then
log_info "$system_version 版本目录中没有 tar.gz 文件,跳过系统依赖包安装"
return 0
fi
# 检查是否有tar.gz文件
local deps_count=$(find "$deps_dir" -name "*.tar.gz" | wc -l)
if [[ $deps_count -eq 0 ]]; then
log_info "deps 目录中没有 tar.gz 文件,跳过系统依赖包安装"
return 0
fi
log_info "找到 $system_version 版本的依赖包,开始安装..."
log_info "找到 $deps_count 个系统依赖包,开始安装..."
# 创建临时目录用于解压依赖包
local deps_temp_dir="${TEMP_DIR:-/tmp}/deps"
local deps_temp_dir="$TEMP_DIR/deps"
mkdir -p "$deps_temp_dir"
# 定义要检查的核心依赖
local CORE_DEPS=(jq cron curl)
local FAILED_DEPS=()
# 处理每个tar.gz文件
find "$version_deps_dir" -name "*.tar.gz" | while read tar_file; do
find "$deps_dir" -name "*.tar.gz" | while read tar_file; do
local tar_basename=$(basename "$tar_file")
local extract_name="${tar_basename%.tar.gz}"
log_info "处理依赖包: $tar_basename"
# 解压到临时目录
local extract_dir="$deps_temp_dir/$extract_name"
mkdir -p "$extract_dir"
if tar -xzf "$tar_file" -C "$extract_dir" 2>/dev/null; then
log_success " $tar_basename 解压完成"
else
log_error " $tar_basename 解压失败"
continue
fi
# 进入解压目录查找deb包
cd "$extract_dir" || continue
local deb_files=(*.deb)
if [[ ${#deb_files[@]} -gt 0 ]]; then
log_info " 找到 ${#deb_files[@]} 个 deb 包,开始安装..."
for deb in "${deb_files[@]}"; do
local pkg_name
pkg_name=$(dpkg-deb -f "$deb" Package 2>/dev/null)
# 如果已安装,则跳过
if dpkg -s "$pkg_name" &>/dev/null; then
log_success " $pkg_name 已安装,跳过"
continue
fi
# 尝试安装
log_info " 安装 $pkg_name..."
if DEBIAN_FRONTEND=noninteractive dpkg -i "$deb" &>/dev/null; then
log_success " $pkg_name 安装成功"
cd "$extract_dir"
local deb_count=$(find . -name "*.deb" | wc -l)
if [[ $deb_count -gt 0 ]]; then
log_info " 找到 $deb_count 个 deb 包,开始安装..."
# 1. 先尝试安装所有deb包
log_info " 第1步批量安装deb包..."
if dpkg -i *.deb 2>/dev/null; then
log_success " 所有deb包安装成功"
else
log_warning " 部分deb包安装失败可能存在依赖问题"
# 2. 使用apt-get修复依赖
log_info " 第2步修复依赖关系..."
if apt-get install -f -y; then
log_success " 依赖关系修复完成"
else
log_warning " $pkg_name 安装失败,尝试修复依赖..."
if DEBIAN_FRONTEND=noninteractive apt-get install -f -y &>/dev/null; then
if dpkg -s "$pkg_name" &>/dev/null; then
log_success " $pkg_name 修复安装成功"
else
log_error " $pkg_name 仍未安装成功"
FAILED_DEPS+=("$pkg_name")
fi
else
log_error " $pkg_name 自动修复失败"
FAILED_DEPS+=("$pkg_name")
fi
log_error " 依赖关系修复失败"
# 继续处理其他包,不退出
fi
done
fi
else
log_info " $tar_basename 中没有找到deb包跳过"
fi
# 返回到依赖临时目录
cd "$deps_temp_dir" || continue
cd "$deps_temp_dir"
done
# 检查并启动 cron 服务
start_cron_service
# 总结安装结果
if [[ ${#FAILED_DEPS[@]} -gt 0 ]]; then
log_error "以下系统依赖未能成功安装,安装终止,请手动安装后重试:"
for f in "${FAILED_DEPS[@]}"; do
echo " - $f"
done
exit 1
else
log_success "系统依赖包安装完成,全部就绪"
fi
log_success "系统依赖包安装完成"
}
# 启动 cron 服务
@ -704,18 +637,6 @@ EOF
log_success "安装记录已创建: $install_record_file"
}
# 检查cron任务是否已存在
check_cron_task_exists() {
local task_pattern="$1"
local temp_cron="$2"
if grep -q "$task_pattern" "$temp_cron"; then
return 0 # 任务已存在
else
return 1 # 任务不存在
fi
}
# 设置健康检查定时任务
setup_health_check_cron() {
log_info "设置健康检查定时任务..."
@ -740,7 +661,7 @@ setup_health_check_cron() {
crontab -l 2>/dev/null > "$temp_cron" || touch "$temp_cron"
# 检查并删除旧的健康检查任务
if check_cron_task_exists "check_health.sh" "$temp_cron"; then
if grep -q "check_health.sh" "$temp_cron"; then
log_info "发现旧的健康检查定时任务,正在更新..."
# 删除所有包含check_health.sh的行
grep -v "check_health.sh" "$temp_cron" > "$temp_cron.new"
@ -795,7 +716,7 @@ setup_dns_sync_cron() {
crontab -l 2>/dev/null > "$temp_cron" || touch "$temp_cron"
# 检查并删除旧的 DNS 同步任务
if check_cron_task_exists "sync_dns.sh" "$temp_cron"; then
if grep -q "sync_dns.sh" "$temp_cron"; then
log_info "发现旧的 DNS 同步定时任务,正在更新..."
# 删除所有包含sync_dns.sh的行
grep -v "sync_dns.sh" "$temp_cron" > "$temp_cron.new"
@ -803,15 +724,16 @@ setup_dns_sync_cron() {
log_info "旧的 DNS 同步定时任务已删除"
fi
# 添加新的定时任务(每1分钟执行一次)
# 添加新的定时任务(每30秒执行一次)
# 直接使用版本目录中的 DNS 同步脚本
echo "# Argus-Metrics DNS 同步定时任务" >> "$temp_cron"
echo "* * * * * $sync_dns_script >> $INSTALL_DIR/.dns_sync.log 2>&1" >> "$temp_cron"
echo "* * * * * sleep 30; $sync_dns_script >> $INSTALL_DIR/.dns_sync.log 2>&1" >> "$temp_cron"
# 安装新的crontab
if crontab "$temp_cron"; then
log_success "DNS 同步定时任务设置成功"
log_info " 执行频率: 每1分钟"
log_info " 执行频率: 每30秒"
log_info " 日志文件: $INSTALL_DIR/.dns_sync.log"
log_info " 查看定时任务: crontab -l"
log_info " 删除定时任务: crontab -e"
@ -849,7 +771,7 @@ setup_version_check_cron() {
crontab -l > "$temp_cron" 2>/dev/null || touch "$temp_cron"
# 检查是否已存在版本校验定时任务
if check_cron_task_exists "check_version.sh" "$temp_cron"; then
if grep -q "check_version.sh" "$temp_cron"; then
log_info "发现旧的版本校验定时任务,正在更新..."
# 删除所有包含check_version.sh的行
grep -v "check_version.sh" "$temp_cron" > "$temp_cron.new"
@ -902,7 +824,7 @@ setup_restart_cron() {
crontab -l > "$temp_cron" 2>/dev/null || touch "$temp_cron"
# 检查是否已存在自动重启定时任务
if check_cron_task_exists "restart_unhealthy.sh" "$temp_cron"; then
if grep -q "restart_unhealthy.sh" "$temp_cron"; then
log_info "发现旧的自动重启定时任务,正在更新..."
# 删除所有包含restart_unhealthy.sh的行
grep -v "restart_unhealthy.sh" "$temp_cron" > "$temp_cron.new"
@ -963,9 +885,9 @@ main() {
check_system
find_version_file
create_install_dirs
install_system_deps
parse_version_info
parse_version_info
verify_checksums
install_system_deps
install_components
copy_config_files
create_install_record
@ -973,20 +895,6 @@ main() {
setup_dns_sync_cron
setup_version_check_cron
setup_restart_cron
# 注释掉立即执行健康检查避免与cron任务重复执行
# log_info "立即执行一次健康检查..."
# local check_health_script="$INSTALL_DIR/check_health.sh"
# if [[ -f "$check_health_script" ]]; then
# if "$check_health_script" >> "$INSTALL_DIR/.health_check.log" 2>&1; then
# log_success "健康检查执行完成"
# else
# log_warning "健康检查执行失败,请检查日志: $INSTALL_DIR/.health_check.log"
# fi
# else
# log_warning "健康检查脚本不存在: $check_health_script"
# fi
show_install_info
}

View File

@ -29,68 +29,26 @@ log_error() {
show_help() {
echo "Argus-Metric Artifact 发布脚本"
echo
echo "用法: $0 <版本号> [选项]"
echo "用法: $0 <版本号>"
echo
echo "参数:"
echo " <版本号> 要发布的版本号,对应 artifact 目录中的版本"
echo
echo "选项:"
echo " --output-dir <路径> 指定输出目录 (默认: /private/argus/ftp/share/)"
echo " --owner <uid:gid> 指定文件所有者 (默认: 2133:2015)"
echo " -h, --help 显示此帮助信息"
echo " <版本号> 要发布的版本号,对应 artifact 目录中的版本"
echo
echo "示例:"
echo " $0 1.20.0 # 使用默认配置发布"
echo " $0 1.20.0 --output-dir /tmp/publish # 指定输出目录"
echo " $0 1.20.0 --owner 1000:1000 # 指定文件所有者"
echo " $0 1.20.0 --output-dir /srv/ftp --owner root:root # 同时指定两者"
echo " $0 1.20.0 # 发布 1.20.0 版本"
echo
}
# 默认配置
DEFAULT_PUBLISH_DIR="/private/argus/ftp/share/"
DEFAULT_OWNER="2133:2015"
# 解析参数
VERSION=""
PUBLISH_DIR="$DEFAULT_PUBLISH_DIR"
OWNER="$DEFAULT_OWNER"
while [[ $# -gt 0 ]]; do
case $1 in
-h|--help)
show_help
exit 0
;;
--output-dir)
PUBLISH_DIR="$2"
shift 2
;;
--owner)
OWNER="$2"
shift 2
;;
*)
if [[ -z "$VERSION" ]]; then
VERSION="$1"
shift
else
log_error "未知参数: $1"
show_help
exit 1
fi
;;
esac
done
# 检查版本号是否提供
if [[ -z "$VERSION" ]]; then
# 检查参数
if [[ $# -ne 1 ]]; then
log_error "请提供版本号参数"
show_help
exit 1
fi
VERSION="$1"
ARTIFACT_DIR="artifact/$VERSION"
PUBLISH_DIR="/Users/sundapeng/Project/nlp/aiops/client-plugins/all-in-one/publish/"
# 检查版本目录是否存在
if [[ ! -d "$ARTIFACT_DIR" ]]; then
@ -99,32 +57,11 @@ if [[ ! -d "$ARTIFACT_DIR" ]]; then
fi
log_info "开始发布版本: $VERSION"
log_info "输出目录: $PUBLISH_DIR"
log_info "文件所有者: $OWNER"
# 确保发布目录存在
log_info "确保发布目录存在: $PUBLISH_DIR"
mkdir -p "$PUBLISH_DIR"
IFS=':' read -r OWNER_UID OWNER_GID <<< "$OWNER"
if [[ -z "$OWNER_UID" || -z "$OWNER_GID" ]]; then
log_error "--owner 格式不正确,应为 uid:gid"
exit 1
fi
CURRENT_UID=$(id -u)
CURRENT_GID=$(id -g)
if [[ "$OWNER_UID" != "$CURRENT_UID" || "$OWNER_GID" != "$CURRENT_GID" ]]; then
if [[ "$CURRENT_UID" -ne 0 ]]; then
log_error "当前用户 (${CURRENT_UID}:${CURRENT_GID}) 无法设置所有者为 ${OWNER_UID}:${OWNER_GID}"
log_error "请以目标用户运行脚本或预先调整目录权限"
exit 1
fi
NEED_CHOWN=true
else
NEED_CHOWN=false
fi
# 创建临时目录用于打包
TEMP_PACKAGE_DIR="/tmp/argus-metric-package-$$"
mkdir -p "$TEMP_PACKAGE_DIR"
@ -230,28 +167,17 @@ cd "$TEMP_PACKAGE_DIR"
tar -czf "$PUBLISH_DIR/$TAR_NAME" *
cd - > /dev/null
if [[ "$NEED_CHOWN" == true ]]; then
log_info "设置文件所有者为: $OWNER"
chown "$OWNER" "$PUBLISH_DIR/$TAR_NAME"
fi
# 清理临时目录
rm -rf "$TEMP_PACKAGE_DIR"
# 更新 LATEST_VERSION 文件
log_info "更新 LATEST_VERSION 文件..."
echo "$VERSION" > "$PUBLISH_DIR/LATEST_VERSION"
if [[ "$NEED_CHOWN" == true ]]; then
chown "$OWNER" "$PUBLISH_DIR/LATEST_VERSION"
fi
# 复制 DNS 配置文件到发布目录根目录(直接从 config 目录复制)
if [[ -f "config/dns.conf" ]]; then
log_info "复制 DNS 配置文件到发布目录根目录..."
cp "config/dns.conf" "$PUBLISH_DIR/"
if [[ "$NEED_CHOWN" == true ]]; then
chown "$OWNER" "$PUBLISH_DIR/dns.conf"
fi
log_success "DNS 配置文件复制完成: $PUBLISH_DIR/dns.conf"
else
log_warning "未找到 config/dns.conf 文件,跳过 DNS 配置文件复制"
@ -261,9 +187,6 @@ fi
if [[ -f "scripts/setup.sh" ]]; then
log_info "复制 setup.sh 到发布目录..."
cp "scripts/setup.sh" "$PUBLISH_DIR/"
if [[ "$NEED_CHOWN" == true ]]; then
chown "$OWNER" "$PUBLISH_DIR/setup.sh"
fi
fi
# 显示发布结果

View File

@ -2,15 +2,6 @@
# 此脚本会检查各组件的健康状态,并重启不健康的组件
# PID 文件检测,防止重复执行
PIDFILE="/var/run/restart_unhealthy.pid"
if [ -f "$PIDFILE" ] && kill -0 $(cat "$PIDFILE") 2>/dev/null; then
echo "自动重启脚本已在运行中,跳过本次执行" >&2
exit 0
fi
echo $$ > "$PIDFILE"
trap "rm -f $PIDFILE" EXIT
# 获取脚本所在目录
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
INSTALL_RECORD_FILE="$SCRIPT_DIR/.install_record"

View File

@ -1,143 +1,244 @@
#!/bin/bash
# DNS 同步脚本
# 比较 FTP 根目录的 dns.conf 和本地的 dns.conf如果有变化则同步到 /etc/resolv.conf
set -e
# 颜色
RED='\033[0;31m'; GREEN='\033[0;32m'; YELLOW='\033[1;33m'; BLUE='\033[0;34m'; NC='\033[0m'
# 颜色定义
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# 日志函数
log_info() { echo -e "${BLUE}[INFO]${NC} $1" >&2; }
log_success() { echo -e "${GREEN}[SUCCESS]${NC} $1" >&2; }
log_warning() { echo -e "${YELLOW}[WARNING]${NC} $1" >&2; }
log_error() { echo -e "${RED}[ERROR]${NC} $1" >&2; }
# 日志函数 - 输出到 stderr 避免影响函数返回值
log_info() {
echo -e "${BLUE}[INFO]${NC} $1" >&2
}
log_success() {
echo -e "${GREEN}[SUCCESS]${NC} $1" >&2
}
log_warning() {
echo -e "${YELLOW}[WARNING]${NC} $1" >&2
}
log_error() {
echo -e "${RED}[ERROR]${NC} $1" >&2
}
# 获取脚本所在目录
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
LOCAL_DNS_CONF="/opt/argus-metric/dns.conf"
RESOLV_CONF="/etc/resolv.conf"
ALT_RESOLV_CONF="/run/resolv.conf"
LOG_FILE="/opt/argus-metric/.dns_sync.log"
REMOTE_DNS_CONF_URL=""
RESOLV_CONF="/etc/resolv.conf"
LOG_FILE="/opt/argus-metric/.dns_sync.log"
# 获取 FTP 配置
# 从环境变量或配置文件获取 FTP 服务器信息
get_ftp_config() {
# 优先从环境变量获取配置
log_info "获取 FTP 配置信息..."
# 如果环境变量中没有设置,则尝试从配置文件读取
if [[ -z "$FTP_SERVER" || -z "$FTP_USER" || -z "$FTP_PASSWORD" ]]; then
[[ -f "$SCRIPT_DIR/config.env" ]] && source "$SCRIPT_DIR/config.env"
local config_file="$SCRIPT_DIR/config.env"
if [[ -f "$config_file" ]]; then
log_info "从配置文件读取 FTP 配置: $config_file"
source "$config_file"
fi
else
log_info "使用环境变量中的 FTP 配置"
fi
# 设置默认值(如果环境变量和配置文件都没有设置)
FTP_SERVER="${FTP_SERVER:-localhost}"
FTP_USER="${FTP_USER:-ftpuser}"
FTP_PASSWORD="${FTP_PASSWORD:-ZGClab1234!}"
# 构建远程 DNS 配置文件 URL
REMOTE_DNS_CONF_URL="ftp://${FTP_USER}:${FTP_PASSWORD}@${FTP_SERVER}/dns.conf"
log_info "FTP 配置来源: ${FTP_CONFIG_SOURCE:-环境变量/配置文件}"
}
# 下载远程 dns.conf
# 下载远程 DNS 配置文件
download_remote_dns_conf() {
local tmp="/tmp/dns.remote.$$"
local temp_file="/tmp/dns.conf.remote.$$"
log_info "从 FTP 服务器下载 DNS 配置文件..."
log_info "远程地址: $REMOTE_DNS_CONF_URL"
log_info "FTP 服务器: $FTP_SERVER"
log_info "FTP 用户: $FTP_USER"
# 先测试 FTP 连接
log_info "测试 FTP 连接..."
if ! curl -u "${FTP_USER}:${FTP_PASSWORD}" -sfI "ftp://${FTP_SERVER}/" >/dev/null; then
log_error "无法连接到 FTP 服务器: $FTP_SERVER"; return 1
fi
if ! curl -u "${FTP_USER}:${FTP_PASSWORD}" -sf "ftp://${FTP_SERVER}/dns.conf" -o "$tmp" 2>/dev/null; then
log_error "下载 dns.conf 失败"; rm -f "$tmp"; return 1
fi
echo "$tmp"
}
# 文件比较
compare_files() { diff -q "$1" "$2" >/dev/null 2>&1; }
# 从 dns.conf 提取有效 IP
get_dns_ips() {
grep -Eo '^[0-9]{1,3}(\.[0-9]{1,3}){3}$' "$1" | sort -u
}
# 安全更新 resolv.conf保留符号链接
update_resolv_conf() {
local dns_conf="$1"
local dns_ips
mapfile -t dns_ips < <(get_dns_ips "$dns_conf")
[[ ${#dns_ips[@]} -eq 0 ]] && { log_warning "未检测到有效 DNS"; return; }
local target_file="$RESOLV_CONF"
if [[ ! -w "$RESOLV_CONF" ]]; then
log_warning "/etc/resolv.conf 不可写,使用兜底路径 $ALT_RESOLV_CONF"
target_file="$ALT_RESOLV_CONF"
fi
local temp="/tmp/resolv.new.$$"
cp "$target_file" "${target_file}.backup.$(date +%Y%m%d_%H%M%S)" 2>/dev/null || true
log_info "更新 DNS 配置文件: $target_file"
# 写入新的 nameserver 行
for ip in "${dns_ips[@]}"; do
echo "nameserver $ip"
done >"$temp"
# 追加原内容(去掉重复 nameserver
grep -v '^nameserver' "$target_file" >>"$temp" 2>/dev/null || true
awk '!a[$0]++' "$temp" >"${temp}.uniq"
# ⚙️ 使用 cat 原地覆盖,避免 mv 引发 “设备忙”
if cat "${temp}.uniq" >"$target_file" 2>/dev/null; then
chmod 644 "$target_file"
log_success "DNS 更新完成: ${dns_ips[*]}"
if curl -u "${FTP_USER}:${FTP_PASSWORD}" -sfI "ftp://${FTP_SERVER}/" >/dev/null 2>&1; then
log_success "FTP 服务器连接成功"
else
log_error "无法写入 $target_file,可能被系统锁定"
log_error "无法连接到 FTP 服务器: $FTP_SERVER"
log_error "请检查:"
log_error " 1. FTP 服务器是否运行"
log_error " 2. 网络连接是否正常"
log_error " 3. 服务器地址是否正确"
return 1
fi
# 测试 dns.conf 文件是否存在
log_info "检查远程 dns.conf 文件是否存在..."
if curl -u "${FTP_USER}:${FTP_PASSWORD}" -sfI "ftp://${FTP_SERVER}/dns.conf" >/dev/null 2>&1; then
log_success "远程 dns.conf 文件存在"
else
log_error "远程 dns.conf 文件不存在或无法访问"
log_error "请检查 FTP 服务器根目录下是否有 dns.conf 文件"
return 1
fi
# 尝试下载文件
log_info "开始下载 dns.conf 文件..."
if curl -u "${FTP_USER}:${FTP_PASSWORD}" -sf "ftp://${FTP_SERVER}/dns.conf" -o "$temp_file" 2>/dev/null; then
log_success "远程 DNS 配置文件下载成功"
echo "$temp_file"
else
log_error "下载 dns.conf 文件失败"
log_error "尝试手动测试命令:"
log_error " curl -u ${FTP_USER}:${FTP_PASSWORD} ftp://${FTP_SERVER}/dns.conf"
rm -f "$temp_file"
return 1
fi
rm -f "$temp" "${temp}.uniq"
}
# 检查 resolv.conf 是否包含 dns.conf 内容
ensure_dns_in_resolv() {
local dns_conf="$1"
local dns_ips
mapfile -t dns_ips < <(get_dns_ips "$dns_conf")
[[ ${#dns_ips[@]} -eq 0 ]] && return
# 比较两个文件是否相同
compare_files() {
local file1="$1"
local file2="$2"
if [[ ! -f "$file1" || ! -f "$file2" ]]; then
return 1
fi
# 使用 diff 比较文件内容
if diff -q "$file1" "$file2" >/dev/null 2>&1; then
return 0 # 文件相同
else
return 1 # 文件不同
fi
}
for ip in "${dns_ips[@]}"; do
if ! grep -q "nameserver $ip" "$RESOLV_CONF" 2>/dev/null; then
log_warning "检测到 /etc/resolv.conf 缺少 $ip,执行兜底修复"
update_resolv_conf "$dns_conf"
return
# 将 DNS 配置追加到 /etc/resolv.conf
update_resolv_conf() {
local dns_conf_file="$1"
log_info "更新 /etc/resolv.conf 文件..."
# 备份原始文件
if [[ -f "$RESOLV_CONF" ]]; then
cp "$RESOLV_CONF" "${RESOLV_CONF}.backup.$(date +%Y%m%d_%H%M%S)"
log_info "已备份原始 resolv.conf 文件"
fi
# 读取 DNS 配置文件并追加到 resolv.conf
while IFS= read -r line; do
# 跳过空行和注释行
[[ -z "$line" || "$line" =~ ^[[:space:]]*# ]] && continue
# 验证是否为有效的 IP 地址
if [[ "$line" =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]]; then
# 检查是否已存在相同的 nameserver 行
if ! grep -q "nameserver $line" "$RESOLV_CONF" 2>/dev/null; then
echo "nameserver $line" >> "$RESOLV_CONF"
log_info "添加 DNS 服务器: $line"
else
log_info "DNS 服务器已存在,跳过: $line"
fi
else
log_warning "跳过无效的 DNS 地址: $line"
fi
done
log_info "/etc/resolv.conf 已包含所有 DNS"
done < "$dns_conf_file"
# 设置文件权限
chmod 644 "$RESOLV_CONF"
log_success "/etc/resolv.conf 文件更新完成"
}
log_sync() { echo "[$(date '+%F %T')] $1" >>"$LOG_FILE"; }
# 记录同步日志
log_sync() {
local message="$1"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] $message" >> "$LOG_FILE"
}
# 主函数
main() {
log_info "开始 DNS 同步检查..."
mkdir -p /opt/argus-metric
log_sync "DNS 同步检查开始"
# 确保系统目录存在
mkdir -p "/opt/argus-metric"
# 获取 FTP 配置
get_ftp_config
local remote_file
if ! remote_file=$(download_remote_dns_conf); then
log_error "下载失败"; log_sync "同步失败"; exit 1
fi
# 检查本地 DNS 配置文件是否存在
if [[ ! -f "$LOCAL_DNS_CONF" ]]; then
log_info "本地 dns.conf 不存在,初始化..."
cp "$remote_file" "$LOCAL_DNS_CONF"
update_resolv_conf "$LOCAL_DNS_CONF"
log_sync "首次同步完成"
else
if compare_files "$LOCAL_DNS_CONF" "$remote_file"; then
log_info "dns.conf 无变化"
ensure_dns_in_resolv "$LOCAL_DNS_CONF"
log_sync "dns.conf 无变化,执行兜底检查"
else
log_info "检测到 DNS 配置更新"
log_warning "本地 DNS 配置文件不存在: $LOCAL_DNS_CONF"
log_warning "将下载远程配置文件并更新系统 DNS 设置"
# 下载远程配置文件
if remote_file=$(download_remote_dns_conf); then
# 复制到本地
cp "$remote_file" "$LOCAL_DNS_CONF"
log_success "远程 DNS 配置文件已保存到本地"
# 更新 resolv.conf
update_resolv_conf "$LOCAL_DNS_CONF"
log_sync "DNS 配置同步完成"
log_sync "首次同步完成DNS 配置已更新"
# 清理临时文件
rm -f "$remote_file"
else
log_error "无法下载远程 DNS 配置文件,同步失败"
log_sync "同步失败:无法下载远程配置文件"
exit 1
fi
else
log_info "本地 DNS 配置文件存在: $LOCAL_DNS_CONF"
# 下载远程配置文件进行比较
if remote_file=$(download_remote_dns_conf); then
# 比较文件
if compare_files "$LOCAL_DNS_CONF" "$remote_file"; then
log_info "DNS 配置文件无变化,无需更新"
log_sync "DNS 配置文件无变化"
else
log_info "检测到 DNS 配置文件有变化,开始同步..."
log_sync "检测到 DNS 配置文件变化,开始同步"
# 更新本地配置文件
cp "$remote_file" "$LOCAL_DNS_CONF"
log_success "本地 DNS 配置文件已更新"
# 更新 resolv.conf
update_resolv_conf "$LOCAL_DNS_CONF"
log_sync "DNS 配置同步完成"
fi
# 清理临时文件
rm -f "$remote_file"
else
log_error "无法下载远程 DNS 配置文件,跳过本次同步"
log_sync "同步失败:无法下载远程配置文件"
exit 1
fi
fi
rm -f "$remote_file"
log_success "DNS 同步流程完成"
log_success "DNS 同步检查完成"
log_sync "DNS 同步检查完成"
}
# 脚本入口
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
main "$@"
fi

View File

@ -1,59 +0,0 @@
# 客户侧组件安装包构建、发布流程
## 第一步:配置版本和组件
首先搞定配置文件:
1. 把 `.checklist.example` 重命名成 `checklist`
2. 把 `.VERSION.example` 重命名成 `VERSION`
### checklist 文件格式
```
# 组件名称 目录路径 版本号 [依赖组件] [安装顺序]
dcgm-exporter-installer /path/to/dcgm-exporter-installer 1.1.0
node-exporter-installer /path/to/node-exporter-installer 1.1.0
```
### VERSION 文件
设置需要发布的版本号,比如 `1.29.0`
> 建议用 `version-manager.sh` 来管理版本
## 第二步:构建安装包
直接跑脚本:
```bash
./package_artifact.sh
```
构建完的东西会放在 `artifact/` 目录下,按版本分文件夹。
如果版本已经存在了,想要覆盖重新构建:
```bash
./package_artifact.sh --force
```
构建完可以手工测试安装包。
## 第三步:发布安装包
用这个脚本发布:
```bash
./publish_artifact.sh
```
发布后的内容在 `publish/` 目录里,包含:
- 压缩版本的安装包
- 一键安装的bash脚本
## 第四步部署到FTP服务器
把发布的内容上传到FTP服务器客户端就可以通过一键命令安装
```bash
curl -fsSL http://your-ftp-server/install.sh | sh -
curl -fsSL "ftp://ftpuser:{PASSWD}!@10.211.55.4/share/setup.sh" | sudo bash -s -- --server 10.211.55.4 --user ftpuser --password {PASSWD}
```
这样客户就能直接从FTP服务器下载并安装组件了。

View File

@ -1,3 +0,0 @@
# 组件名称 目录路径 版本号 [依赖组件] [安装顺序]
dcgm-exporter-installer /Users/sundapeng/Project/nlp/aiops/client-plugins/dcgm-exporter-installer 1.1.0
node-exporter-installer /Users/sundapeng/Project/nlp/aiops/client-plugins/node-exporter-installer 1.1.0

Some files were not shown because too many files have changed in this diff Show More