Compare commits: main...dev_1.0.0_ (20 commits)
| Author | SHA1 | Date |
|---|---|---|
| | 2a45d40726 | |
| | 1a768bc837 | |
| | bab771b5e4 | |
| | 9032608e79 | |
| | 31ccb0b1b8 | |
| | 2b367f9ba1 | |
| | 084c0a3719 | |
| | 8fbe107ac9 | |
| | abc739b1be | |
| | cb213df6f8 | |
| | ade9dd7d62 | |
| | 54f99b854c | |
| | ac15595c8e | |
| | f17bc6d312 | |
| | ac0eb558e9 | |
| | ef89e5d7e6 | |
| | c098f1d3ce | |
| | 1e5e91b193 | |
| | 8a38d3d0b2 | |
| | 26e1c964ed | |

.gitattributes (vendored, 1 line changed)
@@ -1 +0,0 @@
src/metric/client-plugins/all-in-one-full/plugins/*/bin/* filter=lfs diff=lfs merge=lfs -text
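
build/README.md (150 lines changed)
@@ -1,150 +0,0 @@
# ARGUS Unified Build Script Guide (build/build_images.sh)

This directory provides a single entry-point script, `build/build_images.sh`, which covers three common scenarios:
- System integration tests (src/sys/tests)
- Swarm system integration tests (src/sys/swarm_tests)
- Building offline installation packages (deployment_new: Server/Client-GPU)

This document also explains the UID/GID resolution rules, the image tag policy, the common options, and the retry mechanism.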

## Prerequisites
- Docker Engine ≥ 20.10 (≥ 23.x/24.x recommended)
- Docker Compose v2 (the `docker compose` subcommand)
- Optional: an intranet build mirror (`--intranet`)

## UID/GID rules (for the in-container user and volume ownership)
- Non-pkg builds (core/master/metric/web/alert/sys/gpu_bundle/cpu_bundle):
  - Read `configs/build_user.local.conf`, falling back to `configs/build_user.conf`;
  - Can be overridden by the environment variables `ARGUS_BUILD_UID` and `ARGUS_BUILD_GID`;
- pkg builds (`--only server_pkg`, `--only client_pkg`):
  - Read `configs/build_user.pkg.conf` first, then `build_user.local.conf`, then `build_user.conf`;
  - Can be overridden by environment variables;
- The CPU bundle explicitly uses the non-pkg chain (it does not read `build_user.pkg.conf`).
- Note: Docker layers that depend on the UID/GID are rebuilt automatically when those arguments change, so switching between build profiles does not produce a mispackaged artifact.
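
The lookup order above can be sketched roughly as follows. This is an illustrative sketch only, not the project's `load_build_user`: the function name `resolve_build_user`, the `UID=`/`GID=` key names inside the config files, and the "first existing file wins" behaviour are all assumptions made for the example.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the UID/GID lookup chain described in the README.
# Assumptions (not confirmed by the source): config files contain plain
# UID=<n> and GID=<n> lines, and the first file that exists is used.

resolve_build_user() {
  local profile="$1" config_dir="$2" uid="" gid="" file
  # Non-pkg chain: local override first, then the shared default.
  local candidates="build_user.local.conf build_user.conf"
  if [ "$profile" = "pkg" ]; then
    # pkg builds consult build_user.pkg.conf before the regular chain.
    candidates="build_user.pkg.conf $candidates"
  fi

  for file in $candidates; do
    if [ -f "$config_dir/$file" ]; then
      uid="$(sed -n 's/^UID=//p' "$config_dir/$file" | head -n1)"
      gid="$(sed -n 's/^GID=//p' "$config_dir/$file" | head -n1)"
      break
    fi
  done

  # Environment variables take precedence over whatever the files say.
  uid="${ARGUS_BUILD_UID:-$uid}"
  gid="${ARGUS_BUILD_GID:-$gid}"
  printf '%s:%s\n' "$uid" "$gid"
}
```

Under these assumptions, `resolve_build_user pkg configs` would print the pair taken from `configs/build_user.pkg.conf` unless `ARGUS_BUILD_UID` or `ARGUS_BUILD_GID` is set in the environment.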

## Image tag policy
- Non-pkg builds: output `:latest` by default.
- `--only server_pkg`: all images are emitted directly as `:<VERSION>` (`:latest` is not overwritten).
- `--only client_pkg`: the GPU bundle is emitted only as `:<VERSION>` (`:latest` is not overwritten).
- `--only cpu_bundle`: emits only `:<VERSION>` by default; add `--tag-latest` to also tag `:latest` for compatibility with the default swarm_tests compose file.
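
As a rough illustration of the policy above, the mapping from target to emitted tags could be written as a small lookup. The function name `tags_for_target` is hypothetical and covers only the targets named in the four bullets; it does not reproduce how `build_images.sh` itself decides tags.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the tags a target is expected to produce,
# one per line, following the four README bullets above.

tags_for_target() {
  local target="$1" version="$2" tag_latest="${3:-false}"
  case "$target" in
    core|master|metric|web|alert|sys)
      # Non-pkg builds default to :latest.
      echo "latest" ;;
    server_pkg|client_pkg)
      # pkg builds emit only the version tag and leave :latest alone.
      echo "$version" ;;
    cpu_bundle)
      echo "$version"
      # --tag-latest additionally tags :latest.
      if [ "$tag_latest" = "true" ]; then echo "latest"; fi ;;
    *)
      echo "unknown target: $target" >&2
      return 1 ;;
  esac
}
```

For example, `tags_for_target cpu_bundle 20251114 true` prints `20251114` and `latest` on separate lines.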

## Default build targets without --only
When `--only` is not specified, the script builds the base image set (no bundles or installation packages):
- core: `argus-elasticsearch:latest`, `argus-kibana:latest`, `argus-bind9:latest`
- master: `argus-master:latest` (non-offline)
- metric: `argus-metric-ftp:latest`, `argus-metric-prometheus:latest`, `argus-metric-grafana:latest`
- web: `argus-web-frontend:latest`, `argus-web-proxy:latest`
- alert: `argus-alertmanager:latest`
- sys: `argus-sys-node:latest`, `argus-sys-metric-test-node:latest`, `argus-sys-metric-test-gpu-node:latest`

Note: the default tag is `:latest`; UID/GID follow the non-pkg chain (`build_user.local.conf → build_user.conf`, overridable via environment variables).

## Common options
- `--intranet`: use intranet build arguments (enabled as needed in each Dockerfile).
- `--no-cache`: disable the Docker layer cache.
- `--only <list>`: comma-separated targets, e.g. `--only core,master,metric,web,alert`.
- `--version YYMMDD`: date tag for bundle/pkg builds (required for cpu_bundle/gpu_bundle/server_pkg/client_pkg). The examples in this document pass an eight-digit date such as `20251114`.
- `--client-semver X.Y.Z`: semantic version of the all-in-one-full client (optional).
- `--cuda VER`: CUDA base image version for the GPU bundle (default 12.2.2).
- `--tag-latest`: also tag `:latest` when building the CPU bundle.

## Automatic retries
- A failed single-image build is retried automatically (3 attempts by default, 5s apart).
- The final attempt automatically uses `DOCKER_BUILDKIT=0`, which mitigates "failed to receive status: context canceled".
- Tunable via the `ARGUS_BUILD_RETRIES` and `ARGUS_BUILD_RETRY_DELAY` environment variables.
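
A minimal sketch of this retry behaviour, assuming a generic wrapper named `retry_with_fallback` (a hypothetical name, not a function in `build_images.sh`) that runs any command and sets `DOCKER_BUILDKIT=0` only for the last attempt:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper illustrating the retry policy described above.
# It honours ARGUS_BUILD_RETRIES / ARGUS_BUILD_RETRY_DELAY and sets
# DOCKER_BUILDKIT=0 in the environment of the final attempt only.

retry_with_fallback() {
  local tries="${ARGUS_BUILD_RETRIES:-3}"
  local delay="${ARGUS_BUILD_RETRY_DELAY:-5}"
  local attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if [ "$attempt" -eq "$tries" ]; then
      # Final attempt: fall back to the legacy builder.
      if DOCKER_BUILDKIT=0 "$@"; then return 0; fi
    else
      if "$@"; then return 0; fi
    fi
    # Wait between attempts, but not after the last one.
    if [ "$attempt" -lt "$tries" ]; then sleep "$delay"; fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

It would be used as, for instance, `retry_with_fallback docker build -t argus-kibana:latest -f src/log/kibana/build/Dockerfile .`. Passing the command as separate arguments avoids re-parsing it through `eval`.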

---

## Scenario 1: System integration tests (src/sys/tests)
Builds the images used for system-level end-to-end tests (default `:latest`).

Example:
```
# Build the core and surrounding images
./build/build_images.sh --only core,master,metric,web,alert,sys
```
Output:
- Local images: `argus-elasticsearch:latest`, `argus-kibana:latest`, `argus-master:latest`, `argus-metric-ftp:latest`, `argus-metric-prometheus:latest`, `argus-metric-grafana:latest`, `argus-alertmanager:latest`, `argus-web-frontend:latest`, `argus-web-proxy:latest`, `argus-sys-node:latest`, and so on.

Notes:
- UID/GID are read from `build_user.local.conf → build_user.conf` (or overridden by environment variables).
- For how to run sys/tests, see `src/sys/tests/README.md`.

---

## Scenario 2: Swarm system integration tests (src/sys/swarm_tests)
Requires the server images plus the CPU node bundle image.

Steps:
1) Build the server images (default `:latest`)
```
./build/build_images.sh --only core,master,metric,web,alert
```
2) Build the CPU bundle (directly FROM ubuntu:22.04)
```
# Version tag only
./build/build_images.sh --only cpu_bundle --version 20251114
# For compatibility with the swarm_tests default of latest:
./build/build_images.sh --only cpu_bundle --version 20251114 --tag-latest
```
3) Run the Swarm tests
```
cd src/sys/swarm_tests
# If latest was not tagged, specify the image first:
export NODE_BUNDLE_IMAGE_TAG=argus-sys-metric-test-node-bundle:20251114
./scripts/01_server_up.sh
./scripts/02_wait_ready.sh
./scripts/03_nodes_up.sh
./scripts/04_metric_verify.sh # verify Prometheus/Grafana/nodes.json and the log pipeline
./scripts/99_down.sh # tear down
```
Output:
- Local images: `argus-*:latest` and `argus-sys-metric-test-node-bundle:20251114` (or latest).
- `swarm_tests/private-*`: persisted runtime files.

Notes:
- The CPU bundle build user follows the non-pkg chain (local.conf → conf).
- `04_metric_verify.sh` has built-in Fluent Bit startup and config-fix logic; if a service is occasionally not ready, rerun the script once and it passes.

---

## Scenario 3: Building offline installation packages (deployment_new)
Both the Server and Client-GPU packages use direct version output: only the `:<VERSION>` tag is produced and `:latest` is left unchanged.

1) Server package
```
./build/build_images.sh --only server_pkg --version 20251114
```
Output:
- Local images: `argus-<module>:20251114` (latest is not touched).
- Package: `deployment_new/artifact/server/20251114/` and `server_20251114.tar.gz`
- Package contents: per-image tar.gz files, compose/.env.example, scripts (config/install/selfcheck/diagnose, etc.), docs, manifest/checksums.

2) Client-GPU package
```
# Also builds the GPU bundle (only :<VERSION>, latest untouched) and generates the client package
./build/build_images.sh --only client_pkg --version 20251114 \
  --client-semver 1.44.0 --cuda 12.2.2
```
Output:
- Local image: `argus-sys-metric-test-node-bundle-gpu:20251114`
- Package: `deployment_new/artifact/client_gpu/20251114/` and `client_gpu_20251114.tar.gz`
- Package contents: GPU bundle image tar.gz, busybox.tar, compose/.env.example, scripts (config/install/uninstall), docs, manifest/checksums.

Notes:
- pkg builds use the UID/GID from `configs/build_user.pkg.conf` (overridable via the environment).
- `PKG_VERSION=<VERSION>` in the packaged `.env.example` strictly matches the image tag.
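
The last note can be checked mechanically. The sketch below is only an illustration: `check_pkg_version` is a hypothetical helper, and it assumes `.env.example` contains a plain `PKG_VERSION=<VERSION>` line and that the image reference ends in `:<tag>`.

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed only when PKG_VERSION in an env file equals
# the tag part of an image reference such as argus-master:20251114.

check_pkg_version() {
  local env_file="$1" image_ref="$2" pkg_version image_tag
  # Take the first PKG_VERSION=... line, if any.
  pkg_version="$(sed -n 's/^PKG_VERSION=//p' "$env_file" | head -n1)"
  # Everything after the last colon is treated as the tag.
  image_tag="${image_ref##*:}"
  [ -n "$pkg_version" ] && [ "$pkg_version" = "$image_tag" ]
}
```

For example, after unpacking a server package, `check_pkg_version compose/.env.example argus-master:20251114` would return success only if the file pins `PKG_VERSION=20251114`.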

---

## FAQ
- The build reports `failed to receive status: context canceled`?
  - Per-image retries are built in, and the final attempt disables BuildKit; try again with `--intranet` and `--no-cache`, or run `docker builder prune -f` and retry.
- If I run a non-pkg (latest) build first and then a pkg (version) build, will the package come out wrong?
  - No. Layers that involve the UID/GID are rebuilt because the arguments change, other layers are reused from cache, and the ownership and runtime account of the final pkg artifacts follow `build_user.pkg.conf`.
- swarm_tests pulls `:latest` by default, but I only built the `:<VERSION>` CPU bundle. What should I do?
  - Before running, `export NODE_BUNDLE_IMAGE_TAG=argus-sys-metric-test-node-bundle:<VERSION>`, or add `--tag-latest` when building.
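
The last answer amounts to a simple precedence rule, sketched below. `select_node_bundle_image` is a hypothetical helper, not something swarm_tests provides; it only shows one way to pick the image reference before starting the nodes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide which CPU node bundle image to run.
# Precedence: explicit NODE_BUNDLE_IMAGE_TAG, then a given version, then latest.

select_node_bundle_image() {
  local version="${1:-}"
  local repo="argus-sys-metric-test-node-bundle"
  if [ -n "${NODE_BUNDLE_IMAGE_TAG:-}" ]; then
    # An explicit export always wins.
    echo "$NODE_BUNDLE_IMAGE_TAG"
  elif [ -n "$version" ]; then
    echo "$repo:$version"
  else
    echo "$repo:latest"
  fi
}
```

A typical use would be `export NODE_BUNDLE_IMAGE_TAG="$(select_node_bundle_image 20251114)"` before running `./scripts/03_nodes_up.sh`.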
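
---

If further automation is needed (for example, generating a BUILD_SUMMARY.txt that summarizes image digests and build parameters), it can be added to the pkg output stage; I can fill this in on request.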

@@ -10,46 +10,22 @@ Usage: $0 [OPTIONS]
Options:
  --intranet             Use intranet mirror for log/bind builds
  --master-offline       Build master offline image (requires src/master/offline_wheels.tar.gz)
  --metric               Build metric module images (ftp, prometheus, grafana, test nodes)
  --no-cache             Build all images without using Docker layer cache
  --only LIST            Comma-separated targets to build: core,master,metric,web,alert,sys,gpu_bundle,cpu_bundle,server_pkg,client_pkg,all
  --version DATE         Date tag used by gpu_bundle/server_pkg/client_pkg (e.g. 20251112)
  --client-semver X.Y.Z  Override client semver used in all-in-one-full artifact (optional)
  --cuda VER             CUDA runtime version for NVIDIA base (default: 12.2.2)
  --tag-latest           Also tag bundle image as :latest (for cpu_bundle only; default off)
  -h, --help             Show this help message

Examples:
  $0                     # Build with default sources
  $0 --intranet          # Build with intranet mirror
  $0 --master-offline    # Additionally build argus-master:offline
  $0 --metric            # Additionally build metric module images
  $0 --intranet --master-offline --metric
  $0 --intranet --master-offline
EOF
}

use_intranet=false
build_core=true
build_master=true
build_master_offline=false
build_metric=true
build_web=true
build_alert=true
build_sys=true
build_gpu_bundle=false
build_cpu_bundle=false
build_server_pkg=false
build_client_pkg=false
need_bind_image=true
need_metric_ftp=true
no_cache=false

bundle_date=""
client_semver=""
cuda_ver="12.2.2"
DEFAULT_IMAGE_TAG="latest"
tag_latest=false

while [[ $# -gt 0 ]]; do
    case $1 in
        --intranet)
@@ -65,55 +41,10 @@ while [[ $# -gt 0 ]]; do
            build_master_offline=true
            shift
            ;;
        --metric)
            build_metric=true
            shift
            ;;
        --no-cache)
            no_cache=true
            shift
            ;;
        --only)
            if [[ -z ${2:-} ]]; then
                echo "--only requires a target list" >&2; exit 1
            fi
            sel="$2"; shift 2
            # reset all, then enable selected
            build_core=false; build_master=false; build_metric=false; build_web=false; build_alert=false; build_sys=false; build_gpu_bundle=false; build_cpu_bundle=false; build_server_pkg=false; build_client_pkg=false
            IFS=',' read -ra parts <<< "$sel"
            for p in "${parts[@]}"; do
                case "$p" in
                    core) build_core=true ;;
                    master) build_master=true ;;
                    metric) build_metric=true ;;
                    web) build_web=true ;;
                    alert) build_alert=true ;;
                    sys) build_sys=true ;;
                    gpu_bundle) build_gpu_bundle=true ;;
                    cpu_bundle) build_cpu_bundle=true ;;
                    server_pkg) build_server_pkg=true; build_core=true; build_master=true; build_metric=true; build_web=true; build_alert=true ;;
                    client_pkg) build_client_pkg=true ;;
                    all) build_core=true; build_master=true; build_metric=true; build_web=true; build_alert=true; build_sys=true ;;
                    *) echo "Unknown --only target: $p" >&2; exit 1 ;;
                esac
            done
            ;;
        --version)
            if [[ -z ${2:-} ]]; then echo "--version requires a value like 20251112" >&2; exit 1; fi
            bundle_date="$2"; shift 2
            ;;
        --client-semver)
            if [[ -z ${2:-} ]]; then echo "--client-semver requires a value like 1.43.0" >&2; exit 1; fi
            client_semver="$2"; shift 2
            ;;
        --cuda)
            if [[ -z ${2:-} ]]; then echo "--cuda requires a value like 12.2.2" >&2; exit 1; fi
            cuda_ver="$2"; shift 2
            ;;
        --tag-latest)
            tag_latest=true
            shift
            ;;
        -h|--help)
            show_help
            exit 0
@@ -126,11 +57,6 @@ while [[ $# -gt 0 ]]; do
    esac
done

if [[ "$build_server_pkg" == true ]]; then
    need_bind_image=false
    need_metric_ftp=false
fi

root="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
. "$root/scripts/common/build_user.sh"

@@ -142,16 +68,6 @@ fi

cd "$root"

# Set default image tag policy before building
if [[ "$build_server_pkg" == true ]]; then
    DEFAULT_IMAGE_TAG="${bundle_date:-latest}"
fi

# Select build user profile for pkg vs default
if [[ "$build_server_pkg" == true || "$build_client_pkg" == true ]]; then
    export ARGUS_BUILD_PROFILE=pkg
fi

load_build_user
build_args+=("--build-arg" "ARGUS_BUILD_UID=${ARGUS_BUILD_UID}" "--build-arg" "ARGUS_BUILD_GID=${ARGUS_BUILD_GID}")

@@ -199,459 +115,45 @@ build_image() {
    local image_name=$1
    local dockerfile_path=$2
    local tag=$3
    local context="."
    shift 3

    if [[ $# -gt 0 ]]; then
        context=$1
        shift
    fi

    local extra_args=("$@")

    echo "🔄 Building $image_name image..."
    echo " Dockerfile: $dockerfile_path"
    echo " Tag: $tag"
    echo " Context: $context"

    local tries=${ARGUS_BUILD_RETRIES:-3}
    local delay=${ARGUS_BUILD_RETRY_DELAY:-5}
    local attempt=1
    while (( attempt <= tries )); do
        local prefix=""
        if (( attempt == tries )); then
            # final attempt: disable BuildKit to avoid docker/dockerfile front-end pulls
            prefix="DOCKER_BUILDKIT=0"
            echo " Attempt ${attempt}/${tries} (fallback: DOCKER_BUILDKIT=0)"
        else
            echo " Attempt ${attempt}/${tries}"
        fi
        if eval $prefix docker build "${build_args[@]}" "${extra_args[@]}" -f "$dockerfile_path" -t "$tag" "$context"; then
            echo "✅ $image_name image built successfully"
            return 0
        fi
        echo "⚠️ Build failed for $image_name (attempt ${attempt}/${tries})."
        if (( attempt < tries )); then
            echo " Retrying in ${delay}s..."
            sleep "$delay"
        fi
        attempt=$((attempt+1))
    done
    echo "❌ Failed to build $image_name image after ${tries} attempts"
    return 1
}

pull_base_image() {
    local image_ref=$1
    local attempts=${2:-3}
    local delay=${3:-5}

    # If the image already exists locally, skip pulling.
    if docker image inspect "$image_ref" >/dev/null 2>&1; then
        echo " Local image present; skip pull: $image_ref"
    if docker build "${build_args[@]}" "${extra_args[@]}" -f "$dockerfile_path" -t "$tag" .; then
        echo "✅ $image_name image built successfully"
        return 0
    else
        echo "❌ Failed to build $image_name image"
        return 1
    fi

    for ((i=1; i<=attempts; i++)); do
        echo " Pulling base image ($i/$attempts): $image_ref"
        if docker pull "$image_ref" >/dev/null; then
            echo " Base image ready: $image_ref"
            return 0
        fi
        echo " Pull failed: $image_ref"
        if (( i < attempts )); then
            echo " Retrying in ${delay}s..."
            sleep "$delay"
        fi
    done

    echo "❌ Unable to pull base image after ${attempts} attempts: $image_ref"
    return 1
}

images_built=()
build_failed=false

build_gpu_bundle_image() {
    local date_tag="$1" # e.g. 20251112
    local cuda_ver_local="$2" # e.g. 12.2.2
    local client_ver="$3" # semver like 1.43.0
    if build_image "Elasticsearch" "src/log/elasticsearch/build/Dockerfile" "argus-elasticsearch:latest"; then
        images_built+=("argus-elasticsearch:latest")
    else
        build_failed=true
    fi

    if [[ -z "$date_tag" ]]; then
        echo "❌ gpu_bundle requires --version YYMMDD (e.g. 20251112)" >&2
        return 1
    fi
    echo ""

    # sanitize cuda version (trim trailing dots like '12.2.')
    while [[ "$cuda_ver_local" == *"." ]]; do cuda_ver_local="${cuda_ver_local%.}"; done
    if build_image "Kibana" "src/log/kibana/build/Dockerfile" "argus-kibana:latest"; then
        images_built+=("argus-kibana:latest")
    else
        build_failed=true
    fi

    # Resolve effective CUDA base tag
    local resolve_cuda_base_tag
    resolve_cuda_base_tag() {
        local want="$1" # can be 12, 12.2 or 12.2.2
        local major minor patch
        if [[ "$want" =~ ^([0-9]+)\.([0-9]+)\.([0-9]+)$ ]]; then
            major="${BASH_REMATCH[1]}"; minor="${BASH_REMATCH[2]}"; patch="${BASH_REMATCH[3]}"
            echo "nvidia/cuda:${major}.${minor}.${patch}-runtime-ubuntu22.04"; return 0
        elif [[ "$want" =~ ^([0-9]+)\.([0-9]+)$ ]]; then
            major="${BASH_REMATCH[1]}"; minor="${BASH_REMATCH[2]}"
            # try to find best local patch for major.minor
            local best
            best=$(docker images --format '{{.Repository}}:{{.Tag}}' nvidia/cuda 2>/dev/null | \
                grep -E "^nvidia/cuda:${major}\.${minor}\\.[0-9]+-runtime-ubuntu22\.04$" | \
                sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.)([0-9]+)-runtime-ubuntu22\.04$#\1\2#g' | \
                sort -V | tail -n1 || true)
            if [[ -n "$best" ]]; then
                echo "nvidia/cuda:${best}-runtime-ubuntu22.04"; return 0
            fi
            # fallback patch if none local
            echo "nvidia/cuda:${major}.${minor}.2-runtime-ubuntu22.04"; return 0
        elif [[ "$want" =~ ^([0-9]+)$ ]]; then
            major="${BASH_REMATCH[1]}"
            # try to find best local for this major
            local best
            best=$(docker images --format '{{.Repository}}:{{.Tag}}' nvidia/cuda 2>/dev/null | \
                grep -E "^nvidia/cuda:${major}\\.[0-9]+\\.[0-9]+-runtime-ubuntu22\.04$" | \
                sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.[0-9]+)-runtime-ubuntu22\.04$#\1#g' | \
                sort -V | tail -n1 || true)
            if [[ -n "$best" ]]; then
                echo "nvidia/cuda:${best}-runtime-ubuntu22.04"; return 0
            fi
            echo "nvidia/cuda:${major}.2.2-runtime-ubuntu22.04"; return 0
        else
            # invalid format, fallback to default
            echo "nvidia/cuda:12.2.2-runtime-ubuntu22.04"; return 0
        fi
    }
    echo ""

    local base_image
    base_image=$(resolve_cuda_base_tag "$cuda_ver_local")

    echo
    echo "🔧 Preparing one-click GPU bundle build"
    echo " CUDA runtime base: ${base_image}"
    echo " Bundle tag : ${date_tag}"

    # 1) Ensure NVIDIA base image (skip pull if local)
    if ! pull_base_image "$base_image"; then
        # try once more with default if resolution failed
        if ! pull_base_image "nvidia/cuda:12.2.2-runtime-ubuntu22.04"; then
            return 1
        else
            base_image="nvidia/cuda:12.2.2-runtime-ubuntu22.04"
        fi
    fi

    # 2) Build latest argus-agent from source
    echo "\n🛠 Building argus-agent from src/agent"
    pushd "$root/src/agent" >/dev/null
    if ! bash scripts/build_binary.sh; then
        echo "❌ argus-agent build failed" >&2
        popd >/dev/null
        return 1
    fi
    if [[ ! -f "dist/argus-agent" ]]; then
        echo "❌ argus-agent binary missing after build" >&2
        popd >/dev/null
        return 1
    fi
    popd >/dev/null

    # 3) Inject agent into all-in-one-full plugin and package artifact
    local aio_root="$root/src/metric/client-plugins/all-in-one-full"
    local agent_bin_src="$root/src/agent/dist/argus-agent"
    local agent_bin_dst="$aio_root/plugins/argus-agent/bin/argus-agent"
    echo "\n📦 Updating all-in-one-full agent binary → $agent_bin_dst"
    cp -f "$agent_bin_src" "$agent_bin_dst"
    chmod +x "$agent_bin_dst" || true

    pushd "$aio_root" >/dev/null
    local prev_version
    prev_version="$(cat config/VERSION 2>/dev/null || echo "1.0.0")"
    local use_version="$prev_version"
    if [[ -n "$client_semver" ]]; then
        echo "${client_semver}" > config/VERSION
        use_version="$client_semver"
    fi
    echo " Packaging all-in-one-full artifact version: $use_version"
    if ! bash scripts/package_artifact.sh --force; then
        echo "❌ package_artifact.sh failed" >&2
        # restore VERSION if changed
        if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
        popd >/dev/null
        return 1
    fi

    local artifact_dir="$aio_root/artifact/$use_version"
    local artifact_tar
    artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
    if [[ -z "$artifact_tar" ]]; then
        echo " No argus-metric_*.tar.gz found; invoking publish_artifact.sh to assemble..."
        local owner="$(id -u):$(id -g)"
        if ! bash scripts/publish_artifact.sh "$use_version" --output-dir "$artifact_dir" --owner "$owner"; then
            echo "❌ publish_artifact.sh failed" >&2
            if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
            popd >/dev/null
            return 1
        fi
        artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
    fi
    if [[ -z "$artifact_tar" ]]; then
        echo "❌ artifact tar not found under $artifact_dir" >&2
        if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
        popd >/dev/null
        return 1
    fi
    # restore VERSION if changed (keep filesystem clean)
    if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
    popd >/dev/null

    # 4) Stage docker build context
    local bundle_ctx="$root/src/bundle/gpu-node-bundle/.build-$date_tag"
    echo "\n🧰 Staging docker build context: $bundle_ctx"
    rm -rf "$bundle_ctx"
    mkdir -p "$bundle_ctx/bundle" "$bundle_ctx/private"
    cp "$root/src/bundle/gpu-node-bundle/Dockerfile" "$bundle_ctx/"
    cp "$root/src/bundle/gpu-node-bundle/node-bootstrap.sh" "$bundle_ctx/"
    cp "$root/src/bundle/gpu-node-bundle/health-watcher.sh" "$bundle_ctx/"
    # bundle tar
    cp "$artifact_tar" "$bundle_ctx/bundle/"
    # offline fluent-bit assets (optional but useful)
    if [[ -d "$root/src/log/fluent-bit/build/etc" ]]; then
        cp -r "$root/src/log/fluent-bit/build/etc" "$bundle_ctx/private/"
    fi
    if [[ -d "$root/src/log/fluent-bit/build/packages" ]]; then
        cp -r "$root/src/log/fluent-bit/build/packages" "$bundle_ctx/private/"
    fi
    if [[ -f "$root/src/log/fluent-bit/build/start-fluent-bit.sh" ]]; then
        cp "$root/src/log/fluent-bit/build/start-fluent-bit.sh" "$bundle_ctx/private/"
    fi

    # 5) Build the final bundle image (directly from NVIDIA base)
    local image_tag="argus-sys-metric-test-node-bundle-gpu:${date_tag}"
    echo "\n🔄 Building GPU Bundle image"
    if build_image "GPU Bundle" "$bundle_ctx/Dockerfile" "$image_tag" "$bundle_ctx" \
        --build-arg CUDA_VER="$(echo "$base_image" | sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.[0-9]+)-runtime-ubuntu22\.04$#\1#')" \
        --build-arg CLIENT_VER="$use_version" \
        --build-arg BUNDLE_DATE="$date_tag"; then
        images_built+=("$image_tag")
        # In non-pkg mode, also tag latest for convenience
        if [[ "${ARGUS_PKG_BUILD:-0}" != "1" ]]; then
            docker tag "$image_tag" argus-sys-metric-test-node-bundle-gpu:latest >/dev/null 2>&1 || true
        fi
        return 0
    else
        return 1
    fi
}

# Tag helper: ensure :<date_tag> exists for a list of repos
ensure_version_tags() {
    local date_tag="$1"; shift
    local repos=("$@")
    for repo in "${repos[@]}"; do
        if docker image inspect "$repo:$date_tag" >/dev/null 2>&1; then
            :
        elif docker image inspect "$repo:latest" >/dev/null 2>&1; then
            docker tag "$repo:latest" "$repo:$date_tag" || true
        else
            echo "❌ missing image for tagging: $repo (need :latest or :$date_tag)" >&2
            return 1
        fi
    done
    return 0
}

# Build server package after images are built
build_server_pkg_bundle() {
    local date_tag="$1"
    if [[ -z "$date_tag" ]]; then
        echo "❌ server_pkg requires --version YYMMDD" >&2
        return 1
    fi
    local repos=(
        argus-master argus-elasticsearch argus-kibana \
        argus-metric-prometheus argus-metric-grafana \
        argus-alertmanager argus-web-frontend argus-web-proxy
    )
    echo "\n🔖 Verifying server images with :$date_tag and collecting digests (Bind/FTP excluded; relying on Docker DNS aliases)"
    for repo in "${repos[@]}"; do
        if ! docker image inspect "$repo:$date_tag" >/dev/null 2>&1; then
            echo "❌ required image missing: $repo:$date_tag (build phase should have produced it)" >&2
            return 1
        fi
    done
    # Optional: show digests
    for repo in "${repos[@]}"; do
        local digest
        digest=$(docker images --digests --format '{{.Repository}}:{{.Tag}} {{.Digest}}' | awk -v r="$repo:$date_tag" '$1==r{print $2}' | head -n1)
        printf ' • %s@%s\n' "$repo:$date_tag" "${digest:-<none>}"
    done
    echo "\n📦 Building server package via deployment_new/build/make_server_package.sh --version $date_tag"
    if ! "$root/deployment_new/build/make_server_package.sh" --version "$date_tag"; then
        echo "❌ make_server_package.sh failed" >&2
        return 1
    fi
    return 0
}

# Build client package: ensure gpu bundle image exists, then package client_gpu
build_client_pkg_bundle() {
    local date_tag="$1"
    local semver="$2"
    local cuda="$3"
    if [[ -z "$date_tag" ]]; then
        echo "❌ client_pkg requires --version YYMMDD" >&2
        return 1
    fi
    local bundle_tag="argus-sys-metric-test-node-bundle-gpu:${date_tag}"
    if ! docker image inspect "$bundle_tag" >/dev/null 2>&1; then
        echo "\n🧩 GPU bundle image $bundle_tag missing; building it first..."
        ARGUS_PKG_BUILD=1
        export ARGUS_PKG_BUILD
        if ! build_gpu_bundle_image "$date_tag" "$cuda" "$semver"; then
            return 1
        fi
    else
        echo "\n✅ Using existing GPU bundle image: $bundle_tag"
    fi
    echo "\n📦 Building client GPU package via deployment_new/build/make_client_gpu_package.sh --version $date_tag --image $bundle_tag"
    if ! "$root/deployment_new/build/make_client_gpu_package.sh" --version "$date_tag" --image "$bundle_tag"; then
        echo "❌ make_client_gpu_package.sh failed" >&2
        return 1
    fi
    return 0
}

# Build CPU bundle image directly FROM ubuntu:22.04 (no intermediate base)
build_cpu_bundle_image() {
    local date_tag="$1" # e.g. 20251113
    local client_ver_in="$2" # semver like 1.43.0 (optional)
    local want_tag_latest="$3" # true/false

    if [[ -z "$date_tag" ]]; then
        echo "❌ cpu_bundle requires --version YYMMDD" >&2
        return 1
    fi

    echo "\n🔧 Preparing one-click CPU bundle build"
    echo " Base: ubuntu:22.04"
    echo " Bundle tag: ${date_tag}"

    # 1) Build latest argus-agent from source
    echo "\n🛠 Building argus-agent from src/agent"
    pushd "$root/src/agent" >/dev/null
    if ! bash scripts/build_binary.sh; then
        echo "❌ argus-agent build failed" >&2
        popd >/dev/null
        return 1
    fi
    if [[ ! -f "dist/argus-agent" ]]; then
        echo "❌ argus-agent binary missing after build" >&2
        popd >/dev/null
        return 1
    fi
    popd >/dev/null

    # 2) Inject agent into all-in-one-full plugin and package artifact
    local aio_root="$root/src/metric/client-plugins/all-in-one-full"
    local agent_bin_src="$root/src/agent/dist/argus-agent"
    local agent_bin_dst="$aio_root/plugins/argus-agent/bin/argus-agent"
    echo "\n📦 Updating all-in-one-full agent binary → $agent_bin_dst"
    cp -f "$agent_bin_src" "$agent_bin_dst"
    chmod +x "$agent_bin_dst" || true

    pushd "$aio_root" >/dev/null
    local prev_version use_version
    prev_version="$(cat config/VERSION 2>/dev/null || echo "1.0.0")"
    use_version="$prev_version"
    if [[ -n "$client_ver_in" ]]; then
        echo "$client_ver_in" > config/VERSION
        use_version="$client_ver_in"
    fi
    echo " Packaging all-in-one-full artifact: version=$use_version"
    if ! bash scripts/package_artifact.sh --force; then
        echo "❌ package_artifact.sh failed" >&2
        [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
        popd >/dev/null
        return 1
    fi
    local artifact_dir="$aio_root/artifact/$use_version"
    local artifact_tar
    artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
    if [[ -z "$artifact_tar" ]]; then
        echo " No argus-metric_*.tar.gz found; invoking publish_artifact.sh ..."
        local owner="$(id -u):$(id -g)"
        if ! bash scripts/publish_artifact.sh "$use_version" --output-dir "$artifact_dir" --owner "$owner"; then
            echo "❌ publish_artifact.sh failed" >&2
            [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
            popd >/dev/null
            return 1
        fi
        artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
    fi
    [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
    popd >/dev/null

    # 3) Stage docker build context
    local bundle_ctx="$root/src/bundle/cpu-node-bundle/.build-$date_tag"
    echo "\n🧰 Staging docker build context: $bundle_ctx"
    rm -rf "$bundle_ctx"
    mkdir -p "$bundle_ctx/bundle" "$bundle_ctx/private"
    cp "$root/src/bundle/cpu-node-bundle/Dockerfile" "$bundle_ctx/"
    cp "$root/src/bundle/cpu-node-bundle/node-bootstrap.sh" "$bundle_ctx/"
    cp "$root/src/bundle/cpu-node-bundle/health-watcher.sh" "$bundle_ctx/"
    # bundle tar
    cp "$artifact_tar" "$bundle_ctx/bundle/"
    # offline fluent-bit assets
    if [[ -d "$root/src/log/fluent-bit/build/etc" ]]; then
        cp -r "$root/src/log/fluent-bit/build/etc" "$bundle_ctx/private/"
    fi
    if [[ -d "$root/src/log/fluent-bit/build/packages" ]]; then
        cp -r "$root/src/log/fluent-bit/build/packages" "$bundle_ctx/private/"
    fi
    if [[ -f "$root/src/log/fluent-bit/build/start-fluent-bit.sh" ]]; then
        cp "$root/src/log/fluent-bit/build/start-fluent-bit.sh" "$bundle_ctx/private/"
    fi

    # 4) Build final bundle image
    local image_tag="argus-sys-metric-test-node-bundle:${date_tag}"
    echo "\n🔄 Building CPU Bundle image"
    if build_image "CPU Bundle" "$bundle_ctx/Dockerfile" "$image_tag" "$bundle_ctx"; then
        images_built+=("$image_tag")
        if [[ "$want_tag_latest" == "true" ]]; then
            docker tag "$image_tag" argus-sys-metric-test-node-bundle:latest >/dev/null 2>&1 || true
        fi
        return 0
    else
        return 1
    fi
}

if [[ "$build_core" == true ]]; then
    if build_image "Elasticsearch" "src/log/elasticsearch/build/Dockerfile" "argus-elasticsearch:${DEFAULT_IMAGE_TAG}"; then
        images_built+=("argus-elasticsearch:${DEFAULT_IMAGE_TAG}")
    else
        build_failed=true
    fi

    echo ""

    if build_image "Kibana" "src/log/kibana/build/Dockerfile" "argus-kibana:${DEFAULT_IMAGE_TAG}"; then
        images_built+=("argus-kibana:${DEFAULT_IMAGE_TAG}")
    else
        build_failed=true
    fi

    echo ""

    if [[ "$need_bind_image" == true ]]; then
        if build_image "BIND9" "src/bind/build/Dockerfile" "argus-bind9:${DEFAULT_IMAGE_TAG}"; then
            images_built+=("argus-bind9:${DEFAULT_IMAGE_TAG}")
        else
            build_failed=true
        fi
    fi
    if build_image "BIND9" "src/bind/build/Dockerfile" "argus-bind9:latest"; then
        images_built+=("argus-bind9:latest")
    else
        build_failed=true
    fi

    echo ""
@@ -660,7 +162,7 @@ if [[ "$build_master" == true ]]; then
    echo ""
    echo "🔄 Building Master image..."
    pushd "$master_root" >/dev/null
    master_args=("--tag" "argus-master:${DEFAULT_IMAGE_TAG}")
    master_args=("--tag" "argus-master:latest")
    if [[ "$use_intranet" == true ]]; then
        master_args+=("--intranet")
    fi
@@ -674,7 +176,7 @@ if [[ "$build_master" == true ]]; then
        if [[ "$build_master_offline" == true ]]; then
            images_built+=("argus-master:offline")
        else
            images_built+=("argus-master:${DEFAULT_IMAGE_TAG}")
            images_built+=("argus-master:latest")
        fi
    else
        build_failed=true
@@ -682,176 +184,6 @@ if [[ "$build_master" == true ]]; then
    popd >/dev/null
fi

if [[ "$build_metric" == true ]]; then
    echo ""
    echo "Building Metric module images..."

    metric_base_images=(
        "ubuntu/prometheus:3-24.04_stable"
        "grafana/grafana:11.1.0"
    )

    if [[ "$need_metric_ftp" == true ]]; then
        metric_base_images+=("ubuntu:22.04")
    fi

    for base_image in "${metric_base_images[@]}"; do
        if ! pull_base_image "$base_image"; then
            build_failed=true
        fi
    done

    metric_builds=()
    if [[ "$need_metric_ftp" == true ]]; then
        metric_builds+=("Metric FTP|src/metric/ftp/build/Dockerfile|argus-metric-ftp:${DEFAULT_IMAGE_TAG}|src/metric/ftp/build")
    fi
    metric_builds+=(
        "Metric Prometheus|src/metric/prometheus/build/Dockerfile|argus-metric-prometheus:${DEFAULT_IMAGE_TAG}|src/metric/prometheus/build"
        "Metric Grafana|src/metric/grafana/build/Dockerfile|argus-metric-grafana:${DEFAULT_IMAGE_TAG}|src/metric/grafana/build"
    )

    for build_spec in "${metric_builds[@]}"; do
        IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
        if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
            images_built+=("$image_tag")
        else
            build_failed=true
        fi
        echo ""
    done
fi

# =======================================
# Sys (system tests) node images
# =======================================

if [[ "$build_sys" == true ]]; then
    echo ""
    echo "Building Sys node images..."

    sys_base_images=(
        "ubuntu:22.04"
        "nvidia/cuda:12.2.2-runtime-ubuntu22.04"
    )

    for base_image in "${sys_base_images[@]}"; do
        if ! pull_base_image "$base_image"; then
            build_failed=true
        fi
    done

    sys_builds=(
        "Sys Node|src/sys/build/node/Dockerfile|argus-sys-node:latest|."
        "Sys Metric Test Node|src/sys/build/test-node/Dockerfile|argus-sys-metric-test-node:latest|."
        "Sys Metric Test GPU Node|src/sys/build/test-gpu-node/Dockerfile|argus-sys-metric-test-gpu-node:latest|."
|
||||
)
|
||||
|
||||
for build_spec in "${sys_builds[@]}"; do
|
||||
IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
|
||||
if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
|
||||
images_built+=("$image_tag")
|
||||
else
|
||||
build_failed=true
|
||||
fi
|
||||
echo ""
|
||||
done
|
||||
fi
|
||||
|
||||
# =======================================
|
||||
# Web & Alert module images
|
||||
# =======================================
|
||||
|
||||
if [[ "$build_web" == true || "$build_alert" == true ]]; then
|
||||
echo ""
|
||||
echo "Building Web and Alert module images..."
|
||||
|
||||
# Pre-pull commonly used base images for stability
|
||||
web_alert_base_images=(
|
||||
"node:20"
|
||||
"ubuntu:24.04"
|
||||
)
|
||||
|
||||
for base_image in "${web_alert_base_images[@]}"; do
|
||||
if ! pull_base_image "$base_image"; then
|
||||
build_failed=true
|
||||
fi
|
||||
done
|
||||
|
||||
if [[ "$build_web" == true ]]; then
|
||||
web_builds=(
|
||||
"Web Frontend|src/web/build_tools/frontend/Dockerfile|argus-web-frontend:${DEFAULT_IMAGE_TAG}|."
|
||||
"Web Proxy|src/web/build_tools/proxy/Dockerfile|argus-web-proxy:${DEFAULT_IMAGE_TAG}|."
|
||||
)
|
||||
for build_spec in "${web_builds[@]}"; do
|
||||
IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
|
||||
if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
|
||||
images_built+=("$image_tag")
|
||||
else
|
||||
build_failed=true
|
||||
fi
|
||||
echo ""
|
||||
done
|
||||
fi
|
||||
|
||||
if [[ "$build_alert" == true ]]; then
|
||||
alert_builds=(
|
||||
"Alertmanager|src/alert/alertmanager/build/Dockerfile|argus-alertmanager:${DEFAULT_IMAGE_TAG}|."
|
||||
)
|
||||
for build_spec in "${alert_builds[@]}"; do
|
||||
IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
|
||||
if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
|
||||
images_built+=("$image_tag")
|
||||
else
|
||||
build_failed=true
|
||||
fi
|
||||
echo ""
|
||||
done
|
||||
fi
|
||||
fi
|
||||
|
||||
# =======================================
|
||||
# One-click GPU bundle (direct NVIDIA base)
|
||||
# =======================================
|
||||
|
||||
if [[ "$build_gpu_bundle" == true ]]; then
|
||||
echo ""
|
||||
echo "Building one-click GPU bundle image..."
|
||||
if ! build_gpu_bundle_image "$bundle_date" "$cuda_ver" "$client_semver"; then
|
||||
build_failed=true
|
||||
fi
|
||||
fi
|
||||
|
||||
# =======================================
|
||||
# One-click CPU bundle (from ubuntu:22.04)
|
||||
# =======================================
|
||||
if [[ "$build_cpu_bundle" == true ]]; then
|
||||
echo ""
|
||||
echo "Building one-click CPU bundle image..."
|
||||
if ! build_cpu_bundle_image "${bundle_date}" "${client_semver}" "${tag_latest}"; then
|
||||
build_failed=true
|
||||
fi
|
||||
fi
|
||||
|
||||
# =======================================
|
||||
# One-click Server/Client packaging
|
||||
# =======================================
|
||||
|
||||
if [[ "$build_server_pkg" == true ]]; then
|
||||
echo ""
|
||||
echo "🧳 Building one-click Server package..."
|
||||
if ! build_server_pkg_bundle "${bundle_date}"; then
|
||||
build_failed=true
|
||||
fi
|
||||
fi
|
||||
|
||||
if [[ "$build_client_pkg" == true ]]; then
|
||||
echo ""
|
||||
echo "🧳 Building one-click Client-GPU package..."
|
||||
if ! build_client_pkg_bundle "${bundle_date}" "${client_semver}" "${cuda_ver}"; then
|
||||
build_failed=true
|
||||
fi
|
||||
fi
|
||||
|
||||
echo "======================================="
|
||||
echo "📦 Build Summary"
|
||||
echo "======================================="
|
||||
@ -878,6 +210,7 @@ if [[ "$build_master_offline" == true ]]; then
|
||||
echo ""
|
||||
echo "🧳 Master offline wheels extracted to $master_offline_dir"
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "🚀 Next steps:"
|
||||
echo " ./build/save_images.sh --compress # export images"
|
||||
|
||||
@ -68,12 +68,6 @@ declare -A images=(
|
||||
["argus-kibana:latest"]="argus-kibana-latest.tar"
|
||||
["argus-bind9:latest"]="argus-bind9-latest.tar"
|
||||
["argus-master:offline"]="argus-master-offline.tar"
|
||||
["argus-metric-ftp:latest"]="argus-metric-ftp-latest.tar"
|
||||
["argus-metric-prometheus:latest"]="argus-metric-prometheus-latest.tar"
|
||||
["argus-metric-grafana:latest"]="argus-metric-grafana-latest.tar"
|
||||
["argus-web-frontend:latest"]="argus-web-frontend-latest.tar"
|
||||
["argus-web-proxy:latest"]="argus-web-proxy-latest.tar"
|
||||
["argus-alertmanager:latest"]="argus-alertmanager-latest.tar"
|
||||
)
|
||||
|
||||
# Function: check whether an image exists
|
||||
@ -226,4 +220,4 @@ fi
|
||||
|
||||
echo ""
|
||||
echo "✅ Image export completed successfully!"
|
||||
echo ""
|
||||
echo ""
|
||||
@ -1,6 +0,0 @@
|
||||
# Default build-time UID/GID for Argus images
|
||||
# Override by creating configs/build_user.local.conf with the same format.
|
||||
# Syntax: KEY=VALUE, supports UID/GID only. Whitespace and lines starting with # are ignored.
|
||||
|
||||
UID=2133
|
||||
GID=2015
|
||||
1
deployment_new/.gitignore
vendored
@ -1 +0,0 @@
|
||||
artifact/
|
||||
@ -1,14 +0,0 @@
|
||||
# deployment_new
|
||||
|
||||
This directory holds the new deployment packaging and delivery implementation (it does not affect the existing `deployment/`).
|
||||
|
||||
Milestone M1 (current implementation)
- `build/make_server_package.sh`: builds the Server package (per-service image tar.gz files, compose, .env.example, docs, private skeleton, manifest/checksums, final tar.gz).
- `build/make_client_gpu_package.sh`: builds the Client‑GPU package (GPU bundle image tar.gz, busybox.tar, compose, .env.example, docs, private skeleton, manifest/checksums, final tar.gz).
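
Because both packages ship a `checksums.txt`, a recipient can verify an unpacked package before installing it. The sketch below assumes the file holds `sha256sum` output with paths relative to the package root, which is how `checksum_dir` in `build/common.sh` appears to write it. The helper name `verify_package` is hypothetical and not part of the repository. It skips the `checksums.txt` entry itself, since a file cannot reliably record its own hash.

```bash
#!/usr/bin/env bash
# verify_package <package-root>
# Hypothetical helper: re-checks every file listed in checksums.txt.
verify_package() {
  local root="$1"
  [[ -f "$root/checksums.txt" ]] || { echo "no checksums.txt in $root" >&2; return 2; }
  (
    cd "$root" || exit 2
    # Drop the self-referential entry, then let sha256sum compare the rest.
    grep -v -E '[ *]\./checksums\.txt$' checksums.txt | sha256sum -c --quiet -
  )
}
```

Usage, from the directory that contains the unpacked package: `verify_package ./20250101 && echo "package intact"` (the directory name here is only an example).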
|
||||
|
||||
Templates
- `templates/server/compose/docker-compose.yml`: deployment-specific; images default to the `:${PKG_VERSION}` version tag and can be overridden via `.env`.
- `templates/client_gpu/compose/docker-compose.yml`: GPU-node-specific; uses the `:${PKG_VERSION}` version tag.
|
||||
|
||||
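Note: M1 only produces the installation packages and does not ship install scripts; the install/ops scripts will land in M2 and be included in the packages.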
|
||||
|
||||
@ -1,33 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
log() { echo -e "\033[0;34m[INFO]\033[0m $*"; }
|
||||
warn() { echo -e "\033[1;33m[WARN]\033[0m $*"; }
|
||||
err() { echo -e "\033[0;31m[ERR ]\033[0m $*" >&2; }
|
||||
|
||||
require_cmd() {
|
||||
local miss=0
|
||||
for c in "$@"; do
|
||||
if ! command -v "$c" >/dev/null 2>&1; then err "missing command: $c"; miss=1; fi
|
||||
done
|
||||
[[ $miss -eq 0 ]]
|
||||
}
|
||||
|
||||
today_version() { date +%Y%m%d; }
|
||||
|
||||
checksum_dir() {
|
||||
local dir="$1"; local out="$2"; : > "$out";
|
||||
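# Skip the output file itself when it lives in $dir: its hash would be stale once writing finishes.
(cd "$dir" && find . -type f ! -path "./$(basename "$out")" -print0 | sort -z | xargs -0 sha256sum) >> "$out"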
|
||||
}
|
||||
|
||||
make_dir() { mkdir -p "$1"; }
|
||||
|
||||
copy_tree() {
|
||||
local src="$1" dst="$2"; rsync -a --delete "$src/" "$dst/" 2>/dev/null || cp -r "$src/." "$dst/";
|
||||
}
|
||||
|
||||
gen_manifest() {
|
||||
local root="$1"; local out="$2"; : > "$out";
|
||||
(cd "$root" && find . -maxdepth 4 -type f -printf "%p\n" | sort) >> "$out"
|
||||
}
|
||||
|
||||
@ -1,131 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# Make client GPU package (versioned gpu bundle image, compose, env, docs, busybox)
|
||||
|
||||
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
|
||||
TEMPL_DIR="$ROOT_DIR/deployment_new/templates/client_gpu"
|
||||
ART_ROOT="$ROOT_DIR/deployment_new/artifact/client_gpu"
|
||||
|
||||
# Use deployment_new local common helpers
|
||||
COMMON_SH="$ROOT_DIR/deployment_new/build/common.sh"
|
||||
. "$COMMON_SH"
|
||||
|
||||
usage(){ cat <<EOF
|
||||
Build Client-GPU Package (deployment_new)
|
||||
|
||||
Usage: $(basename "$0") --version YYYYMMDD [--image IMAGE[:TAG]]
|
||||
|
||||
Defaults:
|
||||
image = argus-sys-metric-test-node-bundle-gpu:latest
|
||||
|
||||
Outputs: deployment_new/artifact/client_gpu/<YYYYMMDD>/ and client_gpu_YYYYMMDD.tar.gz
|
||||
EOF
|
||||
}
|
||||
|
||||
VERSION=""
|
||||
IMAGE="argus-sys-metric-test-node-bundle-gpu:latest"
|
||||
while [[ $# -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--version) VERSION="$2"; shift 2;;
|
||||
--image) IMAGE="$2"; shift 2;;
|
||||
-h|--help) usage; exit 0;;
|
||||
*) err "unknown arg: $1"; usage; exit 1;;
|
||||
esac
|
||||
done
|
||||
if [[ -z "$VERSION" ]]; then VERSION="$(today_version)"; fi
|
||||
|
||||
require_cmd docker tar gzip
|
||||
|
||||
STAGE="$(mktemp -d)"; trap 'rm -rf "$STAGE"' EXIT
|
||||
PKG_DIR="$ART_ROOT/$VERSION"
|
||||
mkdir -p "$PKG_DIR" "$STAGE/images" "$STAGE/compose" "$STAGE/docs" "$STAGE/scripts" "$STAGE/private/argus"
|
||||
|
||||
# 1) Save GPU bundle image with version tag
|
||||
if ! docker image inspect "$IMAGE" >/dev/null 2>&1; then
|
||||
err "missing image: $IMAGE"; exit 1; fi
|
||||
|
||||
REPO="${IMAGE%%:*}"; TAG_VER="$REPO:$VERSION"
|
||||
docker tag "$IMAGE" "$TAG_VER"
|
||||
out_tar="$STAGE/images/${REPO//\//-}-$VERSION.tar"
|
||||
docker save -o "$out_tar" "$TAG_VER"
|
||||
gzip -f "$out_tar"
|
||||
|
||||
# 2) Busybox tar for connectivity/overlay warmup (prefer local template; fallback to docker save)
|
||||
BB_SRC="$TEMPL_DIR/images/busybox.tar"
|
||||
if [[ -f "$BB_SRC" ]]; then
|
||||
cp "$BB_SRC" "$STAGE/images/busybox.tar"
|
||||
else
|
||||
if docker image inspect busybox:latest >/dev/null 2>&1 || docker pull busybox:latest >/dev/null 2>&1; then
|
||||
docker save -o "$STAGE/images/busybox.tar" busybox:latest
|
||||
log "Included busybox from local docker daemon"
|
||||
else
|
||||
warn "busybox image not found and cannot pull; skipping busybox.tar"
|
||||
fi
|
||||
fi
|
||||
|
||||
# 3) Compose + env template and docs/scripts from templates
|
||||
cp "$TEMPL_DIR/compose/docker-compose.yml" "$STAGE/compose/docker-compose.yml"
|
||||
ENV_EX="$STAGE/compose/.env.example"
|
||||
cat >"$ENV_EX" <<EOF
|
||||
# Generated by make_client_gpu_package.sh
|
||||
PKG_VERSION=$VERSION
|
||||
|
||||
NODE_GPU_BUNDLE_IMAGE_TAG=${REPO}:${VERSION}
|
||||
|
||||
# Compose project name (isolation from server stack)
|
||||
COMPOSE_PROJECT_NAME=argus-client
|
||||
|
||||
# Required (no defaults). Must be filled before install.
|
||||
AGENT_ENV=
|
||||
AGENT_USER=
|
||||
AGENT_INSTANCE=
|
||||
GPU_NODE_HOSTNAME=
|
||||
|
||||
# Overlay network (should match the server package's overlay)
|
||||
ARGUS_OVERLAY_NET=argus-sys-net
|
||||
|
||||
# From cluster-info.env (server package output)
|
||||
SWARM_MANAGER_ADDR=
|
||||
SWARM_JOIN_TOKEN_WORKER=
|
||||
SWARM_JOIN_TOKEN_MANAGER=
|
||||
EOF
|
||||
|
||||
# 4) Docs from deployment_new templates
|
||||
CLIENT_DOC_SRC="$TEMPL_DIR/docs"
|
||||
if [[ -d "$CLIENT_DOC_SRC" ]]; then
|
||||
rsync -a "$CLIENT_DOC_SRC/" "$STAGE/docs/" >/dev/null 2>&1 || cp -r "$CLIENT_DOC_SRC/." "$STAGE/docs/"
|
||||
fi
|
||||
|
||||
# Placeholder scripts (will be implemented in M2)
|
||||
cat >"$STAGE/scripts/README.md" <<'EOF'
|
||||
# Client-GPU Scripts (Placeholder)
|
||||
|
||||
This directory will gain the following in M2:
|
||||
- config.sh / install.sh
|
||||
|
||||
For now it is a placeholder so the package structure can be reviewed.
|
||||
EOF
|
||||
|
||||
# 5) Scripts (from deployment_new templates) and Private skeleton
|
||||
SCRIPTS_SRC="$TEMPL_DIR/scripts"
|
||||
if [[ -d "$SCRIPTS_SRC" ]]; then
|
||||
rsync -a "$SCRIPTS_SRC/" "$STAGE/scripts/" >/dev/null 2>&1 || cp -r "$SCRIPTS_SRC/." "$STAGE/scripts/"
|
||||
find "$STAGE/scripts" -type f -name '*.sh' -exec chmod +x {} + 2>/dev/null || true
|
||||
fi
|
||||
mkdir -p "$STAGE/private/argus/agent"
|
||||
|
||||
# 6) Manifest & checksums
|
||||
gen_manifest "$STAGE" "$STAGE/manifest.txt"
|
||||
checksum_dir "$STAGE" "$STAGE/checksums.txt"
|
||||
|
||||
# 7) Move to artifact dir and pack
|
||||
mkdir -p "$PKG_DIR"
|
||||
rsync -a "$STAGE/" "$PKG_DIR/" >/dev/null 2>&1 || cp -r "$STAGE/." "$PKG_DIR/"
|
||||
|
||||
OUT_TAR_DIR="$(dirname "$PKG_DIR")"
|
||||
OUT_TAR="$OUT_TAR_DIR/client_gpu_${VERSION}.tar.gz"
|
||||
log "Creating tarball: $OUT_TAR"
|
||||
(cd "$PKG_DIR/.." && tar -czf "$OUT_TAR" "$(basename "$PKG_DIR")")
|
||||
log "Client-GPU package ready: $PKG_DIR"
|
||||
echo "$OUT_TAR"
|
||||
@ -1,160 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# Make server deployment package (versioned, per-image tars, full compose, docs, skeleton)
|
||||
|
||||
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
|
||||
TEMPL_DIR="$ROOT_DIR/deployment_new/templates/server"
|
||||
ART_ROOT="$ROOT_DIR/deployment_new/artifact/server"
|
||||
|
||||
# Use deployment_new local common helpers
|
||||
COMMON_SH="$ROOT_DIR/deployment_new/build/common.sh"
|
||||
. "$COMMON_SH"
|
||||
|
||||
usage(){ cat <<EOF
|
||||
Build Server Deployment Package (deployment_new)
|
||||
|
||||
Usage: $(basename "$0") --version YYYYMMDD
|
||||
|
||||
Outputs: deployment_new/artifact/server/<YYYYMMDD>/ and server_YYYYMMDD.tar.gz
|
||||
EOF
|
||||
}
|
||||
|
||||
VERSION=""
|
||||
while [[ $# -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--version) VERSION="$2"; shift 2;;
|
||||
-h|--help) usage; exit 0;;
|
||||
*) err "unknown arg: $1"; usage; exit 1;;
|
||||
esac
|
||||
done
|
||||
if [[ -z "$VERSION" ]]; then VERSION="$(today_version)"; fi
|
||||
|
||||
require_cmd docker tar gzip awk sed
|
||||
|
||||
IMAGES=(
|
||||
argus-master
|
||||
argus-elasticsearch
|
||||
argus-kibana
|
||||
argus-metric-prometheus
|
||||
argus-metric-grafana
|
||||
argus-alertmanager
|
||||
argus-web-frontend
|
||||
argus-web-proxy
|
||||
)
|
||||
|
||||
STAGE="$(mktemp -d)"; trap 'rm -rf "$STAGE"' EXIT
|
||||
PKG_DIR="$ART_ROOT/$VERSION"
|
||||
mkdir -p "$PKG_DIR" "$STAGE/images" "$STAGE/compose" "$STAGE/docs" "$STAGE/scripts" "$STAGE/private/argus"
|
||||
|
||||
# 1) Save per-image tars with version tag
|
||||
log "Tagging and saving images (version=$VERSION)"
|
||||
for repo in "${IMAGES[@]}"; do
|
||||
if ! docker image inspect "$repo:latest" >/dev/null 2>&1 && ! docker image inspect "$repo:$VERSION" >/dev/null 2>&1; then
|
||||
err "missing image: $repo (need :latest or :$VERSION)"; exit 1; fi
|
||||
if docker image inspect "$repo:$VERSION" >/dev/null 2>&1; then
|
||||
tag="$repo:$VERSION"
|
||||
else
|
||||
docker tag "$repo:latest" "$repo:$VERSION"
|
||||
tag="$repo:$VERSION"
|
||||
fi
|
||||
out_tar="$STAGE/images/${repo//\//-}-$VERSION.tar"
|
||||
docker save -o "$out_tar" "$tag"
|
||||
gzip -f "$out_tar"
|
||||
done
|
||||
|
||||
# 2) Compose + env template
|
||||
cp "$TEMPL_DIR/compose/docker-compose.yml" "$STAGE/compose/docker-compose.yml"
|
||||
ENV_EX="$STAGE/compose/.env.example"
|
||||
cat >"$ENV_EX" <<EOF
|
||||
# Generated by make_server_package.sh
|
||||
PKG_VERSION=$VERSION
|
||||
|
||||
# Image tags (can be overridden). Default to versioned tags
|
||||
MASTER_IMAGE_TAG=argus-master:
|
||||
ES_IMAGE_TAG=argus-elasticsearch:
|
||||
KIBANA_IMAGE_TAG=argus-kibana:
|
||||
PROM_IMAGE_TAG=argus-metric-prometheus:
|
||||
GRAFANA_IMAGE_TAG=argus-metric-grafana:
|
||||
ALERT_IMAGE_TAG=argus-alertmanager:
|
||||
FRONT_IMAGE_TAG=argus-web-frontend:
|
||||
WEB_PROXY_IMAGE_TAG=argus-web-proxy:
|
||||
EOF
|
||||
sed -i "s#:\$#:${VERSION}#g" "$ENV_EX"
|
||||
|
||||
# Ports and defaults (based on swarm_tests .env.example)
|
||||
cat >>"$ENV_EX" <<'EOF'
|
||||
|
||||
# Host ports for server compose
|
||||
MASTER_PORT=32300
|
||||
ES_HTTP_PORT=9200
|
||||
KIBANA_PORT=5601
|
||||
PROMETHEUS_PORT=9090
|
||||
GRAFANA_PORT=3000
|
||||
ALERTMANAGER_PORT=9093
|
||||
WEB_PROXY_PORT_8080=8080
|
||||
WEB_PROXY_PORT_8081=8081
|
||||
WEB_PROXY_PORT_8082=8082
|
||||
WEB_PROXY_PORT_8083=8083
|
||||
WEB_PROXY_PORT_8084=8084
|
||||
WEB_PROXY_PORT_8085=8085
|
||||
|
||||
# Overlay network name
|
||||
ARGUS_OVERLAY_NET=argus-sys-net
|
||||
|
||||
# UID/GID for volume ownership
|
||||
ARGUS_BUILD_UID=2133
|
||||
ARGUS_BUILD_GID=2015
|
||||
|
||||
# Compose project name (isolation from other stacks on same host)
|
||||
COMPOSE_PROJECT_NAME=argus-server
|
||||
EOF
|
||||
|
||||
# 3) Docs (from deployment_new templates)
|
||||
DOCS_SRC="$TEMPL_DIR/docs"
|
||||
if [[ -d "$DOCS_SRC" ]]; then
|
||||
rsync -a "$DOCS_SRC/" "$STAGE/docs/" >/dev/null 2>&1 || cp -r "$DOCS_SRC/." "$STAGE/docs/"
|
||||
fi
|
||||
|
||||
# 6) Scripts (from deployment_new templates)
|
||||
SCRIPTS_SRC="$TEMPL_DIR/scripts"
|
||||
if [[ -d "$SCRIPTS_SRC" ]]; then
|
||||
rsync -a "$SCRIPTS_SRC/" "$STAGE/scripts/" >/dev/null 2>&1 || cp -r "$SCRIPTS_SRC/." "$STAGE/scripts/"
|
||||
find "$STAGE/scripts" -type f -name '*.sh' -exec chmod +x {} + 2>/dev/null || true
|
||||
fi
|
||||
|
||||
# 4) Private skeleton (minimum)
|
||||
mkdir -p \
|
||||
"$STAGE/private/argus/etc" \
|
||||
"$STAGE/private/argus/master" \
|
||||
"$STAGE/private/argus/metric/prometheus" \
|
||||
"$STAGE/private/argus/metric/prometheus/data" \
|
||||
"$STAGE/private/argus/metric/prometheus/rules" \
|
||||
"$STAGE/private/argus/metric/prometheus/targets" \
|
||||
"$STAGE/private/argus/metric/grafana" \
|
||||
"$STAGE/private/argus/metric/grafana/data" \
|
||||
"$STAGE/private/argus/metric/grafana/logs" \
|
||||
"$STAGE/private/argus/metric/grafana/plugins" \
|
||||
"$STAGE/private/argus/metric/grafana/provisioning/datasources" \
|
||||
"$STAGE/private/argus/metric/grafana/provisioning/dashboards" \
|
||||
"$STAGE/private/argus/metric/grafana/data/sessions" \
|
||||
"$STAGE/private/argus/metric/grafana/data/dashboards" \
|
||||
"$STAGE/private/argus/metric/grafana/config" \
|
||||
"$STAGE/private/argus/alert/alertmanager" \
|
||||
"$STAGE/private/argus/log/elasticsearch" \
|
||||
"$STAGE/private/argus/log/kibana"
|
||||
|
||||
# 7) Manifest & checksums
|
||||
gen_manifest "$STAGE" "$STAGE/manifest.txt"
|
||||
checksum_dir "$STAGE" "$STAGE/checksums.txt"
|
||||
|
||||
# 8) Move to artifact dir and pack
|
||||
mkdir -p "$PKG_DIR"
|
||||
rsync -a "$STAGE/" "$PKG_DIR/" >/dev/null 2>&1 || cp -r "$STAGE/." "$PKG_DIR/"
|
||||
|
||||
OUT_TAR_DIR="$(dirname "$PKG_DIR")"
|
||||
OUT_TAR="$OUT_TAR_DIR/server_${VERSION}.tar.gz"
|
||||
log "Creating tarball: $OUT_TAR"
|
||||
(cd "$PKG_DIR/.." && tar -czf "$OUT_TAR" "$(basename "$PKG_DIR")")
|
||||
log "Server package ready: $PKG_DIR"
|
||||
echo "$OUT_TAR"
|
||||
@ -1,38 +0,0 @@
|
||||
version: "3.8"
|
||||
|
||||
networks:
|
||||
argus-sys-net:
|
||||
external: true
|
||||
|
||||
services:
|
||||
metric-gpu-node:
|
||||
image: ${NODE_GPU_BUNDLE_IMAGE_TAG:-argus-sys-metric-test-node-bundle-gpu:${PKG_VERSION}}
|
||||
container_name: argus-metric-gpu-node-swarm
|
||||
hostname: ${GPU_NODE_HOSTNAME}
|
||||
restart: unless-stopped
|
||||
privileged: true
|
||||
runtime: nvidia
|
||||
environment:
|
||||
- TZ=Asia/Shanghai
|
||||
- DEBIAN_FRONTEND=noninteractive
|
||||
- MASTER_ENDPOINT=${MASTER_ENDPOINT:-http://master.argus.com:3000}
|
||||
# Fluent Bit / log shipping target (fixed domain name)
|
||||
- ES_HOST=es.log.argus.com
|
||||
- ES_PORT=9200
|
||||
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
|
||||
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
|
||||
- AGENT_ENV=${AGENT_ENV}
|
||||
- AGENT_USER=${AGENT_USER}
|
||||
- AGENT_INSTANCE=${AGENT_INSTANCE}
|
||||
- NVIDIA_VISIBLE_DEVICES=all
|
||||
- NVIDIA_DRIVER_CAPABILITIES=compute,utility
|
||||
- GPU_MODE=gpu
|
||||
networks:
|
||||
argus-sys-net:
|
||||
aliases:
|
||||
- ${AGENT_INSTANCE}.node.argus.com
|
||||
volumes:
|
||||
- ../private/argus/agent:/private/argus/agent
|
||||
- ../logs/infer:/logs/infer
|
||||
- ../logs/train:/logs/train
|
||||
command: ["sleep", "infinity"]
|
||||
@ -1,73 +0,0 @@
|
||||
# Argus Client‑GPU Installation Guide (deployment_new)
|
||||
|
||||
## 1. Prerequisites (confirm before you start)
- The GPU node has the NVIDIA driver installed and `nvidia-smi` works;
- Docker and Docker Compose v2 are installed;
- Run the installation as the shared `argus` account (UID=2133, GID=2015), which must belong to the `docker` group (skip this if the account already exists):
  ```bash
  sudo groupadd --gid 2015 argus || true
  sudo useradd --uid 2133 --gid 2015 --create-home --shell /bin/bash argus || true
  sudo passwd argus
  sudo usermod -aG docker argus
  su - argus -c 'id; docker ps >/dev/null && echo OK || echo NO_DOCKER_PERMISSION'
  ```
  All later unpacking and script runs (config/install/uninstall) use the `argus` account.
- Obtain `cluster-info.env` from whoever installed the Server (it contains `SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_*`; under the compose architecture BINDIP/FTPIP are no longer used).
|
||||
|
||||
## 2. Unpack
- `tar -xzf client_gpu_YYYYMMDD.tar.gz`
- Enter the directory: `cd client_gpu_YYYYMMDD/`
- You should see: `images/` (GPU bundle, busybox), `compose/`, `scripts/`, `docs/`.
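
A quick way to confirm the unpacked layout before running any script is sketched below. It only checks that the four directories named above exist; the function name `check_package_layout` is an assumption for illustration and is not shipped in the package.

```bash
#!/usr/bin/env bash
# check_package_layout <package-root>
# Hypothetical helper: reports each expected directory that is missing.
check_package_layout() {
  local root="$1" missing=0 d
  for d in images compose scripts docs; do
    if [[ ! -d "$root/$d" ]]; then
      echo "missing directory: $d" >&2
      missing=1
    fi
  done
  # Exit status 0 means all four directories are present.
  return "$missing"
}
```

From inside the unpacked directory this would be called as `check_package_layout . && echo "layout OK"`.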
|
||||
|
||||
## 3. Configure with config (warm up the overlay + generate .env)
Command:
```
cp /path/to/cluster-info.env ./ # or export CLUSTER_INFO=/abs/path/cluster-info.env
./scripts/config.sh
```
What the script does:
- Reads `cluster-info.env` and runs `docker swarm join` (idempotent);
- Uses busybox to warm up the external overlay `argus-sys-net` automatically, waiting up to 60s until it is visible on this host;
- Generates/updates `compose/.env`: fills in `SWARM_*` and keeps any `AGENT_*` and `GPU_NODE_HOSTNAME` values you have already entered (they are not overwritten).
|
||||
|
||||
How to tell it succeeded:
- The terminal prints something like: `overlay=argus-sys-net warmed up and compose/.env generated; you can run scripts/install.sh`;
- `compose/.env` contains at least:
  - `AGENT_ENV/AGENT_USER/AGENT_INSTANCE/GPU_NODE_HOSTNAME` (you must fill these in beforehand);
  - `SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_*`;
  - `NODE_GPU_BUNDLE_IMAGE_TAG=...:YYYYMMDD`.
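
To check the required entries yourself before re-running config, a minimal sketch along the lines of the check in `scripts/config.sh` is shown below. The function name `missing_env_keys` is hypothetical; it treats a key as missing when the line is absent or the value after `=` is empty.

```bash
#!/usr/bin/env bash
# missing_env_keys <env-file> <KEY>...
# Hypothetical helper: prints each required key that is absent or empty.
missing_env_keys() {
  local file="$1"; shift
  local key val
  for key in "$@"; do
    # First KEY=... line wins; everything after the first '=' is the value.
    # '|| true' keeps a non-matching grep from aborting under 'set -e -o pipefail'.
    val="$(grep -E "^${key}=" "$file" | head -n 1 | cut -d= -f2- || true)"
    [[ -n "$val" ]] || echo "$key"
  done
}
```

Example call: `missing_env_keys compose/.env AGENT_ENV AGENT_USER AGENT_INSTANCE GPU_NODE_HOSTNAME`. No output means every listed key has a value.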
|
||||
|
||||
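### Log mapping (important)
- `/logs/infer` and `/logs/train` inside the container are mapped to `./logs/infer` and `./logs/train` under the package root:
  - You can read inference/training logs directly on the host: `tail -f logs/infer/*.log`, `tail -f logs/train/*.log`;
  - The install script creates both directories automatically.

If required values are reported missing:
- Open `compose/.env`, fill in `AGENT_*` and `GPU_NODE_HOSTNAME` as prompted, then run `./scripts/config.sh` again (the script does not overwrite values you have already entered).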
|
||||
|
||||
## 4. Install with install (load the image + start the container + follow logs)
Command:
```
./scripts/install.sh
```
What the script does:
- Warms up the overlay first if necessary;
- Loads `argus-sys-metric-test-node-bundle-gpu-*.tar.gz` from `images/` into the local Docker daemon;
- Runs `docker compose up -d` to start the GPU node container, then automatically runs `docker logs -f argus-metric-gpu-node-swarm` to follow the installation.
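
The archive name that install looks for follows from how `make_client_gpu_package.sh` names its output: slashes in the repository name become dashes, then the version and `.tar.gz` are appended. A minimal sketch of that mapping, using a hypothetical helper name `image_archive_name`:

```bash
#!/usr/bin/env bash
# image_archive_name <repo> <version>
# Hypothetical helper: mirrors the "${REPO//\//-}-$VERSION.tar" + gzip naming.
image_archive_name() {
  local repo="$1" version="$2"
  printf '%s-%s.tar.gz\n' "${repo//\//-}" "$version"
}
```

For example, `image_archive_name argus-sys-metric-test-node-bundle-gpu 20250101` prints `argus-sys-metric-test-node-bundle-gpu-20250101.tar.gz`, which matches the glob used by `install.sh`.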
|
||||
|
||||
How to tell it succeeded:
- The logs show: `[BOOT] local bundle install OK: version=...` / `dcgm-exporter ... listening` / `node state present: /private/argus/agent/<hostname>/node.json`;
- `docker exec argus-metric-gpu-node-swarm nvidia-smi -L` lists the GPUs;
- On the Server side, Prometheus `/api/v1/targets` shows at least one of the GPU node's 9100 (node-exporter) and 9400 (dcgm-exporter) targets as up.
|
||||
|
||||
## 5. Uninstall with uninstall
Command:
```
./scripts/uninstall.sh
```
Behavior: runs Compose down (if a .env exists) and removes the warmup container and the node container.
|
||||
|
||||
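## 6. FAQ
- `overlay not visible on this host`: config/install already warm it up automatically; if it still fails, check network connectivity to the manager and whether `argus-sys-net` has been created on the manager.
- `busybox missing`: make sure `images/busybox.tar` exists in the package root, or that the host already has `busybox:latest`.
- `failed to join Swarm`: confirm that `SWARM_MANAGER_ADDR` and `SWARM_JOIN_TOKEN_WORKER` in `cluster-info.env` are correct, or run `docker swarm join-token -q worker` on the manager again and update the file.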
|
||||
Binary file not shown.
@ -1,90 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
PKG_ROOT="$ROOT_DIR"
|
||||
ENV_EX="$PKG_ROOT/compose/.env.example"
|
||||
ENV_OUT="$PKG_ROOT/compose/.env"
|
||||
|
||||
info(){ echo -e "\033[34m[CONFIG-GPU]\033[0m $*"; }
|
||||
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
|
||||
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "missing dependency: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
|
||||
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
|
||||
require_compose(){
|
||||
if docker compose version >/dev/null 2>&1; then return 0; fi
|
||||
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
|
||||
err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
|
||||
}
|
||||
require docker curl jq awk sed tar gzip
|
||||
require_compose
|
||||
|
||||
# Disk space check (MB)
|
||||
check_disk(){ local p="$1"; local need=10240; local free
|
||||
free=$(df -Pm "$p" | awk 'NR==2{print $4+0}')
|
||||
if [[ -z "$free" || "$free" -lt "$need" ]]; then err "insufficient disk space: $p has ${free:-0}MB free (<${need}MB)"; return 1; fi
|
||||
}
|
||||
check_disk "$PKG_ROOT"; check_disk "/var/lib/docker" || true
|
||||
|
||||
# Load cluster-info.env (defaults to the package root; CLUSTER_INFO can point to another path)
|
||||
CI_IN="${CLUSTER_INFO:-$PKG_ROOT/cluster-info.env}"
|
||||
info "Reading cluster-info.env: $CI_IN"
|
||||
[[ -f "$CI_IN" ]] || { err "cluster-info.env not found (looked in the package root by default; set CLUSTER_INFO to an absolute path to override)"; exit 1; }
|
||||
set -a; source "$CI_IN"; set +a
|
||||
[[ -n "${SWARM_MANAGER_ADDR:-}" && -n "${SWARM_JOIN_TOKEN_WORKER:-}" ]] || { err "cluster-info.env is missing Swarm info (SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_WORKER)"; exit 1; }
|
||||
|
||||
# Join the Swarm (idempotent)
|
||||
info "Joining Swarm (idempotent): $SWARM_MANAGER_ADDR"
|
||||
docker swarm join --token "$SWARM_JOIN_TOKEN_WORKER" "$SWARM_MANAGER_ADDR":2377 >/dev/null 2>&1 || true
|
||||
|
||||
# Load busybox, then warm up the overlay and check connectivity (always runs)
|
||||
NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
|
||||
# Prepare busybox
|
||||
if ! docker image inspect busybox:latest >/dev/null 2>&1; then
|
||||
if [[ -f "$PKG_ROOT/images/busybox.tar" ]]; then
|
||||
info "Loading busybox.tar to warm up the overlay"
|
||||
docker load -i "$PKG_ROOT/images/busybox.tar" >/dev/null
|
||||
else
|
||||
err "busybox image missing (images/busybox.tar in the package or a local busybox:latest); cannot warm up overlay $NET_NAME"; exit 1
|
||||
fi
|
||||
fi
|
||||
# Warmup container (joins the overlay from the worker side so it becomes visible locally)
|
||||
docker rm -f argus-net-warmup >/dev/null 2>&1 || true
|
||||
info "Starting warmup container on overlay: $NET_NAME"
|
||||
docker run -d --rm --name argus-net-warmup --network "$NET_NAME" busybox:latest sleep 600 >/dev/null 2>&1 || true
|
||||
for i in {1..60}; do docker network inspect "$NET_NAME" >/dev/null 2>&1 && { info "overlay visible (t=${i}s)"; break; }; sleep 1; done
|
||||
docker network inspect "$NET_NAME" >/dev/null 2>&1 || { err "overlay still not visible after warmup: $NET_NAME; check that the manager has created it and the network is reachable"; exit 1; }
|
||||
|
||||
# Test the real data path through the warmup container (alias → master)
|
||||
if ! docker exec argus-net-warmup sh -lc "ping -c 1 -W 2 master.argus.com >/dev/null 2>&1"; then
|
||||
err "master.argus.com is not reachable by alias from the warmup container; check that the server compose stack is up and attached to overlay $NET_NAME"
|
||||
exit 1
|
||||
fi
|
||||
info "master.argus.com is reachable from the warmup container (Docker DNS + alias OK)"
|
||||
|
||||
# Generate/update .env (keeps manually entered values; existing keys are not overwritten)
|
||||
if [[ ! -f "$ENV_OUT" ]]; then
|
||||
cp "$ENV_EX" "$ENV_OUT"
|
||||
fi
|
||||
|
||||
set_kv(){ local k="$1" v="$2"; if grep -q "^${k}=" "$ENV_OUT"; then sed -i -E "s#^${k}=.*#${k}=${v}#" "$ENV_OUT"; else echo "${k}=${v}" >> "$ENV_OUT"; fi }
|
||||
|
||||
set_kv SWARM_MANAGER_ADDR "${SWARM_MANAGER_ADDR:-}"
|
||||
set_kv SWARM_JOIN_TOKEN_WORKER "${SWARM_JOIN_TOKEN_WORKER:-}"
|
||||
set_kv SWARM_JOIN_TOKEN_MANAGER "${SWARM_JOIN_TOKEN_MANAGER:-}"
|
||||
|
||||
REQ_VARS=(AGENT_ENV AGENT_USER AGENT_INSTANCE GPU_NODE_HOSTNAME)
|
||||
missing=()
|
||||
for v in "${REQ_VARS[@]}"; do
|
||||
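# '|| true': a key that is absent altogether must not abort the script under 'set -euo pipefail'
val=$(grep -E "^$v=" "$ENV_OUT" | head -1 | cut -d= -f2- || true)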
|
||||
if [[ -z "$val" ]]; then missing+=("$v"); fi
|
||||
done
|
||||
if [[ ${#missing[@]} -gt 0 ]]; then
|
||||
err "These variables must be set in compose/.env: ${missing[*]} (your existing values are kept and will not be overwritten)"; exit 1; fi
|
||||
|
||||
info "compose/.env generated; you can now run scripts/install.sh"
|
||||
|
||||
# Create the host log directories and set permissions (idempotent; lets you inspect or pre-create them before install)
|
||||
mkdir -p "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer"
|
||||
chmod 1777 "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer" || true
|
||||
info "Log directory permissions (expected 1777, sticky bit set):"
|
||||
stat -c '%a %U:%G %n' "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer" 2>/dev/null || true
|
||||
@ -1,72 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
PKG_ROOT="$ROOT_DIR"
|
||||
ENV_FILE="$PKG_ROOT/compose/.env"
|
||||
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
|
||||
|
||||
info(){ echo -e "\033[34m[INSTALL-GPU]\033[0m $*"; }
|
||||
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
|
||||
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "missing dependency: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
|
||||
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
|
||||
require_compose(){
|
||||
if docker compose version >/dev/null 2>&1; then return 0; fi
|
||||
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
|
||||
err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
|
||||
}
|
||||
require docker nvidia-smi
|
||||
require_compose
|
||||
|
||||
[[ -f "$ENV_FILE" ]] || { err "compose/.env is missing; run scripts/config.sh first"; exit 1; }
|
||||
info "Using env file: $ENV_FILE"
|
||||
|
||||
# Warm up the overlay (the warmup container may be gone if config ran long ago or containers were cleaned up)
|
||||
set -a; source "$ENV_FILE"; set +a
|
||||
NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
|
||||
info "Checking overlay network visibility: $NET_NAME"
|
||||
if ! docker network inspect "$NET_NAME" >/dev/null 2>&1; then
|
||||
# If the overlay is not visible, try warming it up with busybox (only to make sure this worker has joined the overlay)
|
||||
if ! docker image inspect busybox:latest >/dev/null 2>&1; then
|
||||
if [[ -f "$PKG_ROOT/images/busybox.tar" ]]; then docker load -i "$PKG_ROOT/images/busybox.tar"; else err "busybox image missing (images/busybox.tar or a local busybox:latest)"; exit 1; fi
|
||||
fi
|
||||
docker rm -f argus-net-warmup >/dev/null 2>&1 || true
|
||||
docker run -d --rm --name argus-net-warmup --network "$NET_NAME" busybox:latest sleep 600 >/dev/null 2>&1 || true
|
||||
for i in {1..60}; do docker network inspect "$NET_NAME" >/dev/null 2>&1 && break; sleep 1; done
|
||||
docker network inspect "$NET_NAME" >/dev/null 2>&1 || { err "overlay still not visible after warmup: $NET_NAME; check that the manager has created it and the network is reachable"; exit 1; }
|
||||
info "overlay is visible (warmup=argus-net-warmup)"
|
||||
fi
|
||||
|
||||
# If a warmup container is running (for example, one re-created above), test the alias data path again
|
||||
if docker ps --format '{{.Names}}' | grep -q '^argus-net-warmup$'; then
|
||||
if ! docker exec argus-net-warmup sh -lc "ping -c 1 -W 2 master.argus.com >/dev/null 2>&1"; then
|
||||
err "GPU install stage: master.argus.com is not reachable by alias from the warmup container; check overlay $NET_NAME and the server status"
|
||||
exit 1
|
||||
fi
|
||||
info "GPU install stage: master.argus.com is reachable from the warmup container"
|
||||
fi
|
||||
|
||||
# Load the GPU bundle image
|
||||
IMG_TGZ=$(ls -1 "$PKG_ROOT"/images/argus-sys-metric-test-node-bundle-gpu-*.tar.gz 2>/dev/null | head -1 || true)
|
||||
[[ -n "$IMG_TGZ" ]] || { err "GPU bundle image tar.gz not found"; exit 1; }
|
||||
info "Loading GPU bundle image: $(basename "$IMG_TGZ")"
|
||||
tmp=$(mktemp); gunzip -c "$IMG_TGZ" > "$tmp"; docker load -i "$tmp" >/dev/null; rm -f "$tmp"
|
||||
|
||||
# Make sure the host-side log directories exist (bind-mounted as /logs/infer and /logs/train) and set mode 1777 (sticky bit)
|
||||
mkdir -p "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train"
|
||||
chmod 1777 "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" || true
|
||||
info "日志目录已准备并赋权 1777: logs/infer logs/train"
|
||||
stat -c '%a %U:%G %n' "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" 2>/dev/null || true
|
||||
|
||||
# Start compose and follow the logs
|
||||
PROJECT="${COMPOSE_PROJECT_NAME:-argus-client}"
|
||||
info "启动 GPU 节点 (docker compose -p $PROJECT up -d)"
|
||||
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" up -d
|
||||
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps
|
||||
|
||||
# Re-apply the host log directory permissions in case scripts inside the container reverted the bind-mount mode
|
||||
chmod 1777 "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" || true
|
||||
stat -c '%a %U:%G %n' "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" 2>/dev/null || true
|
||||
|
||||
info "跟踪节点容器日志(按 Ctrl+C 退出)"
|
||||
docker logs -f argus-metric-gpu-node-swarm || true
|
||||
@ -1,36 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
PKG_ROOT="$ROOT_DIR"
|
||||
ENV_FILE="$PKG_ROOT/compose/.env"
|
||||
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
|
||||
|
||||
# load COMPOSE_PROJECT_NAME if provided in compose/.env
|
||||
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
|
||||
PROJECT="${COMPOSE_PROJECT_NAME:-argus-client}"
|
||||
|
||||
info(){ echo -e "\033[34m[UNINSTALL-GPU]\033[0m $*"; }
|
||||
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
|
||||
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
|
||||
require_compose(){
|
||||
if docker compose version >/dev/null 2>&1; then return 0; fi
|
||||
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
|
||||
err "未检测到 Docker Compose,请安装 docker compose v2 或 docker-compose v1"; exit 1
|
||||
}
|
||||
require_compose
|
||||
|
||||
if [[ -f "$ENV_FILE" ]]; then
|
||||
info "stopping compose project (project=$PROJECT)"
|
||||
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" down --remove-orphans || true
|
||||
else
|
||||
info "compose/.env not found; attempting to remove container by name"
|
||||
fi
|
||||
|
||||
# remove warmup container if still running
|
||||
docker rm -f argus-net-warmup >/dev/null 2>&1 || true
|
||||
|
||||
# remove node container if present
|
||||
docker rm -f argus-metric-gpu-node-swarm >/dev/null 2>&1 || true
|
||||
|
||||
info "uninstall completed"
|
||||
@ -1,169 +0,0 @@
|
||||
version: "3.8"

networks:
  argus-sys-net:
    external: true

services:
  master:
    image: ${MASTER_IMAGE_TAG:-argus-master:${PKG_VERSION}}
    container_name: argus-master-sys
    environment:
      - OFFLINE_THRESHOLD_SECONDS=6
      - ONLINE_THRESHOLD_SECONDS=2
      - SCHEDULER_INTERVAL_SECONDS=1
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    ports:
      - "${MASTER_PORT:-32300}:3000"
    volumes:
      - ../private/argus/master:/private/argus/master
      - ../private/argus/metric/prometheus:/private/argus/metric/prometheus
      - ../private/argus/etc:/private/argus/etc
    networks:
      argus-sys-net:
        aliases:
          - master.argus.com
    restart: unless-stopped

  es:
    image: ${ES_IMAGE_TAG:-argus-elasticsearch:${PKG_VERSION}}
    container_name: argus-es-sys
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    volumes:
      - ../private/argus/log/elasticsearch:/private/argus/log/elasticsearch
      - ../private/argus/etc:/private/argus/etc
    ports:
      - "${ES_HTTP_PORT:-9200}:9200"
    restart: unless-stopped
    networks:
      argus-sys-net:
        aliases:
          - es.log.argus.com

  kibana:
    image: ${KIBANA_IMAGE_TAG:-argus-kibana:${PKG_VERSION}}
    container_name: argus-kibana-sys
    environment:
      - ELASTICSEARCH_HOSTS=http://es.log.argus.com:9200
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    volumes:
      - ../private/argus/log/kibana:/private/argus/log/kibana
      - ../private/argus/etc:/private/argus/etc
    depends_on: [es]
    ports:
      - "${KIBANA_PORT:-5601}:5601"
    restart: unless-stopped
    networks:
      argus-sys-net:
        aliases:
          - kibana.log.argus.com

  prometheus:
    image: ${PROM_IMAGE_TAG:-argus-metric-prometheus:${PKG_VERSION}}
    container_name: argus-prometheus
    restart: unless-stopped
    environment:
      - TZ=Asia/Shanghai
      - PROMETHEUS_BASE_PATH=/private/argus/metric/prometheus
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    ports:
      - "${PROMETHEUS_PORT:-9090}:9090"
    volumes:
      - ../private/argus/metric/prometheus:/private/argus/metric/prometheus
      - ../private/argus/etc:/private/argus/etc
    networks:
      argus-sys-net:
        aliases:
          - prom.metric.argus.com

  grafana:
    image: ${GRAFANA_IMAGE_TAG:-argus-metric-grafana:${PKG_VERSION}}
    container_name: argus-grafana
    restart: unless-stopped
    environment:
      - TZ=Asia/Shanghai
      - GRAFANA_BASE_PATH=/private/argus/metric/grafana
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
      - GF_SERVER_HTTP_PORT=3000
      - GF_LOG_LEVEL=warn
      - GF_LOG_MODE=console
      - GF_PATHS_PROVISIONING=/private/argus/metric/grafana/provisioning
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
    ports:
      - "${GRAFANA_PORT:-3000}:3000"
    volumes:
      - ../private/argus/metric/grafana:/private/argus/metric/grafana
      - ../private/argus/etc:/private/argus/etc
    depends_on: [prometheus]
    networks:
      argus-sys-net:
        aliases:
          - grafana.metric.argus.com

  alertmanager:
    image: ${ALERT_IMAGE_TAG:-argus-alertmanager:${PKG_VERSION}}
    container_name: argus-alertmanager
    environment:
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    volumes:
      - ../private/argus/etc:/private/argus/etc
      - ../private/argus/alert/alertmanager:/private/argus/alert/alertmanager
    networks:
      argus-sys-net:
        aliases:
          - alertmanager.alert.argus.com
    ports:
      - "${ALERTMANAGER_PORT:-9093}:9093"
    restart: unless-stopped

  web-frontend:
    image: ${FRONT_IMAGE_TAG:-argus-web-frontend:${PKG_VERSION}}
    container_name: argus-web-frontend
    environment:
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
      - EXTERNAL_MASTER_PORT=${WEB_PROXY_PORT_8085:-8085}
      - EXTERNAL_ALERTMANAGER_PORT=${WEB_PROXY_PORT_8084:-8084}
      - EXTERNAL_GRAFANA_PORT=${WEB_PROXY_PORT_8081:-8081}
      - EXTERNAL_PROMETHEUS_PORT=${WEB_PROXY_PORT_8082:-8082}
      - EXTERNAL_KIBANA_PORT=${WEB_PROXY_PORT_8083:-8083}
    volumes:
      - ../private/argus/etc:/private/argus/etc
    networks:
      argus-sys-net:
        aliases:
          - web.argus.com
    restart: unless-stopped

  web-proxy:
    image: ${WEB_PROXY_IMAGE_TAG:-argus-web-proxy:${PKG_VERSION}}
    container_name: argus-web-proxy
    depends_on: [master, grafana, prometheus, kibana, alertmanager]
    environment:
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    volumes:
      - ../private/argus/etc:/private/argus/etc
    networks:
      argus-sys-net:
        aliases:
          - proxy.argus.com
    ports:
      - "${WEB_PROXY_PORT_8080:-8080}:8080"
      - "${WEB_PROXY_PORT_8081:-8081}:8081"
      - "${WEB_PROXY_PORT_8082:-8082}:8082"
      - "${WEB_PROXY_PORT_8083:-8083}:8083"
      - "${WEB_PROXY_PORT_8084:-8084}:8084"
      - "${WEB_PROXY_PORT_8085:-8085}:8085"
    restart: unless-stopped
@ -1,102 +0,0 @@
|
||||
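# Argus Server Installation Guide (deployment_new)

Scope: deploying the Argus server-side components as one stack from the Server package, on Docker Swarm with an external overlay network.

This guide focuses on what to do, what to look for, and continuing only once each check passes.
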
## 1. Prerequisites (confirm before you start)
- Docker and Docker Compose v2 are installed; `docker info` works and `docker compose version` runs.
- You have root/sudo privileges and at least 10 GB of free disk space (for the package root and `/var/lib/docker`).
- You know this host's management address (SWARM_MANAGER_ADDR): an IP bound to one of this host's network interfaces and reachable from the other nodes.
- Important: run all later installation and operations steps as the shared account `argus` (UID=2133, GID=2015), and add it to the `docker` group. Example commands follow (substitute your own standard UID/GID if they differ):
```bash
# 1) Create the primary group (GID=2015, name argus; skip if it already exists)
sudo groupadd --gid 2015 argus || true

# 2) Create user argus (UID=2133, primary GID=2015, with a home directory and bash as the login shell; if it already exists, adjust it with usermod)
sudo useradd --uid 2133 --gid 2015 --create-home --shell /bin/bash argus || true
sudo passwd argus

# 3) Add argus to the docker group so it can call the Docker daemon (takes effect at the next login)
sudo usermod -aG docker argus

# 4) Verify (log in again, or run newgrp docker, for the group to take effect)
su - argus -c 'id; docker ps >/dev/null && echo OK || echo NO_DOCKER_PERMISSION'
```
Unpacking and running the scripts (config/install/selfcheck and so on) are all done as this `argus` account.

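The account requirement above can be checked before unpacking anything. The following is a minimal sketch, not part of the Argus package: `check_account` is a hypothetical helper name, and the expected IDs are passed as arguments so a site-specific UID/GID works as well.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not shipped with the Server package).
# Usage: check_account EXPECTED_UID EXPECTED_GID
check_account() {
  local want_uid="$1" want_gid="$2"
  local have_uid have_gid
  have_uid="$(id -u)"   # numeric UID of the invoking account
  have_gid="$(id -g)"   # numeric primary GID of the invoking account
  if [ "$have_uid" != "$want_uid" ] || [ "$have_gid" != "$want_gid" ]; then
    echo "account mismatch: have ${have_uid}:${have_gid}, want ${want_uid}:${want_gid}" >&2
    return 1
  fi
  echo "account OK: ${have_uid}:${have_gid}"
}
```

Run as the `argus` user created above, `check_account 2133 2015` should report success; any other account returns a non-zero status and prints the mismatch on stderr.
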
## 2. Unpacking and Directory Layout
- Extract: `tar -xzf server_YYYYMMDD.tar.gz`.
- Enter: `cd server_YYYYMMDD/`
- You should see:
  - `images/` (per-service image tar.gz files, e.g. `argus-master-YYYYMMDD.tar.gz`)
  - `compose/` (`docker-compose.yml` and `.env.example`)
  - `scripts/` (installation and operations scripts)
  - `private/argus/` (data and configuration skeleton)
  - `docs/` (documentation in Chinese)

## 3. Configuration: config (generates .env and SWARM_MANAGER_ADDR)
Commands:
```
export SWARM_MANAGER_ADDR=<management IP of this host>
./scripts/config.sh
```
What the script does:
- Checks dependencies and disk space;
- Allocates every service port automatically, starting at port 20000, making sure each port is unused on the system and that no two services collide;
- Writes `compose/.env` (ports, image tags, overlay name, UID/GID, and so on);
- Writes the UID/GID of the account running the script into `ARGUS_BUILD_UID/GID` (if the primary group is named docker, it uses the GID of the group named after the user instead, so it does not pick up the docker group's 999);
- Updates or appends `SWARM_MANAGER_ADDR` in `cluster-info.env` (other keys are left untouched).

What success looks like:
- Terminal output: `已生成 compose/.env 并更新 cluster-info.env 的 SWARM_MANAGER_ADDR。`
- Opening `compose/.env` should show:
  - all ports ≥ 20000 with no duplicates;
  - `ARGUS_BUILD_UID/GID` matching `id -u/-g`;
  - `SWARM_MANAGER_ADDR=<your IP>`.

If something goes wrong:
- A port is unexpectedly in use: delete `.env` and rerun `config.sh`, or edit the ports by hand and then run `install.sh`.

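The two port criteria for `compose/.env` (every port at least 20000, no duplicates) can be checked mechanically. The helper below is only a sketch and is not shipped in `scripts/`: `check_env_ports` is a hypothetical name, and it assumes port entries are `KEY=VALUE` lines whose key contains `PORT` and whose value is purely numeric.

```bash
#!/usr/bin/env bash
# Hypothetical checker for compose/.env (an illustration, not an Argus script).
# Usage: check_env_ports ENV_FILE [MIN_PORT]
check_env_ports() {
  local env_file="$1" min_port="${2:-20000}"
  local ports dup low
  # Collect the value of every line shaped like SOMETHING_PORT_SOMETHING=<digits>.
  ports="$(awk -F= '/^[A-Z0-9_]*PORT[A-Z0-9_]*=[0-9]+$/ {print $2}' "$env_file")"
  [ -n "$ports" ] || { echo "no port entries found in $env_file" >&2; return 1; }
  # Values that appear more than once.
  dup="$(printf '%s\n' "$ports" | sort -n | uniq -d)"
  # Values below the allowed minimum.
  low="$(printf '%s\n' "$ports" | awk -v m="$min_port" '$1 < m')"
  if [ -n "$dup" ]; then echo "duplicate ports:" $dup >&2; return 1; fi
  if [ -n "$low" ]; then echo "ports below $min_port:" $low >&2; return 1; fi
  echo "ports OK"
}
```

From the package root this would be called as `check_env_ports compose/.env`. The minimum is a parameter because `config.sh` reads its starting port from `START_PORT`.
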
## 4. Installation: install (one pass)
Command:
```
./scripts/install.sh
```
What the script does:
- If Swarm is not active, runs `docker swarm init --advertise-addr $SWARM_MANAGER_ADDR`;
- Ensures the external overlay `argus-sys-net` exists;
- Loads `images/*.tar.gz` into the local Docker;
- Starts the services with `docker compose up -d`;
- Waits for six readiness checks:
  - Master `/readyz`=200, ES `/_cluster/health`=200, Prometheus reachable over TCP, Grafana `/api/health`=200, Alertmanager `/api/v2/status`=200, Kibana `/api/status` level=available;
- Verifies Docker DNS and overlay aliases: inside `argus-web-proxy`, checks connectivity to `master.argus.com`, `grafana.metric.argus.com`, and the other domain names with `getent hosts` and `curl`;
- Writes `cluster-info.env` (containing `SWARM_JOIN_TOKEN_{WORKER,MANAGER}/SWARM_MANAGER_ADDR`; the compose architecture no longer depends on BINDIP/FTPIP);
- Generates `安装报告_YYYYMMDD-HHMMSS.md` (ports, health-check summary, and hints).

What success looks like:
- `docker compose ps` shows every service as Up;
- every HTTP check in `安装报告_…md` reads 200/available;
- `cluster-info.env` contains three key entries:
  - `SWARM_MANAGER_ADDR=...`
  - `SWARM_JOIN_TOKEN_WORKER=SWMTKN-...`
  - `SWARM_JOIN_TOKEN_MANAGER=SWMTKN-...`

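The last success criterion, that `cluster-info.env` carries the manager address and both join tokens, can be verified with a few lines of shell. This is a sketch under the assumption that the file uses plain `KEY=VALUE` lines; `check_cluster_info` is a hypothetical helper name, not one of the packaged scripts.

```bash
#!/usr/bin/env bash
# Hypothetical checker for cluster-info.env (an illustration only).
# Usage: check_cluster_info FILE
check_cluster_info() {
  local file="$1" key missing=0
  for key in SWARM_MANAGER_ADDR SWARM_JOIN_TOKEN_WORKER SWARM_JOIN_TOKEN_MANAGER; do
    # Each key must be present at the start of a line with a non-empty value.
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing or empty: $key" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

An empty value counts as a failure on purpose: `install.sh` writes the keys even when a token could not be read, so presence of the key alone does not prove the value is usable.
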
## 5. Health Self-Check and Common Operations
- Self-check: `./scripts/selfcheck.sh`
  - Expected output: `selfcheck OK -> logs/selfcheck.json`
  - In `logs/selfcheck.json`, `overlay_net/es/kibana/master_readyz/prometheus/grafana/alertmanager/web_proxy_cors` are all true.
- Status: `./scripts/status.sh` (equivalent to `docker compose ps`).
- Diagnostics: `./scripts/diagnose.sh` (collects container/HTTP/CORS/ES details and writes them to `logs/diagnose_*.log`).
- Uninstall: `./scripts/uninstall.sh` (Compose down).
- Temporarily relax / restore the ES disk watermarks: `./scripts/es-watermark-relax.sh` / `./scripts/es-watermark-restore.sh`.

## 6. Next Step: Hand cluster-info.env to the Client Installers
- Copy `cluster-info.env` to the colleague who installs the Client;
- They place the file in the package root on the Client machine (or set `CLUSTER_INFO=/absolute/path`).

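## 7. Troubleshooting at a Glance
- Proxy returns 502 or connections to 8080 are reset: usually the overlay alias has not taken effect, or web-proxy has not yet resolved the other services. Rerun `install.sh` (it restarts the stack and verifies DNS inside the container), or check `logs/diagnose_error.log`.
- Kibana is not available: wait 1 to 2 minutes and check the `argus-kibana-sys` logs;
- SWARM_MANAGER_ADDR is empty in cluster-info.env: rerun `export SWARM_MANAGER_ADDR=<IP>; ./scripts/config.sh`, or `./scripts/install.sh` (which reads `.env` back and fills the value in).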
@ -1,7 +0,0 @@
|
||||
# Docker Swarm Deployment Notes

- Initialize Swarm: `docker swarm init --advertise-addr <SWARM_MANAGER_ADDR>`
- Create the overlay: `docker network create --driver overlay --attachable argus-sys-net`
- The Server package's `install.sh` performs both steps automatically; if you run them by hand, make sure `argus-sys-net` exists and is attachable.
- Join a worker node: `docker swarm join --token <worker_token> <SWARM_MANAGER_ADDR>:2377`.

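On a worker node, the join command can be assembled from `cluster-info.env` instead of being typed by hand. The sketch below only prints the command and does not run it; `print_join_cmd` is a hypothetical helper name, and the key names are the ones `install.sh` writes into `cluster-info.env`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the worker join command from cluster-info.env.
# Usage: print_join_cmd FILE
print_join_cmd() {
  local file="$1" addr token
  # Take the first matching line for each key and strip the "KEY=" prefix.
  addr="$(sed -n 's/^SWARM_MANAGER_ADDR=//p' "$file" | head -n1)"
  token="$(sed -n 's/^SWARM_JOIN_TOKEN_WORKER=//p' "$file" | head -n1)"
  if [ -z "$addr" ] || [ -z "$token" ]; then
    echo "cluster-info.env lacks SWARM_MANAGER_ADDR or SWARM_JOIN_TOKEN_WORKER" >&2
    return 1
  fi
  # 2377 is the Swarm management port used in the join command in this document.
  printf 'docker swarm join --token %s %s:2377\n' "$token" "$addr"
}
```

Printing rather than executing keeps the helper free of side effects; the output can be reviewed first and then run in the shell.
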
@ -1,11 +0,0 @@
|
||||
# Troubleshooting (Server)

- Port conflicts: see the port table in `安装报告_*.md`; to change a port, edit `compose/.env` and run `docker compose ... up -d`.
- A component is not ready:
  - Master: `curl http://127.0.0.1:${MASTER_PORT}/readyz -I`
  - ES: `curl http://127.0.0.1:${ES_HTTP_PORT}/_cluster/health`
  - Grafana: `curl http://127.0.0.1:${GRAFANA_PORT}/api/health`
  - Prometheus TCP: `exec 3<>/dev/tcp/127.0.0.1/${PROMETHEUS_PORT}`
- Name resolution: inside the `argus-web-proxy` or `argus-master-sys` container, run `getent hosts master.argus.com`.
- Swarm/Overlay: check `docker network ls | grep argus-sys-net`, or `docker node ls`.

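The Prometheus TCP probe above opens file descriptor 3 in the current shell and leaves it open, and it can block if the host does not answer. A bounded variant is sketched below, assuming bash and GNU `timeout` are available on the host; `tcp_open` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if a TCP connection can be opened in time.
# Usage: tcp_open HOST PORT [TIMEOUT_SECONDS]
tcp_open() {
  local host="$1" port="$2" secs="${3:-2}"
  # The connect attempt runs in a child bash, so the descriptor is closed when
  # that process exits; `timeout` bounds how long the attempt may take.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

For example, `tcp_open 127.0.0.1 "${PROMETHEUS_PORT}" && echo reachable` reports whether the published Prometheus port accepts connections, without leaving a descriptor open in the interactive shell.
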
@ -1,108 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
PKG_ROOT="$ROOT_DIR"
|
||||
ENV_EX="$PKG_ROOT/compose/.env.example"
|
||||
ENV_OUT="$PKG_ROOT/compose/.env"
|
||||
|
||||
info(){ echo -e "\033[34m[CONFIG]\033[0m $*"; }
|
||||
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
|
||||
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "缺少依赖: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
|
||||
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
|
||||
require_compose(){
|
||||
if docker compose version >/dev/null 2>&1; then return 0; fi
|
||||
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
|
||||
err "未检测到 Docker Compose,请安装 docker compose v2 或 docker-compose v1"; exit 1
|
||||
}
|
||||
|
||||
require docker curl jq awk sed tar gzip
|
||||
require_compose
|
||||
|
||||
# Disk space check (MB)
|
||||
check_disk(){ local p="$1"; local need=10240; local free
|
||||
free=$(df -Pm "$p" | awk 'NR==2{print $4+0}')
|
||||
if [[ -z "$free" || "$free" -lt "$need" ]]; then err "磁盘空间不足: $p 剩余 ${free:-0}MB (<${need}MB)"; return 1; fi
|
||||
}
|
||||
|
||||
check_disk "$PKG_ROOT"; check_disk "/var/lib/docker" || true
|
||||
|
||||
# Read SWARM_MANAGER_ADDR from the environment, or prompt for it
|
||||
SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}
|
||||
if [[ -z "${SWARM_MANAGER_ADDR}" ]]; then
|
||||
read -rp "请输入本机管理地址 SWARM_MANAGER_ADDR: " SWARM_MANAGER_ADDR
|
||||
fi
|
||||
info "SWARM_MANAGER_ADDR=$SWARM_MANAGER_ADDR"
|
||||
|
||||
# Verify the IP belongs to a local network interface
|
||||
if ! ip -o addr | awk '{print $4}' | cut -d'/' -f1 | grep -qx "$SWARM_MANAGER_ADDR"; then
|
||||
err "SWARM_MANAGER_ADDR 非本机地址: $SWARM_MANAGER_ADDR"; exit 1; fi
|
||||
|
||||
info "开始分配服务端口(起始=20000,避免系统占用与相互冲突)"
|
||||
is_port_used(){ local p="$1"; ss -tulnH 2>/dev/null | awk '{print $5}' | sed 's/.*://g' | grep -qx "$p"; }
|
||||
declare -A PRESENT=() CHOSEN=() USED=()
|
||||
START_PORT="${START_PORT:-20000}"; cur=$START_PORT
|
||||
ORDER=(MASTER_PORT ES_HTTP_PORT KIBANA_PORT PROMETHEUS_PORT GRAFANA_PORT ALERTMANAGER_PORT \
|
||||
WEB_PROXY_PORT_8080 WEB_PROXY_PORT_8081 WEB_PROXY_PORT_8082 WEB_PROXY_PORT_8083 WEB_PROXY_PORT_8084 WEB_PROXY_PORT_8085 \
|
||||
FTP_PORT FTP_DATA_PORT)
|
||||
|
||||
# Mark the keys that actually exist in .env.example
|
||||
for key in "${ORDER[@]}"; do
|
||||
if grep -q "^${key}=" "$ENV_EX"; then PRESENT[$key]=1; fi
|
||||
done
|
||||
|
||||
next_free(){ local p="$1"; while :; do if [[ -n "${USED[$p]:-}" ]] || is_port_used "$p"; then p=$((p+1)); else echo "$p"; return; fi; done; }
|
||||
|
||||
for key in "${ORDER[@]}"; do
|
||||
[[ -z "${PRESENT[$key]:-}" ]] && continue
|
||||
p=$(next_free "$cur"); CHOSEN[$key]="$p"; USED[$p]=1; cur=$((p+1))
|
||||
done
|
||||
|
||||
info "端口分配结果:MASTER=${CHOSEN[MASTER_PORT]:-} ES=${CHOSEN[ES_HTTP_PORT]:-} KIBANA=${CHOSEN[KIBANA_PORT]:-} PROM=${CHOSEN[PROMETHEUS_PORT]:-} GRAFANA=${CHOSEN[GRAFANA_PORT]:-} ALERT=${CHOSEN[ALERTMANAGER_PORT]:-} WEB_PROXY(8080..8085)=${CHOSEN[WEB_PROXY_PORT_8080]:-}/${CHOSEN[WEB_PROXY_PORT_8081]:-}/${CHOSEN[WEB_PROXY_PORT_8082]:-}/${CHOSEN[WEB_PROXY_PORT_8083]:-}/${CHOSEN[WEB_PROXY_PORT_8084]:-}/${CHOSEN[WEB_PROXY_PORT_8085]:-}"
|
||||
|
||||
cp "$ENV_EX" "$ENV_OUT"
|
||||
# Overwrite the ports with the de-duplicated allocation
|
||||
for key in "${ORDER[@]}"; do
|
||||
val="${CHOSEN[$key]:-}"
|
||||
[[ -z "$val" ]] && continue
|
||||
sed -i -E "s#^$key=.*#$key=${val}#" "$ENV_OUT"
|
||||
done
|
||||
info "已写入 compose/.env 的端口配置"
|
||||
# Append the overlay name if it is not already set
|
||||
grep -q '^ARGUS_OVERLAY_NET=' "$ENV_OUT" || echo 'ARGUS_OVERLAY_NET=argus-sys-net' >> "$ENV_OUT"
|
||||
# Write the UID/GID of the invoking account (avoid picking the docker group by mistake)
|
||||
RUID=$(id -u)
|
||||
PRIMARY_GID=$(id -g)
|
||||
PRIMARY_GRP=$(id -gn)
|
||||
USER_NAME=$(id -un)
|
||||
# If the primary group resolves to docker, try the GID of the group named after the user; otherwise fall back to the primary GID
|
||||
if [[ "$PRIMARY_GRP" == "docker" ]]; then
|
||||
RGID=$(getent group "$USER_NAME" | awk -F: '{print $3}' 2>/dev/null || true)
|
||||
[[ -z "$RGID" ]] && RGID="$PRIMARY_GID"
|
||||
else
|
||||
RGID="$PRIMARY_GID"
|
||||
fi
|
||||
info "使用构建账户 UID:GID=${RUID}:${RGID} (user=$USER_NAME primary_group=$PRIMARY_GRP)"
|
||||
if grep -q '^ARGUS_BUILD_UID=' "$ENV_OUT"; then
|
||||
sed -i -E "s#^ARGUS_BUILD_UID=.*#ARGUS_BUILD_UID=${RUID}#" "$ENV_OUT"
|
||||
else
|
||||
echo "ARGUS_BUILD_UID=${RUID}" >> "$ENV_OUT"
|
||||
fi
|
||||
if grep -q '^ARGUS_BUILD_GID=' "$ENV_OUT"; then
|
||||
sed -i -E "s#^ARGUS_BUILD_GID=.*#ARGUS_BUILD_GID=${RGID}#" "$ENV_OUT"
|
||||
else
|
||||
echo "ARGUS_BUILD_GID=${RGID}" >> "$ENV_OUT"
|
||||
fi
|
||||
|
||||
CI="$PKG_ROOT/cluster-info.env"
|
||||
if [[ -f "$CI" ]]; then
|
||||
if grep -q '^SWARM_MANAGER_ADDR=' "$CI"; then
|
||||
sed -i -E "s#^SWARM_MANAGER_ADDR=.*#SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}#" "$CI"
|
||||
else
|
||||
echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}" >> "$CI"
|
||||
fi
|
||||
else
|
||||
echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}" > "$CI"
|
||||
fi
|
||||
info "已生成 compose/.env 并更新 cluster-info.env 的 SWARM_MANAGER_ADDR。"
|
||||
info "下一步可执行: scripts/install.sh"
|
||||
@ -1,109 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
|
||||
|
||||
ENV_FILE="$ROOT/compose/.env"; [[ -f "$ENV_FILE" ]] && set -a && source "$ENV_FILE" && set +a
|
||||
|
||||
ts="$(date -u +%Y%m%d-%H%M%SZ)"
|
||||
LOG_DIR="$ROOT/logs"; mkdir -p "$LOG_DIR" || true
|
||||
if ! ( : > "$LOG_DIR/.w" 2>/dev/null ); then LOG_DIR="/tmp/argus-logs"; mkdir -p "$LOG_DIR" || true; fi
|
||||
|
||||
# load compose project for accurate ps output
|
||||
ENV_FILE="$ROOT/compose/.env"
|
||||
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
|
||||
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
|
||||
DETAILS="$LOG_DIR/diagnose_details_${ts}.log"; ERRORS="$LOG_DIR/diagnose_error_${ts}.log"; : > "$DETAILS"; : > "$ERRORS"
|
||||
|
||||
logd() { echo "$(date '+%F %T') $*" >> "$DETAILS"; }
|
||||
append_err() { echo "$*" >> "$ERRORS"; }
|
||||
http_code() { curl -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
|
||||
http_body_head() { curl -s --max-time 3 "$1" 2>/dev/null | sed -n '1,5p' || true; }
|
||||
header_val() { curl -s -D - -o /dev/null "$@" | awk -F': ' 'tolower($1)=="access-control-allow-origin"{gsub("\r","",$2);print $2}'; }
|
||||
|
||||
section() { local name="$1"; logd "===== [$name] ====="; }
|
||||
svc() {
|
||||
local svc_name="$1"; local cname="$2"; shift 2
|
||||
section "$svc_name ($cname)"
|
||||
logd "docker ps:"; docker ps -a --format '{{.Names}}\t{{.Status}}\t{{.Image}}' | awk -v n="$cname" '$1==n' >> "$DETAILS" || true
|
||||
logd "docker inspect:"; docker inspect -f '{{.State.Status}} rc={{.RestartCount}} started={{.State.StartedAt}}' "$cname" >> "$DETAILS" 2>&1 || true
|
||||
logd "last 200 container logs:"; docker logs --tail 200 "$cname" >> "$DETAILS" 2>&1 || true
|
||||
docker logs --tail 200 "$cname" 2>&1 | grep -Ei '\b(error|failed|fail|exception|panic|fatal|critical|unhealthy|permission denied|forbidden|refused|traceback|错误|失败)\b' | sed "s/^/[${svc_name}][container] /" >> "$ERRORS" || true
|
||||
if docker exec "$cname" sh -lc 'command -v supervisorctl >/dev/null 2>&1' >/dev/null 2>&1; then
|
||||
logd "supervisorctl status:"; docker exec "$cname" sh -lc 'supervisorctl status' >> "$DETAILS" 2>&1 || true
|
||||
local files; files=$(docker exec "$cname" sh -lc 'ls /var/log/supervisor/*.log 2>/dev/null' || true)
|
||||
for f in $files; do
|
||||
logd "tail -n 80 $f:"; docker exec "$cname" sh -lc "tail -n 80 $f 2>/dev/null || true" >> "$DETAILS" 2>&1 || true
|
||||
docker exec "$cname" sh -lc "tail -n 200 $f 2>/dev/null" 2>/dev/null | grep -Ei '\b(error|failed|fail|exception|panic|fatal|critical|unhealthy|permission denied|forbidden|refused|traceback|错误|失败)\b' | sed "s/^/[${svc_name}][supervisor:$(basename "$f")] /" >> "$ERRORS" || true
|
||||
done
|
||||
fi
|
||||
}
|
||||
|
||||
svc master argus-master-sys
|
||||
svc es argus-es-sys
|
||||
svc kibana argus-kibana-sys
|
||||
svc prometheus argus-prometheus
|
||||
svc grafana argus-grafana
|
||||
svc alertmanager argus-alertmanager
|
||||
svc web-frontend argus-web-frontend
|
||||
svc web-proxy argus-web-proxy
|
||||
|
||||
section HTTP
|
||||
logd "ES: $(http_code "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health")"; http_body_head "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" >> "$DETAILS" 2>&1 || true
|
||||
logd "Kibana: $(http_code "http://localhost:${KIBANA_PORT:-5601}/api/status")"; http_body_head "http://localhost:${KIBANA_PORT:-5601}/api/status" >> "$DETAILS" 2>&1 || true
|
||||
logd "Master readyz: $(http_code "http://localhost:${MASTER_PORT:-32300}/readyz")"
|
||||
logd "Prometheus: $(http_code "http://localhost:${PROMETHEUS_PORT:-9090}/-/ready")"
|
||||
logd "Grafana: $(http_code "http://localhost:${GRAFANA_PORT:-3000}/api/health")"; http_body_head "http://localhost:${GRAFANA_PORT:-3000}/api/health" >> "$DETAILS" 2>&1 || true
|
||||
logd "Alertmanager: $(http_code "http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status")"
|
||||
cors8084=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8084:-8084}/api/v2/status" || true)
|
||||
cors8085=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8085:-8085}/api/v1/master/nodes" || true)
|
||||
logd "Web-Proxy 8080: $(http_code "http://localhost:${WEB_PROXY_PORT_8080:-8080}/")"
|
||||
logd "Web-Proxy 8083: $(http_code "http://localhost:${WEB_PROXY_PORT_8083:-8083}/")"
|
||||
logd "Web-Proxy 8084 CORS: ${cors8084}"
|
||||
logd "Web-Proxy 8085 CORS: ${cors8085}"
|
||||
|
||||
section ES-CHECKS
|
||||
ch=$(curl -s --max-time 3 "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" || true)
|
||||
status=$(printf '%s' "$ch" | awk -F'\"' '/"status"/{print $4; exit}')
|
||||
if [[ -n "$status" ]]; then logd "cluster.status=$status"; fi
|
||||
if [[ "$status" != "green" ]]; then append_err "[es][cluster] status=$status"; fi
|
||||
if docker ps --format '{{.Names}}' | grep -q '^argus-es-sys$'; then
|
||||
duse=$(docker exec argus-es-sys sh -lc 'df -P /usr/share/elasticsearch/data | awk "NR==2{print \$5}"' 2>/dev/null || true)
|
||||
logd "es.data.df_use=$duse"; usep=${duse%%%}
|
||||
if [[ -n "$usep" ]] && (( usep >= 90 )); then append_err "[es][disk] data path usage=${usep}%"; fi
|
||||
fi
|
||||
|
||||
section DNS-IN-PROXY
|
||||
for d in master.argus.com es.log.argus.com kibana.log.argus.com grafana.metric.argus.com prom.metric.argus.com alertmanager.alert.argus.com; do
|
||||
docker exec argus-web-proxy sh -lc "getent hosts $d || nslookup $d 2>/dev/null | tail -n+1" >> "$DETAILS" 2>&1 || true
|
||||
done
|
||||
logd "HTTP (web-proxy): master.readyz=$(docker exec argus-web-proxy sh -lc "curl -s -o /dev/null -w '%{http_code}' http://master.argus.com:3000/readyz" 2>/dev/null || echo 000)"
|
||||
logd "HTTP (web-proxy): es.health=$(docker exec argus-web-proxy sh -lc "curl -s -o /dev/null -w '%{http_code}' http://es.log.argus.com:9200/_cluster/health" 2>/dev/null || echo 000)"
|
||||
logd "HTTP (web-proxy): kibana.status=$(docker exec argus-web-proxy sh -lc "curl -s -o /dev/null -w '%{http_code}' http://kibana.log.argus.com:5601/api/status" 2>/dev/null || echo 000)"
|
||||
|
||||
section SYSTEM
|
||||
logd "uname -a:"; uname -a >> "$DETAILS"
|
||||
logd "docker version:"; docker version --format '{{.Server.Version}}' >> "$DETAILS" 2>&1 || true
|
||||
logd "compose ps (project=$PROJECT):"; (cd "$ROOT/compose" && docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f docker-compose.yml ps) >> "$DETAILS" 2>&1 || true
|
||||
|
||||
section SUMMARY
|
||||
[[ $(http_code "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health") != 200 ]] && echo "[es][http] health not 200" >> "$ERRORS"
|
||||
kbcode=$(http_code "http://localhost:${KIBANA_PORT:-5601}/api/status"); [[ "$kbcode" != 200 ]] && echo "[kibana][http] /api/status=$kbcode" >> "$ERRORS"
|
||||
[[ $(http_code "http://localhost:${MASTER_PORT:-32300}/readyz") != 200 ]] && echo "[master][http] /readyz not 200" >> "$ERRORS"
|
||||
[[ $(http_code "http://localhost:${PROMETHEUS_PORT:-9090}/-/ready") != 200 ]] && echo "[prometheus][http] /-/ready not 200" >> "$ERRORS"
|
||||
gfcode=$(http_code "http://localhost:${GRAFANA_PORT:-3000}/api/health"); [[ "$gfcode" != 200 ]] && echo "[grafana][http] /api/health=$gfcode" >> "$ERRORS"
|
||||
[[ $(http_code "http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status") != 200 ]] && echo "[alertmanager][http] /api/v2/status not 200" >> "$ERRORS"
|
||||
[[ -z "$cors8084" ]] && echo "[web-proxy][cors] 8084 missing Access-Control-Allow-Origin" >> "$ERRORS"
|
||||
[[ -z "$cors8085" ]] && echo "[web-proxy][cors] 8085 missing Access-Control-Allow-Origin" >> "$ERRORS"
|
||||
sort -u -o "$ERRORS" "$ERRORS"
|
||||
|
||||
echo "Diagnostic details -> $DETAILS"
|
||||
echo "Detected errors -> $ERRORS"
|
||||
|
||||
if [[ "$LOG_DIR" == "$ROOT/logs" ]]; then
|
||||
ln -sfn "$(basename "$DETAILS")" "$ROOT/logs/diagnose_details.log" 2>/dev/null || cp "$DETAILS" "$ROOT/logs/diagnose_details.log" 2>/dev/null || true
|
||||
ln -sfn "$(basename "$ERRORS")" "$ROOT/logs/diagnose_error.log" 2>/dev/null || cp "$ERRORS" "$ROOT/logs/diagnose_error.log" 2>/dev/null || true
|
||||
fi
|
||||
|
||||
exit 0
|
||||
@ -1,11 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
HOST="${1:-http://127.0.0.1:9200}"
|
||||
echo "设置 ES watermark 为 95%/96%/97%: $HOST"
|
||||
curl -fsS -XPUT "$HOST/_cluster/settings" -H 'Content-Type: application/json' -d '{
|
||||
"transient": {
|
||||
"cluster.routing.allocation.disk.watermark.low": "95%",
|
||||
"cluster.routing.allocation.disk.watermark.high": "96%",
|
||||
"cluster.routing.allocation.disk.watermark.flood_stage": "97%"
|
||||
}
|
||||
}' && printf '\nOK\n'
|
||||
@ -1,11 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
HOST="${1:-http://127.0.0.1:9200}"
|
||||
echo "恢复 ES watermark 为默认值: $HOST"
|
||||
curl -fsS -XPUT "$HOST/_cluster/settings" -H 'Content-Type: application/json' -d '{
|
||||
"transient": {
|
||||
"cluster.routing.allocation.disk.watermark.low": null,
|
||||
"cluster.routing.allocation.disk.watermark.high": null,
|
||||
"cluster.routing.allocation.disk.watermark.flood_stage": null
|
||||
}
|
||||
}' && printf '\nOK\n'
|
||||
@ -1,137 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
PKG_ROOT="$ROOT_DIR"
|
||||
ENV_FILE="$PKG_ROOT/compose/.env"
|
||||
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
|
||||
|
||||
info(){ echo -e "\033[34m[INSTALL]\033[0m $*"; }
|
||||
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
|
||||
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "缺少依赖: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
|
||||
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
|
||||
require_compose(){
|
||||
if docker compose version >/dev/null 2>&1; then return 0; fi
|
||||
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
|
||||
err "未检测到 Docker Compose,请安装 docker compose v2 或 docker-compose v1"; exit 1
|
||||
}
|
||||
require docker curl jq awk sed tar gzip
|
||||
require_compose
|
||||
|
||||
[[ -f "$ENV_FILE" ]] || { err "缺少 compose/.env,请先运行 scripts/config.sh"; exit 1; }
|
||||
info "使用环境文件: $ENV_FILE"
|
||||
set -a; source "$ENV_FILE"; set +a
|
||||
# Compatibility: if .env lacks SWARM_MANAGER_ADDR, read it from an existing cluster-info.env so an empty value is not written
|
||||
SMADDR="${SWARM_MANAGER_ADDR:-}"
|
||||
CI_FILE="$PKG_ROOT/cluster-info.env"
|
||||
if [[ -z "$SMADDR" && -f "$CI_FILE" ]]; then
|
||||
SMADDR=$(sed -n 's/^SWARM_MANAGER_ADDR=\(.*\)$/\1/p' "$CI_FILE" | head -n1)
|
||||
fi
|
||||
SWARM_MANAGER_ADDR="$SMADDR"
|
||||
|
||||
# Swarm init & overlay
|
||||
if ! docker info 2>/dev/null | grep -q "Swarm: active"; then
|
||||
[[ -n "${SWARM_MANAGER_ADDR:-}" ]] || { err "SWARM_MANAGER_ADDR 未设置,请在 scripts/config.sh 中配置"; exit 1; }
|
||||
info "初始化 Swarm (--advertise-addr $SWARM_MANAGER_ADDR)"
|
||||
docker swarm init --advertise-addr "$SWARM_MANAGER_ADDR" >/dev/null 2>&1 || true
|
||||
else
|
||||
info "Swarm 已激活"
|
||||
fi
|
||||
NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
|
||||
if ! docker network inspect "$NET_NAME" >/dev/null 2>&1; then
|
||||
info "创建 overlay 网络: $NET_NAME"
|
||||
docker network create -d overlay --attachable "$NET_NAME" >/dev/null
|
||||
else
|
||||
info "overlay 网络已存在: $NET_NAME"
|
||||
fi
|
||||
|
||||
# Load images
|
||||
IMAGES_DIR="$PKG_ROOT/images"
|
||||
shopt -s nullglob
|
||||
tars=("$IMAGES_DIR"/*.tar.gz)
|
||||
if [[ ${#tars[@]} -eq 0 ]]; then err "images 目录为空,缺少镜像 tar.gz"; exit 1; fi
|
||||
total=${#tars[@]}; idx=0
|
||||
for tgz in "${tars[@]}"; do
|
||||
idx=$((idx+1))
|
||||
info "导入镜像 ($idx/$total): $(basename "$tgz")"
|
||||
tmp=$(mktemp); gunzip -c "$tgz" > "$tmp"; docker load -i "$tmp" >/dev/null; rm -f "$tmp"
|
||||
done
|
||||
shopt -u nullglob
|
||||
|
||||
# Compose up
|
||||
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
|
||||
info "启动服务栈 (docker compose -p $PROJECT up -d)"
|
||||
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" up -d
|
||||
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps
|
||||
|
||||
# Wait readiness (best-effort)
|
||||
code(){ curl -4 -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
|
||||
prom_ok(){ (exec 3<>/dev/tcp/127.0.0.1/${PROMETHEUS_PORT:-9090}) >/dev/null 2>&1 && return 0 || return 1; }
|
||||
kb_ok(){ local body; body=$(curl -s "http://127.0.0.1:${KIBANA_PORT:-5601}/api/status" || true); echo "$body" | grep -q '"level"\s*:\s*"available"'; }
|
||||
RETRIES=${RETRIES:-60}; SLEEP=${SLEEP:-5}; ok=0
|
||||
info "等待基础服务就绪 (<= $((RETRIES*SLEEP))s)"
|
||||
for i in $(seq 1 "$RETRIES"); do
|
||||
e1=$(code "http://127.0.0.1:${MASTER_PORT:-32300}/readyz")
|
||||
e2=$(code "http://127.0.0.1:${ES_HTTP_PORT:-9200}/_cluster/health")
|
||||
e3=000; prom_ok && e3=200
|
||||
e4=$(code "http://127.0.0.1:${GRAFANA_PORT:-3000}/api/health")
|
||||
e5=$(code "http://127.0.0.1:${ALERTMANAGER_PORT:-9093}/api/v2/status")
|
||||
e6=$(kb_ok && echo 200 || echo 000)
|
||||
info "[ready] t=$((i*SLEEP))s master=$e1 es=$e2 prom=$e3 graf=$e4 alert=$e5 kibana=$e6"
|
||||
[[ "$e1" == 200 ]] && ok=$((ok+1))
|
||||
[[ "$e2" == 200 ]] && ok=$((ok+1))
|
||||
[[ "$e3" == 200 ]] && ok=$((ok+1))
|
||||
[[ "$e4" == 200 ]] && ok=$((ok+1))
|
||||
[[ "$e5" == 200 ]] && ok=$((ok+1))
|
||||
[[ "$e6" == 200 ]] && ok=$((ok+1))
|
||||
if [[ $ok -ge 6 ]]; then break; fi; ok=0; sleep "$SLEEP"
|
||||
done
|
||||
[[ $ok -ge 6 ]] || err "Some services are not ready yet (you can re-run selfcheck later)"
|
||||
|
||||
# Swarm join tokens
|
||||
TOKEN_WORKER=$(docker swarm join-token -q worker 2>/dev/null || echo "")
|
||||
TOKEN_MANAGER=$(docker swarm join-token -q manager 2>/dev/null || echo "")
|
||||
|
||||
# cluster-info.env (the compose setup no longer depends on BINDIP/FTPIP)
|
||||
CI="$PKG_ROOT/cluster-info.env"
|
||||
info "Writing cluster-info.env (manager/token)"
|
||||
{
|
||||
echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}"
|
||||
echo "SWARM_JOIN_TOKEN_WORKER=${TOKEN_WORKER:-}"
|
||||
echo "SWARM_JOIN_TOKEN_MANAGER=${TOKEN_MANAGER:-}"
|
||||
} > "$CI"
|
||||
info "Wrote $CI"
|
||||
|
||||
# Installation report
|
||||
ts=$(date +%Y%m%d-%H%M%S)
|
||||
RPT="$PKG_ROOT/安装报告_${ts}.md"
|
||||
{
|
||||
echo "# Argus Server Installation Report (${ts})"
|
||||
echo
|
||||
echo "## Port mappings"
|
||||
echo "- MASTER_PORT=${MASTER_PORT}"
|
||||
echo "- ES_HTTP_PORT=${ES_HTTP_PORT}"
|
||||
echo "- KIBANA_PORT=${KIBANA_PORT}"
|
||||
echo "- PROMETHEUS_PORT=${PROMETHEUS_PORT}"
|
||||
echo "- GRAFANA_PORT=${GRAFANA_PORT}"
|
||||
echo "- ALERTMANAGER_PORT=${ALERTMANAGER_PORT}"
|
||||
echo "- WEB_PROXY_PORT_8080=${WEB_PROXY_PORT_8080} ... 8085=${WEB_PROXY_PORT_8085}"
|
||||
echo
|
||||
echo "## Swarm/Overlay"
|
||||
echo "- SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}"
|
||||
echo "- NET=${NET_NAME}"
|
||||
echo "- JOIN_TOKEN_WORKER=${TOKEN_WORKER:-}"
|
||||
echo "- JOIN_TOKEN_MANAGER=${TOKEN_MANAGER:-}"
|
||||
echo
|
||||
echo "## Health checks (summary)"
|
||||
echo "- master/readyz=$(code http://127.0.0.1:${MASTER_PORT:-32300}/readyz)"
|
||||
echo "- es/_cluster/health=$(code http://127.0.0.1:${ES_HTTP_PORT:-9200}/_cluster/health)"
|
||||
echo "- grafana/api/health=$(code http://127.0.0.1:${GRAFANA_PORT:-3000}/api/health)"
|
||||
echo "- prometheus/tcp=$([[ $(prom_ok; echo $?) == 0 ]] && echo 200 || echo 000)"
|
||||
echo "- alertmanager/api/v2/status=$(code http://127.0.0.1:${ALERTMANAGER_PORT:-9093}/api/v2/status)"
|
||||
echo "- kibana/api/status=$([[ $(kb_ok; echo $?) == 0 ]] && echo available || echo not-ready)"
|
||||
} > "$RPT"
|
||||
info "Report generated: $RPT"
|
||||
|
||||
info "Installation complete. cluster-info.env can now be handed to whoever installs Client-GPU."
|
||||
docker exec argus-web-proxy nginx -t >/dev/null 2>&1 && docker exec argus-web-proxy nginx -s reload >/dev/null 2>&1 || true
|
||||
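The readiness wait in the install script above re-runs every probe each round and stops once all of them pass or the retry budget is spent. The loop can be sketched in Python as follows. The names `wait_until_ready` and `checks`, and the injectable `sleep` parameter, are illustrative assumptions and not part of the repository.

```python
import time
from typing import Callable, Dict


def _safe(check: Callable[[], bool]) -> bool:
    """Run one probe; any exception counts as 'not ready'."""
    try:
        return bool(check())
    except Exception:
        return False


def wait_until_ready(
    checks: Dict[str, Callable[[], bool]],
    retries: int = 60,
    sleep_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bool]:
    """Poll all probes each round and return the results of the last round.

    Like the shell loop, readiness is re-evaluated from scratch every round:
    a probe that passed earlier but fails now counts as not ready.
    """
    results: Dict[str, bool] = {}
    for attempt in range(1, retries + 1):
        results = {name: _safe(check) for name, check in checks.items()}
        if all(results.values()):
            break
        if attempt < retries:
            sleep(sleep_s)
    return results
```

Returning the per-probe results rather than a single flag lets the caller report which service was still unavailable when the budget ran out.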
@ -1,83 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
|
||||
|
||||
log() { echo -e "\033[0;34m[CHECK]\033[0m $*"; }
|
||||
err() { echo -e "\033[0;31m[ERROR]\033[0m $*" >&2; }
|
||||
|
||||
ENV_FILE="$ROOT/compose/.env"; [[ -f "$ENV_FILE" ]] && set -a && source "$ENV_FILE" && set +a
|
||||
|
||||
wait_http() { local url="$1"; local attempts=${2:-120}; local i=1; while ((i<=attempts)); do curl -fsS "$url" >/dev/null 2>&1 && return 0; echo "[..] waiting $url ($i/$attempts)"; sleep 5; ((i++)); done; return 1; }
|
||||
code_for() { curl -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
|
||||
header_val() { curl -s -D - -o /dev/null "$@" | awk -F': ' 'BEGIN{IGNORECASE=1}$1=="Access-Control-Allow-Origin"{gsub("\r","",$2);print $2}'; }
|
||||
|
||||
LOG_DIR="$ROOT/logs"; mkdir -p "$LOG_DIR" || true
|
||||
OUT_JSON="$LOG_DIR/selfcheck.json"; tmp=$(mktemp)
|
||||
|
||||
ok=1
|
||||
|
||||
log "checking overlay network"
|
||||
net_ok=false
|
||||
if docker network inspect "${ARGUS_OVERLAY_NET:-argus-sys-net}" >/dev/null 2>&1; then
|
||||
if docker network inspect "${ARGUS_OVERLAY_NET:-argus-sys-net}" | grep -q '"Driver": "overlay"'; then net_ok=true; fi
|
||||
fi
|
||||
[[ "$net_ok" == true ]] || ok=0
|
||||
|
||||
log "checking Elasticsearch"
|
||||
wait_http "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" 60 || ok=0
|
||||
|
||||
log "checking Kibana"
|
||||
kb_code=$(code_for "http://localhost:${KIBANA_PORT:-5601}/api/status")
|
||||
kb_ok=false
|
||||
if [[ "$kb_code" == 200 ]]; then
|
||||
body=$(curl -sS "http://localhost:${KIBANA_PORT:-5601}/api/status" || true)
|
||||
echo "$body" | grep -q '"level"\s*:\s*"available"' && kb_ok=true
|
||||
fi
|
||||
[[ "$kb_ok" == true ]] || ok=0
|
||||
|
||||
log "checking Master"
|
||||
[[ $(code_for "http://localhost:${MASTER_PORT:-32300}/readyz") == 200 ]] || ok=0
|
||||
|
||||
log "checking Prometheus"
|
||||
wait_http "http://localhost:${PROMETHEUS_PORT:-9090}/-/ready" 60 || ok=0
|
||||
|
||||
log "checking Grafana"
|
||||
gf_code=$(code_for "http://localhost:${GRAFANA_PORT:-3000}/api/health")
|
||||
gf_ok=false; if [[ "$gf_code" == 200 ]]; then body=$(curl -sS "http://localhost:${GRAFANA_PORT:-3000}/api/health" || true); echo "$body" | grep -q '"database"\s*:\s*"ok"' && gf_ok=true; fi
|
||||
[[ "$gf_ok" == true ]] || ok=0
|
||||
|
||||
log "checking Alertmanager"
|
||||
wait_http "http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status" 60 || ok=0
|
||||
|
||||
log "checking Web-Proxy (CORS)"
|
||||
cors8084=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8084:-8084}/api/v2/status" || true)
|
||||
cors8085=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8085:-8085}/api/v1/master/nodes" || true)
|
||||
wp_ok=true
|
||||
[[ -n "$cors8084" && -n "$cors8085" ]] || wp_ok=false
|
||||
[[ "$wp_ok" == true ]] || ok=0
|
||||
|
||||
cat > "$tmp" <<JSON
|
||||
{
|
||||
"overlay_net": $net_ok,
|
||||
"es": true,
|
||||
"kibana": $kb_ok,
|
||||
"master_readyz": true,
|
||||
"prometheus": true,
|
||||
"grafana": $gf_ok,
|
||||
"alertmanager": true,
|
||||
"web_proxy_cors": $wp_ok,
|
||||
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
|
||||
}
|
||||
JSON
|
||||
|
||||
mv "$tmp" "$OUT_JSON" 2>/dev/null || cp "$tmp" "$OUT_JSON"
|
||||
|
||||
if [[ "$ok" == 1 ]]; then
|
||||
log "selfcheck OK -> $OUT_JSON"
|
||||
exit 0
|
||||
else
|
||||
err "selfcheck FAILED -> $OUT_JSON"
|
||||
exit 1
|
||||
fi
|
||||
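In the selfcheck script above, the JSON report writes `es`, `master_readyz`, `prometheus` and `alertmanager` as literal `true`, so the file can disagree with the script's exit code. Below is a minimal Python sketch of deriving every field from a measured result. The function name `build_report` and the shape of `results` are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


def build_report(results: Dict[str, bool], now: Optional[datetime] = None) -> Tuple[str, bool]:
    """Return (json_text, overall_ok) built only from measured results."""
    now = now or datetime.now(timezone.utc)
    report = {name: bool(ok) for name, ok in results.items()}
    # Same UTC timestamp format as the shell script's `date -u` call.
    report["timestamp"] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    overall_ok = all(results.values())
    return json.dumps(report, indent=2), overall_ok
```

With this shape, the overall status and the per-service fields come from the same dictionary and cannot drift apart.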
@ -1,9 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
PKG_ROOT="$ROOT_DIR"
|
||||
ENV_FILE="$PKG_ROOT/compose/.env"
|
||||
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
|
||||
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
|
||||
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
|
||||
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps
|
||||
@ -1,23 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
PKG_ROOT="$ROOT_DIR"
|
||||
ENV_FILE="$PKG_ROOT/compose/.env"
|
||||
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
|
||||
|
||||
# load COMPOSE_PROJECT_NAME from env file if present
|
||||
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
|
||||
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
|
||||
|
||||
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
|
||||
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
|
||||
require_compose(){
|
||||
if docker compose version >/dev/null 2>&1; then return 0; fi
|
||||
if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
|
||||
err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
|
||||
}
|
||||
require_compose
|
||||
|
||||
echo "[UNINSTALL] stopping compose (project=$PROJECT)"
|
||||
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" down --remove-orphans || true
|
||||
echo "[UNINSTALL] done"
|
||||
Binary file not shown.
@ -37,11 +37,22 @@ _argus_is_number() {
|
||||
[[ "$1" =~ ^[0-9]+$ ]]
|
||||
}
|
||||
|
||||
_argus_read_user_from_files() {
|
||||
local uid_out_var="$1" gid_out_var="$2"; shift 2
|
||||
local uid_val="$ARGUS_BUILD_UID_DEFAULT" gid_val="$ARGUS_BUILD_GID_DEFAULT"
|
||||
local config
|
||||
for config in "$@"; do
|
||||
load_build_user() {
|
||||
if [[ "$_ARGUS_BUILD_USER_LOADED" == "1" ]]; then
|
||||
return 0
|
||||
fi
|
||||
|
||||
local project_root config_files config uid gid
|
||||
project_root="$(argus_project_root)"
|
||||
config_files=(
|
||||
"$project_root/configs/build_user.local.conf"
|
||||
"$project_root/configs/build_user.conf"
|
||||
)
|
||||
|
||||
uid="$ARGUS_BUILD_UID_DEFAULT"
|
||||
gid="$ARGUS_BUILD_GID_DEFAULT"
|
||||
|
||||
for config in "${config_files[@]}"; do
|
||||
if [[ -f "$config" ]]; then
|
||||
while IFS= read -r raw_line || [[ -n "$raw_line" ]]; do
|
||||
local line key value
|
||||
@ -57,58 +68,42 @@ _argus_read_user_from_files() {
|
||||
key="$(_argus_trim "$key")"
|
||||
value="$(_argus_trim "$value")"
|
||||
case "$key" in
|
||||
UID) uid_val="$value" ;;
|
||||
GID) gid_val="$value" ;;
|
||||
*) echo "[ARGUS build_user] Unknown key '$key' in $config" >&2 ;;
|
||||
UID)
|
||||
uid="$value"
|
||||
;;
|
||||
GID)
|
||||
gid="$value"
|
||||
;;
|
||||
*)
|
||||
echo "[ARGUS build_user] Unknown key '$key' in $config" >&2
|
||||
;;
|
||||
esac
|
||||
done < "$config"
|
||||
break
|
||||
fi
|
||||
done
|
||||
printf -v "$uid_out_var" '%s' "$uid_val"
|
||||
printf -v "$gid_out_var" '%s' "$gid_val"
|
||||
}
|
||||
|
||||
load_build_user_profile() {
|
||||
local profile="${1:-default}"
|
||||
if [[ "$_ARGUS_BUILD_USER_LOADED" == "1" ]]; then
|
||||
return 0
|
||||
if [[ -n "${ARGUS_BUILD_UID:-}" ]]; then
|
||||
uid="$ARGUS_BUILD_UID"
|
||||
fi
|
||||
if [[ -n "${ARGUS_BUILD_GID:-}" ]]; then
|
||||
gid="$ARGUS_BUILD_GID"
|
||||
fi
|
||||
local project_root uid gid
|
||||
project_root="$(argus_project_root)"
|
||||
case "$profile" in
|
||||
pkg)
|
||||
_argus_read_user_from_files uid gid \
|
||||
"$project_root/configs/build_user.pkg.conf" \
|
||||
"$project_root/configs/build_user.local.conf" \
|
||||
"$project_root/configs/build_user.conf"
|
||||
;;
|
||||
default|*)
|
||||
_argus_read_user_from_files uid gid \
|
||||
"$project_root/configs/build_user.local.conf" \
|
||||
"$project_root/configs/build_user.conf"
|
||||
;;
|
||||
esac
|
||||
|
||||
if [[ -n "${ARGUS_BUILD_UID:-}" ]]; then uid="$ARGUS_BUILD_UID"; fi
|
||||
if [[ -n "${ARGUS_BUILD_GID:-}" ]]; then gid="$ARGUS_BUILD_GID"; fi
|
||||
|
||||
if ! _argus_is_number "$uid"; then
|
||||
echo "[ARGUS build_user] Invalid UID '$uid'" >&2; return 1
|
||||
echo "[ARGUS build_user] Invalid UID '$uid'" >&2
|
||||
return 1
|
||||
fi
|
||||
if ! _argus_is_number "$gid"; then
|
||||
echo "[ARGUS build_user] Invalid GID '$gid'" >&2; return 1
|
||||
echo "[ARGUS build_user] Invalid GID '$gid'" >&2
|
||||
return 1
|
||||
fi
|
||||
|
||||
export ARGUS_BUILD_UID="$uid"
|
||||
export ARGUS_BUILD_GID="$gid"
|
||||
_ARGUS_BUILD_USER_LOADED=1
|
||||
}
|
||||
|
||||
load_build_user() {
|
||||
local profile="${ARGUS_BUILD_PROFILE:-default}"
|
||||
load_build_user_profile "$profile"
|
||||
}
|
||||
|
||||
argus_build_user_args() {
|
||||
load_build_user
|
||||
printf '%s' "--build-arg ARGUS_BUILD_UID=${ARGUS_BUILD_UID} --build-arg ARGUS_BUILD_GID=${ARGUS_BUILD_GID}"
|
||||
|
||||
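The UID/GID lookup in `build_user.sh` above follows a fixed precedence: defaults, then the first config file that exists, then the `ARGUS_BUILD_UID` / `ARGUS_BUILD_GID` environment variables, followed by a numeric check. The Python sketch below illustrates that precedence under several assumptions: the `KEY=value` file syntax and the default values `2133` / `2015` (taken from the Dockerfile `ARG` defaults) are guesses, since the diff shows neither the parsing lines nor `ARGUS_BUILD_UID_DEFAULT`, and `resolve_build_user` is a hypothetical name.

```python
import re
from pathlib import Path
from typing import Iterable, Mapping, Tuple


def resolve_build_user(
    config_files: Iterable[str],
    env: Mapping[str, str],
    default_uid: str = "2133",  # assumed default, not confirmed by the diff
    default_gid: str = "2015",  # assumed default, not confirmed by the diff
) -> Tuple[int, int]:
    uid, gid = default_uid, default_gid
    for path in map(Path, config_files):
        if not path.is_file():
            continue
        for raw in path.read_text().splitlines():
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = (part.strip() for part in line.split("=", 1))
            if key == "UID":
                uid = value
            elif key == "GID":
                gid = value
            # Other keys are ignored here; the shell version logs a warning.
        break  # only the first existing file is read, as in the shell loop
    # A non-empty environment variable overrides the file value.
    uid = env.get("ARGUS_BUILD_UID") or uid
    gid = env.get("ARGUS_BUILD_GID") or gid
    for label, value in (("UID", uid), ("GID", gid)):
        if not re.fullmatch(r"[0-9]+", value):
            raise ValueError(f"Invalid {label} '{value}'")
    return int(uid), int(gid)
```

For the `pkg` profile, the caller would pass `build_user.pkg.conf` as the first entry of `config_files`; the rest of the logic is unchanged.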
1
src/agent/.gitignore
vendored
@ -3,4 +3,3 @@ build/
|
||||
__pycache__/
|
||||
|
||||
.env
|
||||
dist/
|
||||
|
||||
@ -4,7 +4,6 @@ import os
|
||||
import re
|
||||
import socket
|
||||
import subprocess
|
||||
import ipaddress
|
||||
from pathlib import Path
|
||||
from typing import Any, Dict
|
||||
|
||||
@ -17,47 +16,11 @@ _HOSTNAME_PATTERN = re.compile(r"^([^-]+)-([^-]+)-([^-]+)-.*$")
|
||||
|
||||
|
||||
def collect_metadata(config: AgentConfig) -> Dict[str, Any]:
|
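||||
"""Collect the static information needed for node registration, with smarter IP selection.

Rules (highest priority first):
1) the address given by AGENT_PUBLISH_IP;
2) interface scan: exclude AGENT_EXCLUDE_IFACES, prefer AGENT_PREFER_NET_CIDRS;
3) hostname A record (only if it falls within a preferred CIDR);
4) default-route fallback (the UDP socket trick).

Also published: overlay_ip / gwbridge_ip / interfaces, for use by Master and for diagnostics.
"""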
|
||||
"""Collect the static information needed for node registration."""
|
||||
hostname = config.hostname
|
||||
|
||||
prefer_cidrs = _read_cidrs_env(
|
||||
os.environ.get("AGENT_PREFER_NET_CIDRS", "10.0.0.0/8,172.31.0.0/16")
|
||||
)
|
||||
exclude_ifaces = _read_csv_env(
|
||||
os.environ.get("AGENT_EXCLUDE_IFACES", "docker_gwbridge,lo")
|
||||
)
|
||||
|
||||
# interface inventory
|
||||
interfaces = _list_global_ipv4_addrs()
|
||||
if exclude_ifaces:
|
||||
interfaces = [it for it in interfaces if it[0] not in set(exclude_ifaces)]
|
||||
|
||||
# resolve hostname candidates
|
||||
host_ips = _resolve_hostname_ips(hostname)
|
||||
|
||||
selected_ip, overlay_ip, gwbridge_ip = _select_publish_ips(
|
||||
interfaces=interfaces,
|
||||
host_ips=host_ips,
|
||||
prefer_cidrs=prefer_cidrs,
|
||||
)
|
||||
|
||||
meta: Dict[str, Any] = {
|
||||
meta = {
|
||||
"hostname": hostname,
|
||||
"ip": os.environ.get("AGENT_PUBLISH_IP", selected_ip), # keep required field
|
||||
"overlay_ip": overlay_ip or selected_ip,
|
||||
"gwbridge_ip": gwbridge_ip,
|
||||
"interfaces": [
|
||||
{"iface": name, "ip": ip} for name, ip in interfaces
|
||||
],
|
||||
"ip": _detect_ip_address(),
|
||||
"env": config.environment,
|
||||
"user": config.user,
|
||||
"instance": config.instance,
|
||||
@ -133,7 +96,7 @@ def _detect_gpu_count() -> int:
|
||||
|
||||
|
||||
def _detect_ip_address() -> str:
|
||||
"""Legacy interface kept as the last-resort fallback: default-route source address, then hostname resolution, then 127.0.0.1."""
|
||||
"""Try to obtain the container's egress IP through a UDP socket; fall back to resolving the hostname on failure."""
|
||||
try:
|
||||
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
|
||||
sock.connect(("8.8.8.8", 80))
|
||||
@ -145,118 +108,3 @@ def _detect_ip_address() -> str:
|
||||
except OSError:
|
||||
LOGGER.warning("Unable to resolve hostname to IP; defaulting to 127.0.0.1")
|
||||
return "127.0.0.1"
|
||||
|
||||
|
||||
def _read_csv_env(raw: str | None) -> list[str]:
|
||||
if not raw:
|
||||
return []
|
||||
return [x.strip() for x in raw.split(",") if x.strip()]
|
||||
|
||||
|
||||
def _read_cidrs_env(raw: str | None) -> list[ipaddress.IPv4Network]:
|
||||
cidrs: list[ipaddress.IPv4Network] = []
|
||||
for item in _read_csv_env(raw):
|
||||
try:
|
||||
net = ipaddress.ip_network(item, strict=False)
|
||||
if isinstance(net, (ipaddress.IPv4Network,)):
|
||||
cidrs.append(net)
|
||||
except ValueError:
|
||||
LOGGER.warning("Ignoring invalid CIDR in AGENT_PREFER_NET_CIDRS", extra={"cidr": item})
|
||||
return cidrs
|
||||
|
||||
|
||||
def _list_global_ipv4_addrs() -> list[tuple[str, str]]:
|
||||
"""List global IPv4 addresses as (iface, ip) pairs.
Relies on iproute2: ip -4 -o addr show scope global
"""
|
||||
results: list[tuple[str, str]] = []
|
||||
try:
|
||||
proc = subprocess.run(
|
||||
["sh", "-lc", "ip -4 -o addr show scope global | awk '{print $2, $4}'"],
|
||||
check=False,
|
||||
stdout=subprocess.PIPE,
|
||||
stderr=subprocess.PIPE,
|
||||
text=True,
|
||||
timeout=3,
|
||||
)
|
||||
if proc.returncode == 0:
|
||||
for line in proc.stdout.splitlines():
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
parts = line.split()
|
||||
if len(parts) != 2:
|
||||
continue
|
||||
iface, cidr = parts
|
||||
ip = cidr.split("/")[0]
|
||||
try:
|
||||
ipaddress.IPv4Address(ip)
|
||||
except ValueError:
|
||||
continue
|
||||
results.append((iface, ip))
|
||||
except Exception as exc: # pragma: no cover - defensive
|
||||
LOGGER.debug("Failed to list interfaces", extra={"error": str(exc)})
|
||||
return results
|
||||
|
||||
|
||||
def _resolve_hostname_ips(name: str) -> list[str]:
|
||||
ips: list[str] = []
|
||||
try:
|
||||
infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
|
||||
for info in infos:
|
||||
ip = info[4][0]
|
||||
if ip not in ips:
|
||||
ips.append(ip)
|
||||
except OSError:
|
||||
pass
|
||||
return ips
|
||||
|
||||
|
||||
def _pick_by_cidrs(candidates: list[str], prefer_cidrs: list[ipaddress.IPv4Network]) -> str | None:
|
||||
for net in prefer_cidrs:
|
||||
for ip in candidates:
|
||||
try:
|
||||
if ipaddress.ip_address(ip) in net:
|
||||
return ip
|
||||
except ValueError:
|
||||
continue
|
||||
return None
|
||||
|
||||
|
||||
def _select_publish_ips(
|
||||
*,
|
||||
interfaces: list[tuple[str, str]],
|
||||
host_ips: list[str],
|
||||
prefer_cidrs: list[ipaddress.IPv4Network],
|
||||
) -> tuple[str, str | None, str | None]:
|
||||
"""Return (selected_ip, overlay_ip, gwbridge_ip).

- overlay_ip: the first interface address matching prefer_cidrs (10.0/8 before 172.31/16).
- gwbridge_ip: recorded if an address in 172.22/16 exists.
- selected_ip: AGENT_PUBLISH_IP first; otherwise overlay_ip; otherwise a hostname A record within prefer_cidrs; otherwise the default-route fallback.
"""
|
||||
# detect gwbridge (172.22/16)
|
||||
gwbridge_net = ipaddress.ip_network("172.22.0.0/16")
|
||||
gwbridge_ip = None
|
||||
for _, ip in interfaces:
|
||||
try:
|
||||
if ipaddress.ip_address(ip) in gwbridge_net:
|
||||
gwbridge_ip = ip
|
||||
break
|
||||
except ValueError:
|
||||
continue
|
||||
|
||||
# overlay candidate from interfaces by prefer cidrs
|
||||
iface_ips = [ip for _, ip in interfaces]
|
||||
overlay_ip = _pick_by_cidrs(iface_ips, prefer_cidrs)
|
||||
|
||||
# hostname A records filtered by prefer cidrs
|
||||
host_pref = _pick_by_cidrs(host_ips, prefer_cidrs)
|
||||
|
||||
env_ip = os.environ.get("AGENT_PUBLISH_IP")
|
||||
if env_ip:
|
||||
selected = env_ip
|
||||
else:
|
||||
selected = overlay_ip or host_pref or _detect_ip_address()
|
||||
|
||||
return selected, overlay_ip, gwbridge_ip
|
||||
|
||||
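The selection logic above iterates over the preferred CIDRs in the outer loop, so the order of `AGENT_PREFER_NET_CIDRS` decides the result, not the order in which addresses were discovered. The following self-contained sketch shows that behaviour. `pick_by_cidrs` mirrors `_pick_by_cidrs` from the diff but takes CIDR strings; `select_publish_ip` and its `fallback` parameter are simplified stand-ins introduced for this example.

```python
import ipaddress
from typing import List, Optional


def pick_by_cidrs(candidates: List[str], prefer_cidrs: List[str]) -> Optional[str]:
    """Return the first candidate inside the highest-priority CIDR."""
    nets = [ipaddress.ip_network(cidr, strict=False) for cidr in prefer_cidrs]
    for net in nets:  # CIDR priority is the outer loop
        for ip in candidates:
            try:
                if ipaddress.ip_address(ip) in net:
                    return ip
            except ValueError:
                continue  # skip strings that are not IP addresses
    return None


def select_publish_ip(
    iface_ips: List[str],
    host_ips: List[str],
    prefer_cidrs: List[str],
    env_ip: Optional[str] = None,
    fallback: str = "127.0.0.1",  # stands in for the default-route lookup
) -> str:
    """Explicit override, then interface match, then hostname match, then fallback."""
    if env_ip:
        return env_ip
    return (
        pick_by_cidrs(iface_ips, prefer_cidrs)
        or pick_by_cidrs(host_ips, prefer_cidrs)
        or fallback
    )
```

For instance, with `prefer_cidrs = ["10.0.0.0/8", "172.31.0.0/16"]`, an interface list of `["172.31.5.9", "10.0.1.4"]` selects `10.0.1.4`, even though the `172.31` address is listed first.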
BIN
src/agent/dist/argus-agent
vendored
Executable file
Binary file not shown.
@ -12,8 +12,6 @@ VENV_DIR="$BUILD_ROOT/venv"
|
||||
|
||||
AGENT_BUILD_IMAGE="${AGENT_BUILD_IMAGE:-python:3.11-slim-bullseye}"
|
||||
AGENT_BUILD_USE_DOCKER="${AGENT_BUILD_USE_DOCKER:-1}"
|
||||
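# By default, ignore proxies inside the container so that a corporate intranet proxy unreachable from the Docker network does not break pip (set to 0 to disable)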
|
||||
AGENT_BUILD_IGNORE_PROXY="${AGENT_BUILD_IGNORE_PROXY:-1}"
|
||||
USED_DOCKER=0
|
||||
|
||||
run_host_build() {
|
||||
@ -73,7 +71,6 @@ run_docker_build() {
|
||||
pass_env_if_set http_proxy
|
||||
pass_env_if_set https_proxy
|
||||
pass_env_if_set no_proxy
|
||||
pass_env_if_set AGENT_BUILD_IGNORE_PROXY
|
||||
|
||||
build_script=$(cat <<'INNER'
|
||||
set -euo pipefail
|
||||
@ -85,10 +82,6 @@ rm -rf build dist
|
||||
mkdir -p build/pyinstaller dist
|
||||
python3 -m venv --copies build/venv
|
||||
source build/venv/bin/activate
|
||||
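# If proxies are to be ignored, unset the common proxy and pip mirror environment variables so an unreachable proxy cannot break the build inside the container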
|
||||
if [ "${AGENT_BUILD_IGNORE_PROXY:-1}" = "1" ]; then
|
||||
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY PIP_INDEX_URL PIP_EXTRA_INDEX_URL PIP_TRUSTED_HOST
|
||||
fi
|
||||
pip install --upgrade pip
|
||||
pip install .
|
||||
pip install pyinstaller==6.6.0
|
||||
|
||||
@ -9,21 +9,21 @@ RUN apt-get update && \
|
||||
apt-get install -y wget supervisor net-tools inetutils-ping vim ca-certificates passwd && \
|
||||
apt-get clean && rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Set the Alertmanager version (keep in sync with the local offline package)
|
||||
# Set the Alertmanager version
|
||||
ARG ALERTMANAGER_VERSION=0.28.1
|
||||
|
||||
# Build from the offline package bundled in the repository (no network access needed)
|
||||
COPY src/alert/alertmanager/build/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz /tmp/
|
||||
RUN tar xvf /tmp/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz -C /tmp && \
|
||||
mv /tmp/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64 /usr/local/alertmanager && \
|
||||
rm -f /tmp/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
|
||||
# Download and unpack the Alertmanager binary
|
||||
RUN wget https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz && \
|
||||
tar xvf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz && \
|
||||
mv alertmanager-${ALERTMANAGER_VERSION}.linux-amd64 /usr/local/alertmanager && \
|
||||
rm alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
|
||||
|
||||
ENV ALERTMANAGER_BASE_PATH=/private/argus/alert/alertmanager
|
||||
|
||||
ARG ARGUS_BUILD_UID=2133
|
||||
ARG ARGUS_BUILD_GID=2015
|
||||
ENV ARGUS_BUILD_UID=${ARGUS_BUILD_UID}
|
||||
ENV ARGUS_BUILD_GID=${ARGUS_BUILD_GID}
|
||||
ARG ARGUS_UID=2133
|
||||
ARG ARGUS_GID=2015
|
||||
ENV ARGUS_UID=${ARGUS_UID}
|
||||
ENV ARGUS_GID=${ARGUS_GID}
|
||||
|
||||
RUN mkdir -p /usr/share/alertmanager && \
|
||||
mkdir -p ${ALERTMANAGER_BASE_PATH} && \
|
||||
@ -31,31 +31,18 @@ RUN mkdir -p /usr/share/alertmanager && \
|
||||
rm -rf /alertmanager && \
|
||||
ln -s ${ALERTMANAGER_BASE_PATH} /alertmanager
|
||||
|
||||
# Ensure the ubuntu account exists and uses ARGUS_BUILD_UID/GID
|
||||
RUN set -eux; \
|
||||
# Ensure a group with the target GID exists; if not, first try changing the ubuntu group to that GID, otherwise create a new group
|
||||
if getent group "${ARGUS_BUILD_GID}" >/dev/null; then \
|
||||
:; \
|
||||
else \
|
||||
if getent group ubuntu >/dev/null; then \
|
||||
groupmod -g "${ARGUS_BUILD_GID}" ubuntu || true; \
|
||||
else \
|
||||
groupadd -g "${ARGUS_BUILD_GID}" ubuntu || groupadd -g "${ARGUS_BUILD_GID}" argus || true; \
|
||||
fi; \
|
||||
fi; \
|
||||
# Create or adjust the ubuntu user
|
||||
if id ubuntu >/dev/null 2>&1; then \
|
||||
# Set the primary group to the target GID (a numeric GID is accepted)
|
||||
usermod -g "${ARGUS_BUILD_GID}" ubuntu || true; \
|
||||
# If the target UID is not already taken, update the ubuntu user's UID
|
||||
if [ "$(id -u ubuntu)" != "${ARGUS_BUILD_UID}" ] && ! getent passwd "${ARGUS_BUILD_UID}" >/dev/null; then \
|
||||
usermod -u "${ARGUS_BUILD_UID}" ubuntu || true; \
|
||||
fi; \
|
||||
else \
|
||||
useradd -m -s /bin/bash -u "${ARGUS_BUILD_UID}" -g "${ARGUS_BUILD_GID}" ubuntu || true; \
|
||||
fi; \
|
||||
# Change ownership of the key directories to the ubuntu UID/GID
|
||||
chown -R "${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}" /usr/share/alertmanager /alertmanager ${ALERTMANAGER_BASE_PATH} /private/argus/etc /usr/local/bin || true
|
||||
# Create the alertmanager user (UID/GID are customizable)
|
||||
# Create the alertmanager group
|
||||
RUN groupadd -g ${ARGUS_GID} alertmanager
|
||||
|
||||
# Create the alertmanager user and assign its group
|
||||
RUN useradd -M -s /usr/sbin/nologin -u ${ARGUS_UID} -g ${ARGUS_GID} alertmanager
|
||||
|
||||
RUN chown -R alertmanager:alertmanager /usr/share/alertmanager && \
|
||||
chown -R alertmanager:alertmanager /alertmanager && \
|
||||
chown -R alertmanager:alertmanager ${ALERTMANAGER_BASE_PATH} && \
|
||||
chown -R alertmanager:alertmanager /private/argus/etc && \
|
||||
chown -R alertmanager:alertmanager /usr/local/bin
|
||||
|
||||
# Configure the intranet apt mirror (if the intranet option is set)
|
||||
RUN if [ "$USE_INTRANET" = "true" ]; then \
|
||||
@ -99,3 +86,4 @@ EXPOSE 9093
|
||||
|
||||
# Use supervisor as the entrypoint
|
||||
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]
|
||||
|
||||
|
||||
Binary file not shown.
@ -5,9 +5,9 @@ docker pull ubuntu:24.04
|
||||
source src/alert/tests/.env
|
||||
|
||||
docker build \
|
||||
--build-arg ARGUS_BUILD_UID=${ARGUS_BUILD_UID} \
|
||||
--build-arg ARGUS_BUILD_GID=${ARGUS_BUILD_GID} \
|
||||
--build-arg ARGUS_UID=${ARGUS_UID} \
|
||||
--build-arg ARGUS_GID=${ARGUS_GID} \
|
||||
-f src/alert/alertmanager/build/Dockerfile \
|
||||
-t argus-alertmanager:latest .
|
||||
|
||||
docker save -o argus-alertmanager-latest.tar argus-alertmanager:latest
|
||||
docker save -o argus-alertmanager-latest.tar argus-alertmanager:latest
|
||||
@ -1,22 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# Download the Alertmanager offline package into this directory so the Docker build can COPY it
|
||||
# Usage:
|
||||
# ./fetch-dist.sh [version]
|
||||
# Example:
|
||||
# ./fetch-dist.sh 0.28.1
|
||||
|
||||
VER="${1:-0.28.1}"
|
||||
OUT="alertmanager-${VER}.linux-amd64.tar.gz"
|
||||
URL="https://github.com/prometheus/alertmanager/releases/download/v${VER}/${OUT}"
|
||||
|
||||
if [[ -f "$OUT" ]]; then
|
||||
echo "[INFO] $OUT already exists, skip download"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
echo "[INFO] Downloading $URL"
|
||||
curl -fL --retry 3 --connect-timeout 10 -o "$OUT" "$URL"
|
||||
echo "[OK] Saved to $(pwd)/$OUT"
|
||||
|
||||
@ -7,8 +7,10 @@ ALERTMANAGER_BASE_PATH=${ALERTMANAGER_BASE_PATH:-/private/argus/alert/alertmanag
|
||||
|
||||
echo "[INFO] Alertmanager base path: ${ALERTMANAGER_BASE_PATH}"
|
||||
|
||||
# Use /etc/alertmanager/alertmanager.yml inside the container as the config file, avoiding permission problems from writing to the mounted volume
|
||||
echo "[INFO] Using /etc/alertmanager/alertmanager.yml as configuration"
|
||||
# Generate the configuration file
|
||||
echo "[INFO] Generating Alertmanager configuration file..."
|
||||
sed "s|\${ALERTMANAGER_BASE_PATH}|${ALERTMANAGER_BASE_PATH}|g" \
|
||||
/etc/alertmanager/alertmanager.yml > ${ALERTMANAGER_BASE_PATH}/alertmanager.yml
|
||||
|
||||
|
||||
# Record the container IP address
|
||||
|
||||
@ -6,7 +6,7 @@ user=root
|
||||
|
||||
[program:alertmanager]
|
||||
command=/usr/local/bin/start-am-supervised.sh
|
||||
user=ubuntu
|
||||
user=alertmanager
|
||||
stdout_logfile=/var/log/supervisor/alertmanager.log
|
||||
stderr_logfile=/var/log/supervisor/alertmanager_error.log
|
||||
autorestart=true
|
||||
|
||||
@ -1,5 +1,5 @@
|
||||
DATA_ROOT=/home/argus/tmp/private/argus
|
||||
ARGUS_BUILD_UID=1048
|
||||
ARGUS_BUILD_GID=1048
|
||||
ARGUS_UID=1048
|
||||
ARGUS_GID=1048
|
||||
|
||||
USE_INTRANET=false
|
||||
|
||||
@ -1,5 +1,5 @@
|
||||
DATA_ROOT=/home/argus/tmp/private/argus
|
||||
ARGUS_BUILD_UID=1048
|
||||
ARGUS_BUILD_GID=1048
|
||||
ARGUS_UID=1048
|
||||
ARGUS_GID=1048
|
||||
|
||||
USE_INTRANET=false
|
||||
USE_INTRANET=false
|
||||
|
||||
19
src/alert/tests/data/alertmanager/alertmanager.yml
Normal file
@ -0,0 +1,19 @@
|
||||
global:
|
||||
resolve_timeout: 5m
|
||||
|
||||
route:
|
||||
group_by: ['alertname', 'instance'] # Grouping: alerts with the same alertname + instance are merged
|
||||
group_wait: 30s # After the first alert, wait 30s for other alerts in the same group so they are sent together
|
||||
group_interval: 5m # After a group changes, wait at least 5 minutes before sending again
|
||||
repeat_interval: 3h # Re-send an unchanged alert every 3 hours
|
||||
receiver: 'null'
|
||||
|
||||
receivers:
|
||||
- name: 'null'
|
||||
|
||||
inhibit_rules:
|
||||
- source_match:
|
||||
severity: 'critical' # While a critical alert is firing
|
||||
target_match:
|
||||
severity: 'warning' # Suppress warning alerts for the same instance
|
||||
equal: ['instance']
|
||||
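The `inhibit_rules` entry above mutes a `warning` alert while a `critical` alert with the same `instance` label is firing. The Python sketch below is a simplified model of that rule, intended only to make the matching explicit. It is not Alertmanager's implementation, and `is_inhibited` and the module-level constants are hypothetical names.

```python
from typing import Dict, List, Mapping, Sequence

Alert = Dict[str, str]  # an alert reduced to its label set

SOURCE_MATCH = {"severity": "critical"}
TARGET_MATCH = {"severity": "warning"}
EQUAL = ("instance",)


def _matches(labels: Mapping[str, str], wanted: Mapping[str, str]) -> bool:
    return all(labels.get(key) == value for key, value in wanted.items())


def is_inhibited(
    target: Alert,
    firing: List[Alert],
    source_match: Mapping[str, str] = SOURCE_MATCH,
    target_match: Mapping[str, str] = TARGET_MATCH,
    equal: Sequence[str] = EQUAL,
) -> bool:
    """True if some firing source alert mutes `target` under this one rule."""
    if not _matches(target, target_match):
        return False
    for source in firing:
        if source is target:
            continue
        same_equal_labels = all(source.get(k) == target.get(k) for k in equal)
        if _matches(source, source_match) and same_equal_labels:
            return True
    return False
```

With the two test alerts sent by `03_alertmanager_add_alert.sh` (`NodeDown` critical and `HighCPU` warning, both on `node-1`), this model marks the warning as inhibited.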
0
src/alert/tests/data/alertmanager/nflog
Normal file
0
src/alert/tests/data/alertmanager/silences
Normal file
1
src/alert/tests/data/etc/alertmanager.alert.argus.com
Normal file
@ -0,0 +1 @@
|
||||
172.18.0.2
|
||||
@ -4,15 +4,15 @@ services:
|
||||
context: ../../../
|
||||
dockerfile: src/alert/alertmanager/build/Dockerfile
|
||||
args:
|
||||
ARGUS_BUILD_UID: ${ARGUS_BUILD_UID:-2133}
|
||||
ARGUS_BUILD_GID: ${ARGUS_BUILD_GID:-2015}
|
||||
ARGUS_UID: ${ARGUS_UID:-2133}
|
||||
ARGUS_GID: ${ARGUS_GID:-2015}
|
||||
USE_INTRANET: ${USE_INTRANET:-false}
|
||||
image: argus-alertmanager:latest
|
||||
container_name: argus-alertmanager
|
||||
environment:
|
||||
- ALERTMANAGER_BASE_PATH=/private/argus/alert/alertmanager
|
||||
- ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
|
||||
- ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
|
||||
- ARGUS_UID=${ARGUS_UID:-2133}
|
||||
- ARGUS_GID=${ARGUS_GID:-2015}
|
||||
ports:
|
||||
- "${ARGUS_PORT:-9093}:9093"
|
||||
volumes:
|
||||
|
||||
19
src/alert/tests/scripts/01_bootstrap.sh
Normal file
@ -0,0 +1,19 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
root="$(cd "$(dirname "${BASH_SOURCE[0]}")/../" && pwd)"
|
||||
project_root="$(cd "$root/../../.." && pwd)"
|
||||
|
||||
source "$project_root/scripts/common/build_user.sh"
|
||||
load_build_user
|
||||
|
||||
# Create the new private directory structure (based on the argus directory layout)
|
||||
echo "[INFO] Creating private directory structure for supervisor-based containers..."
|
||||
mkdir -p "$root/private/argus/alert/alertmanager"
|
||||
mkdir -p "$root/private/argus/etc/"
|
||||
|
||||
# Set permissions on the data directories
|
||||
echo "[INFO] Setting permissions for data directories..."
|
||||
chown -R "${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}" "$root/private/argus/alert/alertmanager" 2>/dev/null || true
|
||||
chown -R "${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}" "$root/private/argus/etc" 2>/dev/null || true
|
||||
|
||||
echo "[INFO] Supervisor-based containers will manage their own scripts and configurations"
|
||||
10
src/alert/tests/scripts/02_up.sh
Normal file
@ -0,0 +1,10 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
compose_cmd="docker compose"
|
||||
if ! $compose_cmd version >/dev/null 2>&1; then
|
||||
if command -v docker-compose >/dev/null 2>&1; then compose_cmd="docker-compose"; else
|
||||
echo "Docker Compose is required; install it and try again" >&2; exit 1; fi
|
||||
fi
|
||||
$compose_cmd -p alert-mvp up -d --remove-orphans
|
||||
echo "[OK] Services started: Alertmanager http://localhost:9093"
|
||||
106
src/alert/tests/scripts/03_alertmanager_add_alert.sh
Normal file
@ -0,0 +1,106 @@
|
||||
#!/bin/bash
|
||||
set -euo pipefail
|
||||
|
||||
# ==========================================================
|
||||
# Alertmanager test script
|
||||
# ==========================================================
|
||||
|
||||
ALERTMANAGER_URL="http://localhost:9093"
|
||||
TEST_ALERT_NAME_CRITICAL="NodeDown"
|
||||
TEST_ALERT_NAME_WARNING="HighCPU"
|
||||
TMP_LOG="/tmp/test-alertmanager.log"
|
||||
|
||||
# Wait parameters
|
||||
am_wait_attempts=30
|
||||
am_wait_interval=2
|
||||
|
||||
GREEN="\033[1;32m"
|
||||
RED="\033[1;31m"
|
||||
YELLOW="\033[1;33m"
|
||||
RESET="\033[0m"
|
||||
|
||||
# ==========================================================
|
||||
# Function definitions
|
||||
# ==========================================================
|
||||
|
||||
wait_for_alertmanager() {
|
||||
local attempt=1
|
||||
echo "[INFO] Waiting for Alertmanager to start..."
|
||||
while (( attempt <= am_wait_attempts )); do
|
||||
if curl -fsS "${ALERTMANAGER_URL}/api/v2/status" >/dev/null 2>&1; then
|
||||
echo -e "${GREEN}[OK] Alertmanager is ready (attempt=${attempt}/${am_wait_attempts})${RESET}"
|
||||
return 0
|
||||
fi
|
||||
echo "[..] Alertmanager is not ready yet (${attempt}/${am_wait_attempts})"
|
||||
sleep "${am_wait_interval}"
|
||||
(( attempt++ ))
|
||||
done
|
||||
echo -e "${RED}[ERROR] Alertmanager still not ready after ${am_wait_attempts} attempts${RESET}"
|
||||
return 1
|
||||
}
|
||||
|
||||
log_step() {
|
||||
echo -e "${YELLOW}==== $1 ====${RESET}"
|
||||
}
|
||||
|
||||
# ==========================================================
|
||||
# Main flow
|
||||
# ==========================================================
|
||||
|
||||
log_step "Alertmanager test started"
|
||||
echo "[INFO] Alertmanager URL: $ALERTMANAGER_URL"
|
||||
|
||||
# Step 1: Wait for Alertmanager to start
|
||||
wait_for_alertmanager
|
||||
|
||||
# Step 2: Fire a critical test alert
|
||||
echo "[INFO] Sending critical test alert..."
|
||||
# Test curl directly: under `set -e`, a separate `$?` check would never reach its error branch.
if curl -fsS -X POST "${ALERTMANAGER_URL}/api/v2/alerts" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "labels": {
        "alertname": "'"${TEST_ALERT_NAME_CRITICAL}"'",
        "instance": "node-1",
        "severity": "critical"
      },
      "annotations": {
        "summary": "Node node-1 is down"
      }
    }
  ]' \
  -o "$TMP_LOG"; then
  echo -e "${GREEN}[OK] Critical test alert sent successfully${RESET}"
else
  echo -e "${RED}[ERROR] Failed to send the critical alert!${RESET}"
  cat "$TMP_LOG"
  exit 1
fi
|
||||
|
||||
# Step 3: Fire a warning test alert
|
||||
echo "[INFO] Sending warning test alert..."
|
||||
# Test curl directly: under `set -e`, a separate `$?` check would never reach its error branch.
if curl -fsS -X POST "${ALERTMANAGER_URL}/api/v2/alerts" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "labels": {
        "alertname": "'"${TEST_ALERT_NAME_WARNING}"'",
        "instance": "node-1",
        "severity": "warning"
      },
      "annotations": {
        "summary": "CPU usage on node node-1 is too high"
      }
    }
  ]' \
  -o "$TMP_LOG"; then
  echo -e "${GREEN}[OK] Warning test alert sent successfully${RESET}"
else
  echo -e "${RED}[ERROR] Failed to send the warning alert!${RESET}"
  cat "$TMP_LOG"
  exit 1
fi
|
||||
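The two POST blocks in the script above differ only in alert name, severity, and summary. The following is a minimal sketch of how they could share one helper. `build_alert_payload` and `send_test_alert` are hypothetical names that do not exist in the repository, and the sketch assumes the values contain no characters that need JSON escaping.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print an Alertmanager v2 payload holding one alert.
# Assumption: name, severity, and summary need no JSON escaping.
build_alert_payload() {
  local name="$1" severity="$2" summary="$3"
  printf '[{"labels":{"alertname":"%s","instance":"node-1","severity":"%s"},"annotations":{"summary":"%s"}}]\n' \
    "$name" "$severity" "$summary"
}

# Hypothetical helper: POST one alert to <url>/api/v2/alerts.
# curl reads the request body from stdin via `-d @-`.
send_test_alert() {
  local url="$1"; shift
  build_alert_payload "$@" | curl -fsS -X POST "${url}/api/v2/alerts" \
    -H "Content-Type: application/json" -d @- -o /dev/null
}

# Example call (needs a reachable Alertmanager):
#   send_test_alert "http://localhost:9093" "NodeDown" "critical" "Node node-1 is down"
```

Keeping the payload builder separate from the HTTP call makes the JSON easy to inspect without a running Alertmanager.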
71
src/alert/tests/scripts/04_query_alerts.sh
Normal file
@ -0,0 +1,71 @@
|
||||
#!/bin/bash
|
||||
set -euo pipefail
|
||||
|
||||
# ==========================================================
|
||||
# Alertmanager test script (with startup wait)
|
||||
# ==========================================================
|
||||
|
||||
ALERTMANAGER_URL="http://localhost:9093"
|
||||
TEST_ALERT_NAME_CRITICAL="NodeDown"
|
||||
TEST_ALERT_NAME_WARNING="HighCPU"
|
||||
TMP_LOG="/tmp/test-alertmanager.log"
|
||||
|
||||
# Wait parameters
|
||||
am_wait_attempts=30
|
||||
am_wait_interval=2
|
||||
|
||||
GREEN="\033[1;32m"
|
||||
RED="\033[1;31m"
|
||||
YELLOW="\033[1;33m"
|
||||
RESET="\033[0m"
|
||||
|
||||
# ==========================================================
|
||||
# Function definitions
|
||||
# ==========================================================
|
||||
|
||||
wait_for_alertmanager() {
  local attempt=1
  echo "[INFO] Waiting for Alertmanager to start..."
  while (( attempt <= am_wait_attempts )); do
    if curl -fsS "${ALERTMANAGER_URL}/api/v2/status" >/dev/null 2>&1; then
      echo -e "${GREEN}[OK] Alertmanager is ready (attempt=${attempt}/${am_wait_attempts})${RESET}"
      return 0
    fi
    echo "[..] Alertmanager not ready yet (${attempt}/${am_wait_attempts})"
    sleep "${am_wait_interval}"
    (( attempt++ ))
  done
  echo -e "${RED}[ERROR] Alertmanager still not ready after ${am_wait_attempts} attempts${RESET}"
  return 1
}
|
||||
|
||||
log_step() {
|
||||
echo -e "${YELLOW}==== $1 ====${RESET}"
|
||||
}
|
||||
|
||||
# ==========================================================
|
||||
# Main flow
|
||||
# ==========================================================
|
||||
|
||||
log_step "Querying Alertmanager for current alerts"
echo "[INFO] Alertmanager URL: $ALERTMANAGER_URL"

# Step 1: wait for Alertmanager to start
wait_for_alertmanager

# Step 2: query the current alert list
echo "[INFO] Querying current alerts..."
sleep 1
curl -fsS "${ALERTMANAGER_URL}/api/v2/alerts" | jq '.' || {
  echo -e "${RED}[WARN] Could not parse the returned JSON; check that jq is installed${RESET}"
  curl -s "${ALERTMANAGER_URL}/api/v2/alerts"
}

# Step 3: check that the alerts include NodeDown
if curl -fsS "${ALERTMANAGER_URL}/api/v2/alerts" | grep -q "${TEST_ALERT_NAME_CRITICAL}"; then
  echo -e "${GREEN}✅ Test passed: Alertmanager received alert ${TEST_ALERT_NAME_CRITICAL}${RESET}"
else
  echo -e "${RED}❌ Test failed: alert ${TEST_ALERT_NAME_CRITICAL} not found${RESET}"
fi

log_step "Test finished"
|
||||
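The Step 3 check in the script above greps the raw response for the alert name, so any occurrence of the string counts as a match, and the failure branch still exits with status 0. The following is a minimal sketch of a stricter check that stays jq-free. `has_alert` is a hypothetical helper, and the alert name is assumed to contain no regex metacharacters.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed only if the JSON on stdin contains an
# "alertname" label whose value is exactly $1.
# Assumption: $1 has no regex metacharacters.
has_alert() {
  local name="$1"
  grep -Eq "\"alertname\"[[:space:]]*:[[:space:]]*\"${name}\""
}

# Example call (needs a reachable Alertmanager):
#   curl -fsS "http://localhost:9093/api/v2/alerts" | has_alert "NodeDown" || exit 1
```

Because the function returns a non-zero status when the alert is missing, a caller can propagate the failure with `|| exit 1`.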
21
src/alert/tests/scripts/05_down.sh
Normal file
@ -0,0 +1,21 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
compose_cmd="docker compose"
|
||||
if ! $compose_cmd version >/dev/null 2>&1; then
  if command -v docker-compose >/dev/null 2>&1; then compose_cmd="docker-compose"; else
    echo "Docker Compose is required; please install it and retry" >&2; exit 1; fi
fi
$compose_cmd -p alert-mvp down
echo "[OK] All containers stopped"
|
||||
|
||||
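# Clean up the private directory
# The working directory is already the tests root from the cd at the top of the
# script; repeating a relative cd can resolve against the wrong base when $0 is
# a relative path, so it is not repeated here.
echo "[INFO] Cleaning up the private directory..."
if [ -d "private" ]; then
  # Remove the private directory and everything in it
  rm -rf private
  echo "[OK] private directory removed"
else
  echo "[INFO] private directory does not exist; nothing to clean"
fi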
|
||||
105
src/alert/tests/scripts/e2e_test.sh
Normal file
@ -0,0 +1,105 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
echo "======================================="
|
||||
echo "ARGUS Alert System End-to-End Test"
|
||||
echo "======================================="
|
||||
echo ""
|
||||
|
||||
# Record the test start time
|
||||
test_start_time=$(date +%s)
|
||||
|
||||
# Function: wait for services to become ready
wait_for_services() {
  echo "[INFO] Waiting for all services to be ready..."
  local max_attempts=${SERVICE_WAIT_ATTEMPTS:-120}
  local attempt=1

  while [ $attempt -le $max_attempts ]; do
    if curl -fs http://localhost:9093/api/v2/status >/dev/null 2>&1; then
      echo "[OK] All services are ready!"
      return 0
    fi
    echo " Waiting for services... ($attempt/$max_attempts)"
    sleep 5
    ((attempt++))
  done

  echo "[ERROR] Services not ready after $max_attempts attempts"
  return 1
}

# Function: print a test step banner
show_step() {
  echo ""
  echo "🔄 Step $1: $2"
  echo "----------------------------------------"
}

# Function: verify the result of the previous step
verify_step() {
  if [ $? -eq 0 ]; then
    echo "✅ $1 - SUCCESS"
  else
    echo "❌ $1 - FAILED"
    exit 1
  fi
}
|
||||
|
||||
# Start the end-to-end test
|
||||
show_step "1" "Bootstrap - Initialize environment"
|
||||
./scripts/01_bootstrap.sh
|
||||
verify_step "Bootstrap"
|
||||
|
||||
show_step "2" "Startup - Start all services"
|
||||
./scripts/02_up.sh
|
||||
verify_step "Service startup"
|
||||
|
||||
# Wait until the services are fully ready
|
||||
wait_for_services || exit 1
|
||||
|
||||
# Send alert data
|
||||
show_step "3" "Add alerts - Send test alerts to Alertmanager"
|
||||
./scripts/03_alertmanager_add_alert.sh
|
||||
verify_step "Send test alerts"
|
||||
|
||||
# Query alert data
|
||||
show_step "4" "Verify data - Query Alertmanager"
|
||||
./scripts/04_query_alerts.sh
|
||||
verify_step "Data verification"
|
||||
|
||||
|
||||
# Check service health
|
||||
show_step "Health" "Check service health"
|
||||
echo "[INFO] Checking service health..."
|
||||
|
||||
# Check Alertmanager status
|
||||
if curl -fs "http://localhost:9093/api/v2/status" >/dev/null 2>&1; then
|
||||
am_status="available"
|
||||
echo "✅ Alertmanager status: $am_status"
|
||||
else
|
||||
am_status="unavailable"
|
||||
echo "⚠️ Alertmanager status: $am_status"
|
||||
fi
|
||||
verify_step "Service health check"
|
||||
|
||||
# Clean up the environment
|
||||
show_step "5" "Cleanup - Stop all services"
|
||||
./scripts/05_down.sh
|
||||
verify_step "Service cleanup"
|
||||
|
||||
# Compute the total test time
|
||||
test_end_time=$(date +%s)
|
||||
total_time=$((test_end_time - test_start_time))
|
||||
|
||||
echo ""
|
||||
echo "======================================="
|
||||
echo "🎉 END-TO-END TEST COMPLETED SUCCESSFULLY!"
|
||||
echo "======================================="
|
||||
echo "📊 Test Summary:"
|
||||
echo " • Total time: ${total_time}s"
|
||||
echo " • Alertmanager status: $am_status"
|
||||
echo " • All services started and stopped successfully"
|
||||
echo ""
|
||||
echo "✅ The ARGUS Alert system is working correctly!"
|
||||
echo ""
|
||||
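`verify_step` in the script above inspects `$?` after each step, but the script also sets `set -euo pipefail`, so a failing step script ends the run before `verify_step` is reached. In the health check, `$?` reflects the last `echo`, so that check always reports success. The following is a minimal sketch of a wrapper that runs the command inside an `if`, where errexit does not apply. `run_step` is a hypothetical name, not part of the repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper: run a command, report the outcome, and return the
# command's own exit status. The command runs as an `if` condition, so
# `set -e` does not abort the script before the FAILED line is printed.
run_step() {
  local name="$1"; shift
  if "$@"; then
    echo "✅ ${name} - SUCCESS"
  else
    local rc=$?
    echo "❌ ${name} - FAILED (exit ${rc})"
    return "$rc"
  fi
}

# Example calls:
#   run_step "Bootstrap" ./scripts/01_bootstrap.sh || exit 1
#   run_step "Service health check" curl -fs "http://localhost:9093/api/v2/status" -o /dev/null || exit 1
```

Passing the command as arguments keeps the status check and the command in one place, so the result cannot be overwritten by an intervening `echo`.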
@ -1,113 +0,0 @@
|
||||
#!/bin/bash
|
||||
# verify_alertmanager.sh
|
||||
# Verifies after deployment that the Prometheus to Alertmanager communication path works
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
#=============================
|
||||
# Basic configuration
|
||||
#=============================
|
||||
PROM_URL="${PROM_URL:-http://prom.metric.argus.com:9090}"
|
||||
ALERT_URL="${ALERT_URL:-http://alertmanager.alert.argus.com:9093}"
|
||||
# TODO: adjust the rules directory to match the actual deployment environment
DATA_ROOT="${DATA_ROOT:-/private/argus}"
RULE_DIR="$DATA_ROOT/metric/prometheus/rules"
|
||||
TMP_RULE="/tmp/test_rule.yml"
|
||||
|
||||
#=============================
|
||||
# Helper functions
|
||||
#=============================
|
||||
GREEN="\033[32m"; RED="\033[31m"; YELLOW="\033[33m"; RESET="\033[0m"
|
||||
|
||||
log_info() { echo -e "${YELLOW}[INFO]${RESET} $1"; }
|
||||
log_success() { echo -e "${GREEN}[OK]${RESET} $1"; }
|
||||
log_error() { echo -e "${RED}[ERROR]${RESET} $1"; }
|
||||
|
||||
fail_exit() { log_error "$1"; exit 1; }
|
||||
|
||||
#=============================
|
||||
# Step 1: check that Alertmanager is reachable
#=============================
log_info "Checking Alertmanager status..."
if curl -sSf "${ALERT_URL}/api/v2/status" >/dev/null 2>&1; then
  log_success "Alertmanager is healthy (${ALERT_URL})"
else
  fail_exit "Cannot reach Alertmanager; check the port mapping and container status."
fi
|
||||
|
||||
#=============================
|
||||
# Step 2: send a manual test alert
#=============================
log_info "Sending manual test alert..."
curl -s -XPOST "${ALERT_URL}/api/v2/alerts" -H "Content-Type: application/json" -d '[
  {
    "labels": {
      "alertname": "ManualTestAlert",
      "severity": "info"
    },
    "annotations": {
      "summary": "This is a test alert from deploy verification"
    },
    "startsAt": "'$(date -Iseconds)'"
  }
]' >/dev/null && log_success "Test alert sent to Alertmanager"
|
||||
|
||||
#=============================
|
||||
# Step 3: check that the Prometheus configuration includes Alertmanager
#=============================
log_info "Checking whether Prometheus has Alertmanager configured..."
if curl -s "${PROM_URL}/api/v1/status/config" | grep -q "alertmanagers"; then
  log_success "Prometheus has an Alertmanager target configured"
else
  fail_exit "Prometheus has no Alertmanager configured; check prometheus.yml"
fi
|
||||
|
||||
#=============================
|
||||
# Step 4: create and load a test alerting rule
#=============================
log_info "Creating temporary test rule ${TMP_RULE} ..."
cat <<EOF > "${TMP_RULE}"
groups:
  - name: deploy-verify-group
    rules:
      - alert: DeployVerifyAlert
        expr: vector(1)
        labels:
          severity: warning
        annotations:
          summary: "Deployment verification alert"
EOF

mkdir -p "${RULE_DIR}"
cp "${TMP_RULE}" "${RULE_DIR}/test_rule.yml"

log_info "Reloading Prometheus to load the new rule..."
if curl -s -X POST "${PROM_URL}/-/reload" >/dev/null; then
  log_success "Prometheus reloaded its rules"
else
  fail_exit "Prometheus reload failed; check that the API is reachable."
fi
|
||||
|
||||
#=============================
|
||||
# Step 5: wait, then verify that Alertmanager received the alert
#=============================
log_info "Waiting for the alert to fire (about 5 seconds)..."
sleep 5

if curl -s "${ALERT_URL}/api/v2/alerts" | grep -q "DeployVerifyAlert"; then
  log_success "Prometheus → Alertmanager alert path verified"
else
  fail_exit "DeployVerifyAlert was not found in Alertmanager; check the network or configuration."
fi
|
||||
|
||||
#=============================
|
||||
# Step 6: clean up the test rule
#=============================
log_info "Cleaning up the temporary test rule..."
rm -f "${RULE_DIR}/test_rule.yml" "${TMP_RULE}"

curl -s -X POST "${PROM_URL}/-/reload" >/dev/null \
  && log_success "Verification rule removed from Prometheus" \
  || log_error "Prometheus reload during cleanup failed; please verify manually."

log_success "Deployment verification passed! Prometheus ↔ Alertmanager communication is working."
|
||||
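Step 5 of the script above waits a fixed 5 seconds before looking for `DeployVerifyAlert`. That may be too short when Prometheus evaluates rules on a longer interval. The following is a minimal sketch of a polling alternative. `wait_until` and the `alert_present` function in the usage comment are hypothetical names, not part of the repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: retry a command up to <attempts> times, sleeping
# <interval> seconds between tries. Returns 0 on the first success and 1
# if every attempt fails.
wait_until() {
  local attempts="$1" interval="$2"; shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    # No sleep after the final failed attempt.
    if (( i < attempts )); then
      sleep "$interval"
    fi
  done
  return 1
}

# Example usage (alert_present is a hypothetical check function):
#   alert_present() { curl -s "${ALERT_URL}/api/v2/alerts" | grep -q "$1"; }
#   wait_until 15 2 alert_present "DeployVerifyAlert" || fail_exit "alert not received"
```

Polling bounds the total wait while letting the check finish as soon as the alert arrives.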
@ -26,7 +26,6 @@ RUN apt-get update && \
|
||||
apt-get install -y \
|
||||
bind9 \
|
||||
bind9utils \
|
||||
dnsutils \
|
||||
bind9-doc \
|
||||
supervisor \
|
||||
net-tools \
|
||||
|
||||
1
src/bundle/cpu-node-bundle/.gitignore
vendored
@ -1 +0,0 @@
|
||||
.build*/
|
||||
@ -1,33 +0,0 @@
|
||||
FROM ubuntu:22.04
|
||||
|
||||
ARG ARGUS_BUILD_UID=2133
|
||||
ARG ARGUS_BUILD_GID=2015
|
||||
|
||||
ENV DEBIAN_FRONTEND=noninteractive \
|
||||
TZ=Asia/Shanghai \
|
||||
ARGUS_LOGS_WORLD_WRITABLE=1
|
||||
|
||||
RUN set -eux; \
|
||||
apt-get update; \
|
||||
apt-get install -y --no-install-recommends \
|
||||
ca-certificates curl wget iproute2 iputils-ping net-tools jq tzdata \
|
||||
cron procps supervisor vim less tar gzip python3; \
|
||||
rm -rf /var/lib/apt/lists/*; \
|
||||
ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
|
||||
|
||||
WORKDIR /
|
||||
|
||||
# Offline fluent-bit assets and bundle tarball are staged by the build script
|
||||
COPY node-bootstrap.sh /usr/local/bin/node-bootstrap.sh
|
||||
COPY health-watcher.sh /usr/local/bin/health-watcher.sh
|
||||
COPY private/start-fluent-bit.sh /private/start-fluent-bit.sh
|
||||
COPY private/etc /private/etc
|
||||
COPY private/packages /private/packages
|
||||
COPY bundle/ /bundle/
|
||||
|
||||
RUN chmod +x /usr/local/bin/node-bootstrap.sh /usr/local/bin/health-watcher.sh /private/start-fluent-bit.sh || true; \
|
||||
mkdir -p /logs/train /logs/infer /buffers /opt/argus-metric; \
|
||||
if [ "${ARGUS_LOGS_WORLD_WRITABLE}" = "1" ]; then chmod 1777 /logs/train /logs/infer || true; else chmod 755 /logs/train /logs/infer || true; fi; \
|
||||
chmod 770 /buffers || true
|
||||
|
||||
ENTRYPOINT ["/usr/local/bin/node-bootstrap.sh"]
|
||||
@ -1,59 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# health-watcher.sh (CPU node bundle)
|
||||
# Periodically runs check_health.sh and restart_unhealthy.sh for self-healing inside the node container.
|
||||
|
||||
INSTALL_ROOT="/opt/argus-metric"
|
||||
INTERVAL="${HEALTH_WATCH_INTERVAL:-60}"
|
||||
VER_DIR="${1:-}"
|
||||
|
||||
log(){ echo "[HEALTH-WATCHER] $*"; }
|
||||
|
||||
resolve_ver_dir() {
|
||||
local dir=""
|
||||
if [[ -n "${VER_DIR:-}" && -d "$VER_DIR" ]]; then
|
||||
dir="$VER_DIR"
|
||||
elif [[ -L "$INSTALL_ROOT/current" ]]; then
|
||||
dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
|
||||
fi
|
||||
if [[ -z "$dir" ]]; then
|
||||
dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
|
||||
fi
|
||||
echo "$dir"
|
||||
}
|
||||
|
||||
main() {
|
||||
log "starting with interval=${INTERVAL}s"
|
||||
local dir
|
||||
dir="$(resolve_ver_dir)"
|
||||
if [[ -z "$dir" || ! -d "$dir" ]]; then
|
||||
log "no valid install dir found under $INSTALL_ROOT; exiting"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
local chk="$dir/check_health.sh"
|
||||
local rst="$dir/restart_unhealthy.sh"
|
||||
|
||||
if [[ ! -x "$chk" && ! -x "$rst" ]]; then
|
||||
log "neither check_health.sh nor restart_unhealthy.sh is executable under $dir; exiting"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
log "watching install dir: $dir"
|
||||
|
||||
while :; do
|
||||
if [[ -x "$chk" ]]; then
|
||||
log "running check_health.sh"
|
||||
"$chk" >> "$dir/.health_check.watch.log" 2>&1 || log "check_health.sh reported issues (see .health_check.watch.log)"
|
||||
fi
|
||||
if [[ -x "$rst" ]]; then
|
||||
log "running restart_unhealthy.sh"
|
||||
"$rst" >> "$dir/.restart.watch.log" 2>&1 || log "restart_unhealthy.sh reported issues (see .restart.watch.log)"
|
||||
fi
|
||||
sleep "$INTERVAL"
|
||||
done
|
||||
}
|
||||
|
||||
main "$@"
|
||||
|
||||
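`resolve_ver_dir` in the script above falls back to the newest directory under `versions/` by parsing `ls` output. The following is a minimal sketch of the same selection using a glob, which avoids depending on how `ls` formats names. `latest_version_dir` is a hypothetical name, the sketch assumes GNU `sort -V` is available, and it assumes directory names contain no newlines.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the highest-versioned directory under
# <root>/versions, or nothing if there is none.
# Assumptions: GNU sort with -V (version sort); no newlines in names.
latest_version_dir() {
  local root="$1" d
  local -a dirs=()
  for d in "$root"/versions/*/; do
    # An unmatched glob stays literal, so test before collecting.
    if [[ -d "$d" ]]; then
      dirs+=("${d%/}")
    fi
  done
  if (( ${#dirs[@]} == 0 )); then
    return 0
  fi
  printf '%s\n' "${dirs[@]}" | sort -V | tail -n1
}

# Example call:
#   dir="$(latest_version_dir /opt/argus-metric)"
```

Version sort matters here because a plain lexical sort would rank `1.9.3` above `1.10.0`.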
@ -1,131 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
echo "[BOOT] CPU node bundle starting"
|
||||
|
||||
INSTALL_ROOT="/opt/argus-metric"
|
||||
BUNDLE_DIR="/bundle"
|
||||
STATE_DIR_BASE="/private/argus/agent"
|
||||
|
||||
mkdir -p "$INSTALL_ROOT" "$STATE_DIR_BASE" /logs/train /logs/infer /buffers || true
|
||||
|
||||
# Ensure world-writable logs dir with sticky bit (align with deployment_new policy)
|
||||
if [[ "${ARGUS_LOGS_WORLD_WRITABLE:-1}" == "1" ]]; then
|
||||
chmod 1777 /logs/train /logs/infer || true
|
||||
else
|
||||
chmod 755 /logs/train /logs/infer || true
|
||||
fi
|
||||
chmod 770 /buffers || true
|
||||
|
||||
installed_ok=0
|
||||
|
||||
# 1) already installed?
|
||||
if [[ -L "$INSTALL_ROOT/current" && -d "$INSTALL_ROOT/current" ]]; then
|
||||
echo "[BOOT] client already installed at $INSTALL_ROOT/current"
|
||||
else
|
||||
# 2) try local bundle first (argus-metric_*.tar.gz)
|
||||
tarball=$(ls -1 "$BUNDLE_DIR"/argus-metric_*.tar.gz 2>/dev/null | head -1 || true)
|
||||
if [[ -n "${tarball:-}" ]]; then
|
||||
echo "[BOOT] installing from local bundle: $(basename "$tarball")"
|
||||
tmp=$(mktemp -d)
|
||||
tar -xzf "$tarball" -C "$tmp"
|
||||
# locate root containing version.json
|
||||
root="$tmp"
|
||||
if [[ ! -f "$root/version.json" ]]; then
|
||||
sub=$(find "$tmp" -mindepth 1 -maxdepth 1 -type d | head -n1 || true)
|
||||
[[ -n "$sub" && -f "$sub/version.json" ]] && root="$sub"
|
||||
fi
|
||||
if [[ ! -f "$root/version.json" ]]; then
|
||||
echo "[BOOT][WARN] version.json not found in bundle; fallback to FTP"
|
||||
else
|
||||
ver=$(sed -n 's/.*"version"\s*:\s*"\([^"]\+\)".*/\1/p' "$root/version.json" | head -n1)
|
||||
if [[ -z "$ver" ]]; then
|
||||
echo "[BOOT][WARN] failed to parse version from version.json; fallback to FTP"
|
||||
else
|
||||
target_root="$INSTALL_ROOT"
|
||||
version_dir="$target_root/versions/$ver"
|
||||
mkdir -p "$version_dir"
|
||||
shopt -s dotglob
|
||||
mv "$root"/* "$version_dir/" 2>/dev/null || true
|
||||
shopt -u dotglob
|
||||
if [[ -f "$version_dir/install.sh" ]]; then
|
||||
chmod +x "$version_dir/install.sh" 2>/dev/null || true
|
||||
(
|
||||
export AUTO_START_DCGM="0" # N/A on CPU
|
||||
cd "$version_dir" && ./install.sh "$version_dir"
|
||||
)
|
||||
echo "$ver" > "$target_root/LATEST_VERSION" 2>/dev/null || true
|
||||
ln -sfn "$version_dir" "$target_root/current" 2>/dev/null || true
|
||||
if [[ -L "$target_root/current" && -d "$target_root/current" ]]; then
|
||||
installed_ok=1
|
||||
echo "[BOOT] local bundle install OK: version=$ver"
|
||||
else
|
||||
echo "[BOOT][WARN] current symlink not present after install; will rely on healthcheck to confirm"
|
||||
fi
|
||||
else
|
||||
echo "[BOOT][WARN] install.sh missing under $version_dir; fallback to FTP"
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
|
||||
# 3) fallback: use FTP setup if not installed
|
||||
if [[ ! -L "$INSTALL_ROOT/current" && "$installed_ok" -eq 0 ]]; then
|
||||
echo "[BOOT] fallback to FTP setup"
|
||||
if [[ -z "${FTPIP:-}" || -z "${FTP_USER:-}" || -z "${FTP_PASSWORD:-}" ]]; then
|
||||
echo "[BOOT][ERROR] FTP variables not set (FTPIP/FTP_USER/FTP_PASSWORD)" >&2
|
||||
exit 1
|
||||
fi
|
||||
curl -u "$FTP_USER:$FTP_PASSWORD" -fsSL "ftp://$FTPIP:21/setup.sh" -o /tmp/setup.sh
|
||||
chmod +x /tmp/setup.sh
|
||||
/tmp/setup.sh --server "$FTPIP" --user "$FTP_USER" --password "$FTP_PASSWORD" --port 21
|
||||
fi
|
||||
fi
|
||||
|
||||
# 4) ensure argus-agent is running (best-effort)
|
||||
if ! pgrep -x argus-agent >/dev/null 2>&1; then
|
||||
echo "[BOOT] starting argus-agent (not detected)"
|
||||
setsid /usr/local/bin/argus-agent >/var/log/argus-agent.log 2>&1 < /dev/null &
|
||||
fi
|
||||
|
||||
# 5) post-install selfcheck and state
|
||||
ver_dir=""
|
||||
if [[ -L "$INSTALL_ROOT/current" ]]; then
|
||||
ver_dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
|
||||
fi
|
||||
if [[ -z "$ver_dir" ]]; then
|
||||
ver_dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
|
||||
fi
|
||||
|
||||
if [[ -n "$ver_dir" && -x "$ver_dir/check_health.sh" ]]; then
|
||||
echo "[BOOT] running initial health check: $ver_dir/check_health.sh"
|
||||
if "$ver_dir/check_health.sh" >> "$ver_dir/.health_check.init.log" 2>&1; then
|
||||
echo "[BOOT] initial health check completed (see $ver_dir/.health_check.init.log)"
|
||||
else
|
||||
echo "[BOOT][WARN] initial health check reported issues (see $ver_dir/.health_check.init.log)"
|
||||
fi
|
||||
else
|
||||
echo "[BOOT][WARN] initial health check skipped (script missing: $ver_dir/check_health.sh)"
|
||||
fi
|
||||
|
||||
host="$(hostname)"
|
||||
state_dir="$STATE_DIR_BASE/${host}"
|
||||
mkdir -p "$state_dir" 2>/dev/null || true
|
||||
for i in {1..60}; do
|
||||
if [[ -s "$state_dir/node.json" ]]; then
|
||||
echo "[BOOT] node state present: $state_dir/node.json"
|
||||
break
|
||||
fi
|
||||
sleep 2
|
||||
done
|
||||
|
||||
# 6) spawn health watcher (best-effort, non-blocking)
|
||||
if command -v /usr/local/bin/health-watcher.sh >/dev/null 2>&1; then
|
||||
echo "[BOOT] starting health watcher for $ver_dir"
|
||||
setsid /usr/local/bin/health-watcher.sh "${ver_dir:-}" >/var/log/health-watcher.log 2>&1 < /dev/null || true &
|
||||
else
|
||||
echo "[BOOT][WARN] health-watcher.sh not found; skip health watcher"
|
||||
fi
|
||||
|
||||
echo "[BOOT] ready; entering sleep"
|
||||
exec sleep infinity
|
||||
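The bootstrap script above extracts the bundle version from `version.json` with a `sed` expression that uses `\s` and `\+`, both GNU extensions. The following is a minimal sketch of the same extraction written with POSIX character classes. `parse_version` is a hypothetical name, and the sketch assumes the `"version"` key and its value sit on one line, as the original expression also requires.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the first "version" string value found in the
# JSON file given as $1, or nothing if there is none.
# Assumption: key and value are on the same line.
parse_version() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"][^"]*\)".*/\1/p' "$1" | head -n1
}

# Example call:
#   ver="$(parse_version /tmp/bundle/version.json)"
```

This is a text match, not a JSON parser; since the image already installs `jq`, `jq -r '.version'` would be the more robust choice when the file layout is not under your control.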
1
src/bundle/gpu-node-bundle/.gitignore
vendored
@ -1 +0,0 @@
|
||||
.build*/
|
||||
@ -1,44 +0,0 @@
|
||||
ARG CUDA_VER=12.2.2
|
||||
FROM nvidia/cuda:${CUDA_VER}-runtime-ubuntu22.04
|
||||
|
||||
ARG CLIENT_VER=0.0.0
|
||||
ARG BUNDLE_DATE=00000000
|
||||
|
||||
LABEL org.opencontainers.image.title="argus-sys-metric-test-node-bundle-gpu" \
|
||||
org.opencontainers.image.description="GPU node bundle with embedded Argus client artifact" \
|
||||
org.opencontainers.image.version="${CLIENT_VER}" \
|
||||
org.opencontainers.image.revision_date="${BUNDLE_DATE}" \
|
||||
maintainer="Argus"
|
||||
|
||||
ENV DEBIAN_FRONTEND=noninteractive \
|
||||
TZ=Asia/Shanghai \
|
||||
ARGUS_LOGS_WORLD_WRITABLE=1 \
|
||||
ES_HOST=es.log.argus.com \
|
||||
ES_PORT=9200 \
|
||||
CLUSTER=local \
|
||||
RACK=dev
|
||||
|
||||
RUN set -eux; \
|
||||
apt-get update; \
|
||||
apt-get install -y --no-install-recommends \
|
||||
ca-certificates curl wget iproute2 iputils-ping net-tools jq tzdata cron procps vim less \
|
||||
tar gzip; \
|
||||
rm -rf /var/lib/apt/lists/*; \
|
||||
ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
|
||||
|
||||
WORKDIR /
|
||||
|
||||
# Expect staged build context to provide these directories/files
|
||||
COPY bundle/ /bundle/
|
||||
COPY node-bootstrap.sh /usr/local/bin/node-bootstrap.sh
|
||||
COPY health-watcher.sh /usr/local/bin/health-watcher.sh
|
||||
COPY private/start-fluent-bit.sh /private/start-fluent-bit.sh
|
||||
COPY private/etc /private/etc
|
||||
COPY private/packages /private/packages
|
||||
|
||||
RUN chmod +x /usr/local/bin/node-bootstrap.sh /usr/local/bin/health-watcher.sh /private/start-fluent-bit.sh || true; \
|
||||
mkdir -p /logs/train /logs/infer /buffers /opt/argus-metric; \
|
||||
chmod 1777 /logs/train /logs/infer || true; \
|
||||
chmod 770 /buffers || true
|
||||
|
||||
ENTRYPOINT ["/usr/local/bin/node-bootstrap.sh"]
|
||||
@ -1,59 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# health-watcher.sh (GPU bundle)
|
||||
# Periodically runs check_health.sh and restart_unhealthy.sh for self-healing inside the GPU node container.
|
||||
|
||||
INSTALL_ROOT="/opt/argus-metric"
|
||||
INTERVAL="${HEALTH_WATCH_INTERVAL:-60}"
|
||||
VER_DIR="${1:-}"
|
||||
|
||||
log(){ echo "[HEALTH-WATCHER] $*"; }
|
||||
|
||||
resolve_ver_dir() {
|
||||
local dir=""
|
||||
if [[ -n "${VER_DIR:-}" && -d "$VER_DIR" ]]; then
|
||||
dir="$VER_DIR"
|
||||
elif [[ -L "$INSTALL_ROOT/current" ]]; then
|
||||
dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
|
||||
fi
|
||||
if [[ -z "$dir" ]]; then
|
||||
dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
|
||||
fi
|
||||
echo "$dir"
|
||||
}
|
||||
|
||||
main() {
|
||||
log "starting with interval=${INTERVAL}s"
|
||||
local dir
|
||||
dir="$(resolve_ver_dir)"
|
||||
if [[ -z "$dir" || ! -d "$dir" ]]; then
|
||||
log "no valid install dir found under $INSTALL_ROOT; exiting"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
local chk="$dir/check_health.sh"
|
||||
local rst="$dir/restart_unhealthy.sh"
|
||||
|
||||
if [[ ! -x "$chk" && ! -x "$rst" ]]; then
|
||||
log "neither check_health.sh nor restart_unhealthy.sh is executable under $dir; exiting"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
log "watching install dir: $dir"
|
||||
|
||||
while :; do
|
||||
if [[ -x "$chk" ]]; then
|
||||
log "running check_health.sh"
|
||||
"$chk" >> "$dir/.health_check.watch.log" 2>&1 || log "check_health.sh reported issues (see .health_check.watch.log)"
|
||||
fi
|
||||
if [[ -x "$rst" ]]; then
|
||||
log "running restart_unhealthy.sh"
|
||||
"$rst" >> "$dir/.restart.watch.log" 2>&1 || log "restart_unhealthy.sh reported issues (see .restart.watch.log)"
|
||||
fi
|
||||
sleep "$INTERVAL"
|
||||
done
|
||||
}
|
||||
|
||||
main "$@"
|
||||
|
||||
@ -1,135 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
echo "[BOOT] GPU node bundle starting"
|
||||
|
||||
INSTALL_ROOT="/opt/argus-metric"
|
||||
BUNDLE_DIR="/bundle"
|
||||
STATE_DIR_BASE="/private/argus/agent"
|
||||
|
||||
mkdir -p "$INSTALL_ROOT" "$STATE_DIR_BASE" /logs/train /logs/infer /buffers || true
|
||||
|
||||
# Ensure world-writable logs dir with sticky bit (align with deployment_new policy)
|
||||
if [[ "${ARGUS_LOGS_WORLD_WRITABLE:-1}" == "1" ]]; then
|
||||
chmod 1777 /logs/train /logs/infer || true
|
||||
else
|
||||
chmod 755 /logs/train /logs/infer || true
|
||||
fi
|
||||
chmod 770 /buffers || true
|
||||
|
||||
installed_ok=0
|
||||
|
||||
# 1) already installed?
|
||||
if [[ -L "$INSTALL_ROOT/current" && -d "$INSTALL_ROOT/current" ]]; then
|
||||
echo "[BOOT] client already installed at $INSTALL_ROOT/current"
|
||||
else
|
||||
# 2) try local bundle first (argus-metric_*.tar.gz)
|
||||
tarball=$(ls -1 "$BUNDLE_DIR"/argus-metric_*.tar.gz 2>/dev/null | head -1 || true)
|
||||
if [[ -n "${tarball:-}" ]]; then
|
||||
echo "[BOOT] installing from local bundle: $(basename "$tarball")"
|
||||
tmp=$(mktemp -d)
|
||||
tar -xzf "$tarball" -C "$tmp"
|
||||
# locate root containing version.json
|
||||
root="$tmp"
|
||||
if [[ ! -f "$root/version.json" ]]; then
|
||||
sub=$(find "$tmp" -mindepth 1 -maxdepth 1 -type d | head -n1 || true)
|
||||
[[ -n "$sub" && -f "$sub/version.json" ]] && root="$sub"
|
||||
fi
|
||||
if [[ ! -f "$root/version.json" ]]; then
|
||||
echo "[BOOT][WARN] version.json not found in bundle; fallback to FTP"
|
||||
else
|
||||
ver=$(sed -n 's/.*"version"\s*:\s*"\([^"]\+\)".*/\1/p' "$root/version.json" | head -n1)
|
||||
if [[ -z "$ver" ]]; then
|
||||
echo "[BOOT][WARN] failed to parse version from version.json; fallback to FTP"
|
||||
else
|
||||
target_root="$INSTALL_ROOT"
|
||||
version_dir="$target_root/versions/$ver"
|
||||
mkdir -p "$version_dir"
|
||||
shopt -s dotglob
|
||||
mv "$root"/* "$version_dir/" 2>/dev/null || true
|
||||
shopt -u dotglob
|
||||
if [[ -f "$version_dir/install.sh" ]]; then
|
||||
chmod +x "$version_dir/install.sh" 2>/dev/null || true
|
||||
(
|
||||
export AUTO_START_DCGM="${AUTO_START_DCGM:-1}"
|
||||
export DCGM_EXPORTER_DISABLE_PROFILING="${DCGM_EXPORTER_DISABLE_PROFILING:-1}"
|
||||
export DCGM_EXPORTER_LISTEN="${DCGM_EXPORTER_LISTEN:-:9400}"
|
||||
cd "$version_dir" && ./install.sh "$version_dir"
|
||||
)
|
||||
echo "$ver" > "$target_root/LATEST_VERSION" 2>/dev/null || true
|
||||
ln -sfn "$version_dir" "$target_root/current" 2>/dev/null || true
|
||||
if [[ -L "$target_root/current" && -d "$target_root/current" ]]; then
|
||||
installed_ok=1
|
||||
echo "[BOOT] local bundle install OK: version=$ver"
|
||||
else
|
||||
echo "[BOOT][WARN] current symlink not present after install; will rely on healthcheck to confirm"
|
||||
fi
|
||||
else
|
||||
echo "[BOOT][WARN] install.sh missing under $version_dir; fallback to FTP"
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
|
||||
# 3) fallback: use FTP setup if not installed
|
||||
if [[ ! -L "$INSTALL_ROOT/current" && "$installed_ok" -eq 0 ]]; then
|
||||
echo "[BOOT] fallback to FTP setup"
|
||||
if [[ -z "${FTPIP:-}" || -z "${FTP_USER:-}" || -z "${FTP_PASSWORD:-}" ]]; then
|
||||
echo "[BOOT][ERROR] FTP variables not set (FTPIP/FTP_USER/FTP_PASSWORD)" >&2
|
||||
exit 1
|
||||
fi
|
||||
curl -u "$FTP_USER:$FTP_PASSWORD" -fsSL "ftp://$FTPIP:21/setup.sh" -o /tmp/setup.sh
|
||||
chmod +x /tmp/setup.sh
|
||||
/tmp/setup.sh --server "$FTPIP" --user "$FTP_USER" --password "$FTP_PASSWORD" --port 21
|
||||
fi
|
||||
fi
|
||||
|
||||
# 4) ensure argus-agent is running (best-effort)
|
||||
if ! pgrep -x argus-agent >/dev/null 2>&1; then
|
||||
echo "[BOOT] starting argus-agent (not detected)"
|
||||
setsid /usr/local/bin/argus-agent >/var/log/argus-agent.log 2>&1 < /dev/null &
|
||||
fi
|
||||
|
||||
# 5) post-install selfcheck (run once) and state
|
||||
# prefer current version dir; fallback to first version under /opt/argus-metric/versions
|
||||
ver_dir=""
|
||||
if [[ -L "$INSTALL_ROOT/current" ]]; then
|
||||
ver_dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
|
||||
fi
|
||||
if [[ -z "$ver_dir" ]]; then
|
||||
# pick the latest by name (semver-like); best-effort
|
||||
ver_dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
|
||||
fi
|
||||
|
||||
if [[ -n "$ver_dir" && -x "$ver_dir/check_health.sh" ]]; then
|
||||
echo "[BOOT] running initial health check: $ver_dir/check_health.sh"
|
||||
if "$ver_dir/check_health.sh" >> "$ver_dir/.health_check.init.log" 2>&1; then
|
||||
echo "[BOOT] initial health check completed (see $ver_dir/.health_check.init.log)"
|
||||
else
|
||||
echo "[BOOT][WARN] initial health check reported issues (see $ver_dir/.health_check.init.log)"
|
||||
fi
|
||||
else
|
||||
echo "[BOOT][WARN] initial health check skipped (script missing: $ver_dir/check_health.sh)"
|
||||
fi
|
||||
|
||||
host="$(hostname)"
|
||||
state_dir="$STATE_DIR_BASE/${host}"
|
||||
mkdir -p "$state_dir" 2>/dev/null || true
|
||||
for i in {1..60}; do
|
||||
if [[ -s "$state_dir/node.json" ]]; then
|
||||
echo "[BOOT] node state present: $state_dir/node.json"
|
||||
break
|
||||
fi
|
||||
sleep 2
|
||||
done
|
||||
|
||||
# 6) spawn health watcher (best-effort, non-blocking)
|
||||
if command -v /usr/local/bin/health-watcher.sh >/dev/null 2>&1; then
|
||||
echo "[BOOT] starting health watcher for $ver_dir"
|
||||
setsid /usr/local/bin/health-watcher.sh "${ver_dir:-}" >/var/log/health-watcher.log 2>&1 < /dev/null || true &
|
||||
else
|
||||
echo "[BOOT][WARN] health-watcher.sh not found; skip health watcher"
|
||||
fi
|
||||
|
||||
echo "[BOOT] ready; entering sleep"
|
||||
exec sleep infinity
|
||||
@ -22,7 +22,8 @@
|
||||
[PARSER]
|
||||
Name timestamp_parser
|
||||
Format regex
|
||||
Regex ^(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:Z|[+-]\d{2}:?\d{2}))\s+(?<level>\w+)\s+(?<message>.*)$
|
||||
Regex ^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?<level>\w+)\s+(?<message>.*)$
|
||||
Time_Key timestamp
|
||||
Time_Format %Y-%m-%dT%H:%M:%S%z
|
||||
Time_Format %Y-%m-%d %H:%M:%S
|
||||
Time_Offset +0800
|
||||
Time_Keep On
|
||||
|
||||
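The parser hunk above shows two `Regex` variants: one for ISO 8601 timestamps with a zone offset, and one for space-separated local timestamps paired with `Time_Offset +0800`. The following is a minimal sketch for checking sample log lines against the second variant outside Fluent Bit. `matches_local_timestamp` is a hypothetical name, and the pattern is a POSIX ERE approximation of the parser regex, not the regex engine Fluent Bit itself uses.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed if the line on stdin looks like
# "YYYY-MM-DD HH:MM:SS LEVEL message".
# Assumption: POSIX ERE approximation of the parser's Regex line.
matches_local_timestamp() {
  grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}[[:space:]]+[[:alnum:]_]+[[:space:]]+.*$'
}

# Example call:
#   echo "2025-01-02 03:04:05 INFO job started" | matches_local_timestamp && echo "would parse"
```

A line carrying an ISO 8601 timestamp such as `2025-01-02T03:04:05+08:00` does not match this pattern, so the two variants accept different log formats.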
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@ -1,109 +1,47 @@
|
||||
#!/bin/bash
|
||||
set -euo pipefail
|
||||
|
||||
echo "[INFO] Starting Fluent Bit setup in Ubuntu container (offline-first)..."
|
||||
echo "[INFO] Starting Fluent Bit setup in Ubuntu container..."
|
||||
|
||||
# Install required tools
|
||||
echo "[INFO] Installing required packages..."
|
||||
export DEBIAN_FRONTEND=noninteractive
|
||||
apt-get update -qq
|
||||
apt-get install -y -qq curl
|
||||
|
||||
# Stage bundle to /tmp (read-only mount under /private)
|
||||
echo "[INFO] Staging fluent-bit bundle..."
|
||||
rm -rf /tmp/flb && mkdir -p /tmp/flb
|
||||
cp -r /private/etc /tmp/flb/
|
||||
mkdir -p /tmp/flb/packages
|
||||
cp -r /private/packages/* /tmp/flb/packages/ 2>/dev/null || true
|
||||
# Extract the bundle to /tmp
|
||||
echo "[INFO] Extracting fluent-bit bundle..."
|
||||
cp -r /private/etc /tmp
|
||||
cp -r /private/packages /tmp
|
||||
cd /tmp
|
||||
|
||||
# Helper: check and install a local deb if not already satisfied
|
||||
ensure_lib() {
|
||||
local soname="$1"; shift
|
||||
local pattern="$1"; shift
|
||||
if ldconfig -p 2>/dev/null | grep -q "$soname"; then
|
||||
echo "[OK] $soname already present"
|
||||
return 0
|
||||
fi
|
||||
local deb="$(ls /tmp/flb/packages/$pattern 2>/dev/null | head -n1 || true)"
|
||||
if [[ -n "$deb" ]]; then
|
||||
echo "[INFO] Installing local dependency: $(basename "$deb")"
|
||||
dpkg -i "$deb" >/dev/null 2>&1 || true
|
||||
else
|
||||
echo "[WARN] Local deb for $soname not found (pattern=$pattern)"
|
||||
fi
|
||||
if ! ldconfig -p 2>/dev/null | grep -q "$soname"; then
|
||||
echo "[WARN] $soname still missing after local install; attempting apt fallback"
|
||||
apt-get update -qq || true
|
||||
case "$soname" in
|
||||
libpq.so.5) apt-get install -y -qq libpq5 || true ;;
|
||||
libyaml-0.so.2) apt-get install -y -qq libyaml-0-2 || true ;;
|
||||
esac
|
||||
fi
|
||||
ldconfig 2>/dev/null || true
|
||||
}
|
||||
|
||||
# Offline-first: satisfy runtime deps from local debs, fallback to apt only if necessary
|
||||
ensure_lib "libpq.so.5" "libpq5_*_amd64.deb"
|
||||
ensure_lib "libyaml-0.so.2" "libyaml-0-2_*_amd64.deb"
|
||||
ensure_lib "libsasl2.so.2" "libsasl2-2_*_amd64.deb"
|
||||
ensure_lib "libldap-2.5.so.0" "libldap-2.5-0_*_amd64.deb"
|
||||
|
||||
# Install fluent-bit main package from local bundle
|
||||
FLB_DEB="$(ls /tmp/flb/packages/fluent-bit_*_amd64.deb 2>/dev/null | head -n1 || true)"
|
||||
if [[ -z "$FLB_DEB" ]]; then
|
||||
echo "[ERROR] fluent-bit deb not found under /private/packages" >&2
|
||||
exit 1
|
||||
fi
|
||||
echo "[INFO] Installing Fluent Bit: $(basename "$FLB_DEB")"
|
||||
dpkg -i "$FLB_DEB" >/dev/null 2>&1 || true
|
||||
|
||||
# If dpkg reported unresolved dependencies, try apt -f only as last resort
|
||||
if ! command -v /opt/fluent-bit/bin/fluent-bit >/dev/null 2>&1; then
|
||||
echo "[WARN] fluent-bit binary missing after dpkg; attempting apt --fix-broken"
|
||||
apt-get install -f -y -qq || true
|
||||
fi
|
||||
|
||||
# Ensure runtime library dependencies are satisfied (libsasl2, libldap are required via libpq/curl)
|
||||
MISSING=$(ldd /opt/fluent-bit/bin/fluent-bit 2>/dev/null | awk '/not found/{print $1}' | xargs -r echo || true)
|
||||
if [[ -n "$MISSING" ]]; then
|
||||
echo "[WARN] missing shared libs: $MISSING"
|
||||
apt-get update -qq || true
|
||||
apt-get install -y -qq libsasl2-2 libldap-2.5-0 || true
|
||||
apt-get install -f -y -qq || true
|
||||
fi
|
||||
# Install Fluent Bit from the deb package
|
||||
echo "[INFO] Installing Fluent Bit from deb package..."
|
||||
dpkg -i /tmp/packages/fluent-bit_3.1.9_amd64.deb || true
|
||||
apt-get install -f -y -qq # resolve missing dependencies
|
||||
|
||||
# Verify that Fluent Bit can run
|
||||
echo "[INFO] Fluent Bit version:"
|
||||
/opt/fluent-bit/bin/fluent-bit --version || { echo "[ERROR] fluent-bit not installed or libraries missing" >&2; exit 1; }
|
||||
/opt/fluent-bit/bin/fluent-bit --version
|
||||
|
||||
# Place configuration
|
||||
# Create the configuration directory
|
||||
mkdir -p /etc/fluent-bit
|
||||
cp -r /tmp/flb/etc/* /etc/fluent-bit/
|
||||
cp -r /tmp/etc/* /etc/fluent-bit/
|
||||
|
||||
# Create logs/buffers dirs
|
||||
# Create the log and buffer directories
|
||||
mkdir -p /logs/train /logs/infer /buffers
|
||||
chmod 755 /logs/train /logs/infer /buffers
|
||||
|
||||
# Control log directory permissions: host bind-mounted directories default to 1777 (can be disabled via environment variable)
|
||||
: "${ARGUS_LOGS_WORLD_WRITABLE:=1}"
|
||||
if [[ "${ARGUS_LOGS_WORLD_WRITABLE}" == "1" ]]; then
|
||||
chmod 1777 /logs/train /logs/infer || true
|
||||
else
|
||||
chmod 755 /logs/train /logs/infer || true
|
||||
fi
|
||||
|
||||
# The buffer directory is used only by the process and is not world-writable
|
||||
chmod 770 /buffers || true
|
||||
|
||||
# Set directory ownership to fluent-bit (does not affect the 1777 sticky bit)
|
||||
chown -R fluent-bit:fluent-bit /logs /buffers 2>/dev/null || true
|
||||
|
||||
# Wait for Elasticsearch via bash /dev/tcp to avoid curl dependency
|
||||
echo "[INFO] Waiting for Elasticsearch to be ready (tcp ${ES_HOST}:${ES_PORT})..."
|
||||
for i in $(seq 1 120); do
|
||||
if exec 3<>/dev/tcp/${ES_HOST}/${ES_PORT}; then
|
||||
exec 3<&- 3>&-
|
||||
echo "[INFO] Elasticsearch is ready"
|
||||
break
|
||||
fi
|
||||
[[ $i -eq 120 ]] && { echo "[ERROR] ES not reachable" >&2; exit 1; }
|
||||
sleep 1
|
||||
# Wait for Elasticsearch to be ready
|
||||
echo "[INFO] Waiting for Elasticsearch to be ready..."
|
||||
while ! curl -fs http://${ES_HOST}:${ES_PORT}/_cluster/health >/dev/null 2>&1; do
|
||||
echo " Waiting for ES at ${ES_HOST}:${ES_PORT}..."
|
||||
sleep 5
|
||||
done
|
||||
echo "[INFO] Elasticsearch is ready"
|
||||
|
||||
# Start Fluent Bit
|
||||
echo "[INFO] Starting Fluent Bit with configuration from /etc/fluent-bit/"
|
||||
echo "[INFO] Command: /opt/fluent-bit/bin/fluent-bit --config=/etc/fluent-bit/fluent-bit.conf"
|
||||
exec /opt/fluent-bit/bin/fluent-bit --config=/etc/fluent-bit/fluent-bit.conf
|
||||
exec /opt/fluent-bit/bin/fluent-bit \
|
||||
--config=/etc/fluent-bit/fluent-bit.conf
|
||||
|
||||
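The readiness wait in this script appears in two forms: a `curl` poll of `/_cluster/health` and a plain TCP probe through bash's `/dev/tcp`, capped at 120 one-second attempts. As a rough illustration of the TCP-probe variant only, here is a minimal Python sketch. The function name `wait_for_tcp` and its parameters are hypothetical and not part of the repository; note that a successful connect only shows the port is open, not that the cluster is healthy.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, attempts: int = 120, delay: float = 1.0) -> bool:
    """Return True once a TCP connection succeeds, False after all attempts fail.

    Hypothetical helper mirroring the bounded /dev/tcp loop in the shell script.
    """
    for attempt in range(1, attempts + 1):
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure
            with socket.create_connection((host, port), timeout=delay):
                return True
        except OSError:
            if attempt < attempts:
                time.sleep(delay)
    return False
```

A caller would typically exit non-zero when the function returns False, which matches the `exit 1` branch of the bounded shell loop.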
@ -32,42 +32,3 @@ fi
|
||||
|
||||
echo "[OK] Initialization complete: private/argus/log/{elasticsearch,kibana}"
|
||||
echo "[INFO] Fluent-bit files should be in fluent-bit/ directory"
|
||||
|
||||
# Prepare Fluent Bit offline dependencies (copy debs from metric all-in-one-full to ../fluent-bit/build/packages)
|
||||
FLB_BUILD_PACKAGES_DIR="$root/../fluent-bit/build/packages"
|
||||
mkdir -p "$FLB_BUILD_PACKAGES_DIR"
|
||||
for deb in \
|
||||
"$project_root/src/metric/client-plugins/all-in-one-full/plugins/fluent-bit/bin/libyaml-0-2_"*_amd64.deb \
|
||||
"$project_root/src/metric/client-plugins/all-in-one-full/plugins/fluent-bit/bin/libpq5_"*_amd64.deb ; do
|
||||
if ls $deb >/dev/null 2>&1; then
|
||||
for f in $deb; do
|
||||
base="$(basename "$f")"
|
||||
if [[ ! -f "$FLB_BUILD_PACKAGES_DIR/$base" ]]; then
|
||||
cp "$f" "$FLB_BUILD_PACKAGES_DIR/"
|
||||
echo " [+] copied $base"
|
||||
fi
|
||||
done
|
||||
fi
|
||||
done
|
||||
|
||||
# Extra: unpack the required dependencies (libsasl2/ldap) from all-in-one-full's ubuntu22/curl.tar.gz to allow offline installation
|
||||
CURLOPT_TAR="$project_root/src/metric/client-plugins/all-in-one-full/deps/ubuntu22/curl.tar.gz"
|
||||
if [[ -f "$CURLOPT_TAR" ]]; then
|
||||
tmpdir=$(mktemp -d)
|
||||
if tar -xzf "$CURLOPT_TAR" -C "$tmpdir" 2>/dev/null; then
|
||||
for p in \
|
||||
libsasl2-2_*_amd64.deb \
|
||||
libsasl2-modules-db_*_amd64.deb \
|
||||
libldap-2.5-0_*_amd64.deb \
|
||||
libidn2-0_*_amd64.deb \
|
||||
libbrotli1_*_amd64.deb \
|
||||
libssl3_*_amd64.deb ; do
|
||||
src=$(ls "$tmpdir"/curl/$p 2>/dev/null | head -n1 || true)
|
||||
if [[ -n "$src" ]]; then
|
||||
base="$(basename "$src")"
|
||||
[[ -f "$FLB_BUILD_PACKAGES_DIR/$base" ]] || { cp "$src" "$FLB_BUILD_PACKAGES_DIR/" && echo " [+] staged $base"; }
|
||||
fi
|
||||
done
|
||||
fi
|
||||
rm -rf "$tmpdir"
|
||||
fi
|
||||
|
||||
@ -28,11 +28,11 @@ fi
|
||||
docker exec "$container_name" mkdir -p /logs/train /logs/infer
|
||||
|
||||
# Write training logs (host01)
|
||||
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=1 loss=1.23 model=bert\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
|
||||
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=2 loss=1.15 model=bert\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
|
||||
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=1 loss=1.23 model=bert\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
|
||||
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=2 loss=1.15 model=bert\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
|
||||
|
||||
# Write inference logs (host01)
|
||||
docker exec "$container_name" sh -c "printf '%s ERROR [host01] inference failed on batch=1\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/infer/infer-demo.log"
|
||||
docker exec "$container_name" sh -c "printf '%s ERROR [host01] inference failed on batch=1\n' \"\$(date '+%F %T')\" >> /logs/infer/infer-demo.log"
|
||||
docker exec "$container_name" sh -c "cat <<'STACK' >> /logs/infer/infer-demo.log
|
||||
Traceback (most recent call last):
|
||||
File \"inference.py\", line 15, in <module>
|
||||
|
||||
@ -28,13 +28,13 @@ fi
|
||||
docker exec "$container_name" mkdir -p /logs/train /logs/infer
|
||||
|
||||
# Write training logs (host02)
|
||||
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=1 loss=1.45 model=gpt\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
|
||||
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=2 loss=1.38 model=gpt\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
|
||||
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=3 loss=1.32 model=gpt\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
|
||||
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=1 loss=1.45 model=gpt\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
|
||||
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=2 loss=1.38 model=gpt\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
|
||||
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=3 loss=1.32 model=gpt\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
|
||||
|
||||
# Write inference logs (host02)
|
||||
docker exec "$container_name" sh -c "printf '%s WARN [host02] inference slow on batch=5 latency=2.3s\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/infer/infer-demo.log"
|
||||
docker exec "$container_name" sh -c "printf '%s INFO [host02] inference completed batch=6 latency=0.8s\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/infer/infer-demo.log"
|
||||
docker exec "$container_name" sh -c "printf '%s WARN [host02] inference slow on batch=5 latency=2.3s\n' \"\$(date '+%F %T')\" >> /logs/infer/infer-demo.log"
|
||||
docker exec "$container_name" sh -c "printf '%s INFO [host02] inference completed batch=6 latency=0.8s\n' \"\$(date '+%F %T')\" >> /logs/infer/infer-demo.log"
|
||||
|
||||
echo "[OK] Test logs written into the host02 container via docker exec:"
|
||||
echo " - /logs/train/train-demo.log"
|
||||
|
||||
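Two timestamp formats appear in these log-writing hunks: `date -u +%Y-%m-%dT%H:%M:%SZ` (ISO 8601 in UTC) and `date '+%F %T'` (date and time with no zone marker). The difference matters to any downstream parser, so a small Python sketch of both renderings follows. The function names are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone


def iso_utc(ts: datetime) -> str:
    """Render like `date -u +%Y-%m-%dT%H:%M:%SZ`: UTC with an explicit Z suffix."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def plain_wall_clock(ts: datetime) -> str:
    """Render like `date '+%F %T'`: wall-clock time with no zone information."""
    return ts.strftime("%Y-%m-%d %H:%M:%S")


sample = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(iso_utc(sample))           # 2024-01-02T03:04:05Z
print(plain_wall_clock(sample))  # 2024-01-02 03:04:05
```

With the second format the reader of the log has to know the writer's time zone, because the string itself no longer carries it.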
@ -13,8 +13,6 @@ class AppConfig:
|
||||
scheduler_interval_seconds: int
|
||||
node_id_prefix: str
|
||||
auth_mode: str
|
||||
target_prefer_net_cidrs: str
|
||||
target_reachability_check: bool
|
||||
|
||||
|
||||
def _get_int_env(name: str, default: int) -> int:
|
||||
@ -29,12 +27,6 @@ def _get_int_env(name: str, default: int) -> int:
|
||||
|
||||
def load_config() -> AppConfig:
|
||||
"""Build the config object from environment variables so runtime parameters are managed in one place."""
|
||||
def _bool_env(name: str, default: bool) -> bool:
|
||||
raw = os.environ.get(name)
|
||||
if raw is None or raw.strip() == "":
|
||||
return default
|
||||
return raw.strip().lower() in ("1", "true", "yes", "on")
|
||||
|
||||
return AppConfig(
|
||||
db_path=os.environ.get("DB_PATH", "/private/argus/master/db.sqlite3"),
|
||||
metric_nodes_json_path=os.environ.get(
|
||||
@ -45,6 +37,4 @@ def load_config() -> AppConfig:
|
||||
scheduler_interval_seconds=_get_int_env("SCHEDULER_INTERVAL_SECONDS", 30),
|
||||
node_id_prefix=os.environ.get("NODE_ID_PREFIX", "A"),
|
||||
auth_mode=os.environ.get("AUTH_MODE", "disabled"),
|
||||
target_prefer_net_cidrs=os.environ.get("TARGET_PREFER_NET_CIDRS", "10.0.0.0/8,172.31.0.0/16"),
|
||||
target_reachability_check=_bool_env("TARGET_REACHABILITY_CHECK", False),
|
||||
)
|
||||
|
||||
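The `_bool_env` helper in this hunk treats an unset or blank variable as "use the default" and accepts `1`, `true`, `yes`, and `on` case-insensitively. A standalone sketch of the same rule is below; the `environ` parameter is an addition for testability and does not exist in the original helper.

```python
import os
from typing import Mapping, Optional

TRUTHY = ("1", "true", "yes", "on")


def bool_env(name: str, default: bool, environ: Optional[Mapping[str, str]] = None) -> bool:
    """Parse a boolean flag from the environment.

    `environ` is a hypothetical injection point; it falls back to os.environ.
    """
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default  # unset or blank means "not specified"
    return raw.strip().lower() in TRUTHY


print(bool_env("TARGET_REACHABILITY_CHECK", False, {"TARGET_REACHABILITY_CHECK": " Yes "}))  # True
```

Any value outside the truthy set, including typos such as `ture`, evaluates to False rather than raising, which is worth keeping in mind when debugging configuration.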
@ -1,10 +1,8 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import ipaddress
|
||||
import logging
|
||||
import socket
|
||||
import threading
|
||||
from typing import Optional, Iterable, Dict, Any, List
|
||||
from typing import Optional
|
||||
|
||||
from .config import AppConfig
|
||||
from .storage import Storage
|
||||
@ -36,117 +34,10 @@ class StatusScheduler:
|
||||
self._pending_nodes_json.set()
|
||||
|
||||
def generate_nodes_json(self) -> None:
|
||||
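"""Generate Prometheus scrape targets from the online nodes, preferring the overlay IP.
|
||||
|
||||
Candidate order: meta.overlay_ip > hostname A record (matching a preferred CIDR) > meta.ip.
|
||||
Optional reachability check: when TARGET_REACHABILITY_CHECK=true, try one 1s TCP connection to 9100/9400
|
||||
and pick the first reachable candidate; if all fail, take the first candidate in order and log it.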
|
||||
"""
|
||||
with self._nodes_json_lock:
|
||||
rows = self._storage.get_online_nodes_meta()
|
||||
prefer_cidrs = self._parse_cidrs(self._config.target_prefer_net_cidrs)
|
||||
reachability = self._config.target_reachability_check
|
||||
|
||||
result: List[Dict[str, Any]] = []
|
||||
for row in rows:
|
||||
meta = row.get("meta", {})
|
||||
hostname = meta.get("hostname") or row.get("name")
|
||||
labels = row.get("labels") or []
|
||||
|
||||
overlay_ip = meta.get("overlay_ip")
|
||||
legacy_ip = meta.get("ip")
|
||||
host_candidates = self._resolve_host_ips(hostname)
|
||||
host_pref = self._pick_by_cidrs(host_candidates, prefer_cidrs)
|
||||
|
||||
candidates: List[str] = []
|
||||
for ip in [overlay_ip, host_pref, legacy_ip]:
|
||||
if ip and ip not in candidates:
|
||||
candidates.append(ip)
|
||||
|
||||
chosen = None
|
||||
if reachability:
|
||||
ports = [9100]
|
||||
try:
|
||||
if int(meta.get("gpu_number", 0)) > 0:
|
||||
ports.append(9400)
|
||||
except Exception:
|
||||
pass
|
||||
for ip in candidates:
|
||||
if any(self._reachable(ip, p, 1.0) for p in ports):
|
||||
chosen = ip
|
||||
break
|
||||
if not chosen:
|
||||
chosen = candidates[0] if candidates else legacy_ip
|
||||
if not chosen:
|
||||
# ultimate fallback: 127.0.0.1 (should not happen)
|
||||
chosen = "127.0.0.1"
|
||||
self._logger.warning("No candidate IPs for node; falling back", extra={"node": row.get("node_id")})
|
||||
|
||||
if chosen and ipaddress.ip_address(chosen) in ipaddress.ip_network("172.22.0.0/16"):
|
||||
self._logger.warning(
|
||||
"Prometheus target uses docker_gwbridge address; prefer overlay",
|
||||
extra={"node": row.get("node_id"), "ip": chosen},
|
||||
)
|
||||
|
||||
result.append(
|
||||
{
|
||||
"node_id": row.get("node_id"),
|
||||
"user_id": meta.get("user"),
|
||||
"ip": chosen,
|
||||
"hostname": hostname,
|
||||
"labels": labels if isinstance(labels, list) else [],
|
||||
}
|
||||
)
|
||||
|
||||
atomic_write_json(self._config.metric_nodes_json_path, result)
|
||||
self._logger.info("nodes.json updated", extra={"count": len(result)})
|
||||
|
||||
# ---------------------------- helpers ----------------------------
|
||||
@staticmethod
|
||||
def _parse_cidrs(raw: str) -> List[ipaddress.IPv4Network]:
|
||||
nets: List[ipaddress.IPv4Network] = []
|
||||
for item in (x.strip() for x in (raw or "").split(",")):
|
||||
if not item:
|
||||
continue
|
||||
try:
|
||||
net = ipaddress.ip_network(item, strict=False)
|
||||
if isinstance(net, ipaddress.IPv4Network):
|
||||
nets.append(net)
|
||||
except ValueError:
|
||||
continue
|
||||
return nets
|
||||
|
||||
@staticmethod
|
||||
def _resolve_host_ips(hostname: str) -> List[str]:
|
||||
ips: List[str] = []
|
||||
try:
|
||||
infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
|
||||
for info in infos:
|
||||
ip = info[4][0]
|
||||
if ip not in ips:
|
||||
ips.append(ip)
|
||||
except OSError:
|
||||
pass
|
||||
return ips
|
||||
|
||||
@staticmethod
|
||||
def _pick_by_cidrs(candidates: Iterable[str], prefer: List[ipaddress.IPv4Network]) -> str | None:
|
||||
for net in prefer:
|
||||
for ip in candidates:
|
||||
try:
|
||||
if ipaddress.ip_address(ip) in net:
|
||||
return ip
|
||||
except ValueError:
|
||||
continue
|
||||
return None
|
||||
|
||||
@staticmethod
|
||||
def _reachable(ip: str, port: int, timeout: float) -> bool:
|
||||
try:
|
||||
with socket.create_connection((ip, port), timeout=timeout):
|
||||
return True
|
||||
except OSError:
|
||||
return False
|
||||
online_nodes = self._storage.get_online_nodes()
|
||||
atomic_write_json(self._config.metric_nodes_json_path, online_nodes)
|
||||
self._logger.info("nodes.json updated", extra={"count": len(online_nodes)})
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# internal loop
|
||||
|
||||
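The candidate ordering described in the `generate_nodes_json` docstring (overlay IP, then a hostname-resolved address inside a preferred CIDR, then the legacy `meta.ip`) can be sketched without the scheduler class. This is an illustrative reduction under stated assumptions: `choose_target` is a hypothetical name, DNS resolution is replaced by a plain list argument, and the optional reachability probe is left out.

```python
import ipaddress
from typing import Iterable, List, Optional, Sequence


def pick_by_cidrs(candidates: Iterable[str], prefer: Sequence[ipaddress.IPv4Network]) -> Optional[str]:
    """Return the first candidate that falls inside the earliest matching preferred network."""
    pool = list(candidates)
    for net in prefer:
        for ip in pool:
            try:
                if ipaddress.ip_address(ip) in net:
                    return ip
            except ValueError:
                continue  # skip strings that are not valid IP addresses
    return None


def choose_target(
    overlay_ip: Optional[str],
    host_ips: Iterable[str],
    legacy_ip: Optional[str],
    prefer_cidrs: Iterable[str],
) -> Optional[str]:
    """Apply the order overlay > hostname-in-preferred-CIDR > legacy and return the first hit."""
    nets = [ipaddress.ip_network(c, strict=False) for c in prefer_cidrs]
    host_pref = pick_by_cidrs(host_ips, nets)
    ordered: List[str] = []
    for ip in (overlay_ip, host_pref, legacy_ip):
        if ip and ip not in ordered:
            ordered.append(ip)
    return ordered[0] if ordered else None


print(choose_target(None, ["172.22.0.5", "10.0.1.7"], "192.168.1.2", ["10.0.0.0/8"]))  # 10.0.1.7
```

Because the preferred networks are scanned in order, listing `10.0.0.0/8` before `172.31.0.0/16` makes an address in the first range win even if the resolver returned it last.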
@ -324,35 +324,9 @@ class Storage:
|
||||
{
|
||||
"node_id": row["id"],
|
||||
"user_id": meta.get("user"),
|
||||
"ip": meta.get("ip"), # kept for backward-compat; preferred IP selection handled in scheduler
|
||||
"ip": meta.get("ip"),
|
||||
"hostname": meta.get("hostname", row["name"]),
|
||||
"labels": labels if isinstance(labels, list) else [],
|
||||
}
|
||||
)
|
||||
return result
|
||||
|
||||
def get_online_nodes_meta(self) -> List[Dict[str, Any]]:
|
||||
"""Return the raw meta, name, and labels of online nodes; the caller selects the target IP.
|
||||
|
||||
Each item contains: { node_id, name, meta, labels }
|
||||
"""
|
||||
with self._lock:
|
||||
cur = self._conn.execute(
|
||||
"SELECT id, name, meta_json, labels_json FROM nodes WHERE status = ? ORDER BY id ASC",
|
||||
("online",),
|
||||
)
|
||||
rows = cur.fetchall()
|
||||
|
||||
result: List[Dict[str, Any]] = []
|
||||
for row in rows:
|
||||
meta = json.loads(row["meta_json"]) if row["meta_json"] else {}
|
||||
labels = json.loads(row["labels_json"]) if row["labels_json"] else []
|
||||
result.append(
|
||||
{
|
||||
"node_id": row["id"],
|
||||
"name": row["name"],
|
||||
"meta": meta if isinstance(meta, dict) else {},
|
||||
"labels": labels if isinstance(labels, list) else [],
|
||||
}
|
||||
)
|
||||
return result
|
||||
|
||||
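To see what `get_online_nodes_meta` returns in practice, the query can be run against an in-memory SQLite database. The sketch below reuses the SELECT statement from the hunk, but the `CREATE TABLE` schema and the sample rows are assumptions made only so the example is runnable; the real table definition is not shown in this diff.

```python
import json
import sqlite3
from typing import Any, Dict, List


def get_online_nodes_meta(conn: sqlite3.Connection) -> List[Dict[str, Any]]:
    """Return node_id, name, meta, and labels for every node whose status is 'online'."""
    conn.row_factory = sqlite3.Row  # allow access by column name
    rows = conn.execute(
        "SELECT id, name, meta_json, labels_json FROM nodes WHERE status = ? ORDER BY id ASC",
        ("online",),
    ).fetchall()
    result: List[Dict[str, Any]] = []
    for row in rows:
        meta = json.loads(row["meta_json"]) if row["meta_json"] else {}
        labels = json.loads(row["labels_json"]) if row["labels_json"] else []
        result.append(
            {
                "node_id": row["id"],
                "name": row["name"],
                "meta": meta if isinstance(meta, dict) else {},
                "labels": labels if isinstance(labels, list) else [],
            }
        )
    return result


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (id TEXT PRIMARY KEY, name TEXT, status TEXT, meta_json TEXT, labels_json TEXT)"
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?, ?)",
    [
        ("A1", "host01", "online", '{"ip": "10.0.0.5"}', '["gpu"]'),
        ("A2", "host02", "offline", '{"ip": "10.0.0.6"}', None),
        ("A3", "host03", "online", None, None),
    ],
)
nodes = get_online_nodes_meta(conn)
print([n["node_id"] for n in nodes])  # ['A1', 'A3']
```

Rows with NULL JSON columns come back with an empty dict and an empty list, so the caller never has to handle None.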
2
src/metric/.gitignore
vendored
@ -4,4 +4,4 @@
|
||||
/client-plugins/demo-all-in-one/publish/
|
||||
/client-plugins/demo-all-in-one/checklist
|
||||
/client-plugins/demo-all-in-one/VERSION
|
||||
/client-plugins/all-in-one-full/artifact/
|
||||
/client-plugins/all-in-one-full/
|
||||
|
||||
@ -104,26 +104,7 @@ log_info "File owner: $OWNER"
|
||||
|
||||
# Ensure the publish directory exists
|
||||
log_info "Ensuring the publish directory exists: $PUBLISH_DIR"
|
||||
mkdir -p "$PUBLISH_DIR"
|
||||
|
||||
IFS=':' read -r OWNER_UID OWNER_GID <<< "$OWNER"
|
||||
if [[ -z "$OWNER_UID" || -z "$OWNER_GID" ]]; then
|
||||
log_error "Invalid --owner format; expected uid:gid"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
CURRENT_UID=$(id -u)
|
||||
CURRENT_GID=$(id -g)
|
||||
if [[ "$OWNER_UID" != "$CURRENT_UID" || "$OWNER_GID" != "$CURRENT_GID" ]]; then
|
||||
if [[ "$CURRENT_UID" -ne 0 ]]; then
|
||||
log_error "Current user (${CURRENT_UID}:${CURRENT_GID}) cannot set the owner to ${OWNER_UID}:${OWNER_GID}"
|
||||
log_error "Run the script as the target user or adjust the directory permissions beforehand"
|
||||
exit 1
|
||||
fi
|
||||
NEED_CHOWN=true
|
||||
else
|
||||
NEED_CHOWN=false
|
||||
fi
|
||||
sudo mkdir -p "$PUBLISH_DIR"
|
||||
|
||||
# Create a temporary directory for packaging
|
||||
TEMP_PACKAGE_DIR="/tmp/argus-metric-package-$$"
|
||||
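The ownership logic in this hunk comes down to three cases: the requested `uid:gid` already matches the current user (no `chown` needed), the script runs as root (`chown` is possible), or neither holds (fail early). A minimal Python sketch of that decision follows. The function names are hypothetical, and the numeric validation is stricter than the shell version, which only checks that both fields are non-empty.

```python
from typing import Tuple


def parse_owner(owner: str) -> Tuple[int, int]:
    """Split 'uid:gid' into integers; reject anything else."""
    uid_s, sep, gid_s = owner.partition(":")
    if not sep or not uid_s.isdigit() or not gid_s.isdigit():
        raise ValueError("owner must have the form uid:gid")
    return int(uid_s), int(gid_s)


def need_chown(owner: str, current_uid: int, current_gid: int) -> bool:
    """Return True when files must be chown-ed, False when ownership already matches.

    Raises PermissionError when a change is required but the caller is not root.
    """
    uid, gid = parse_owner(owner)
    if (uid, gid) == (current_uid, current_gid):
        return False
    if current_uid != 0:
        raise PermissionError(
            f"user {current_uid}:{current_gid} cannot set owner to {uid}:{gid}"
        )
    return True


print(need_chown("1000:1000", 1000, 1000))  # False
print(need_chown("1000:1000", 0, 0))        # True
```

Deciding this once up front is what lets the rest of the script guard each `chown` with a single flag instead of calling `sudo` unconditionally.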
@ -227,31 +208,26 @@ fi
|
||||
TAR_NAME="argus-metric_$(echo $VERSION | tr '.' '_').tar.gz"
|
||||
log_info "Creating release package: $TAR_NAME"
|
||||
cd "$TEMP_PACKAGE_DIR"
|
||||
tar -czf "$PUBLISH_DIR/$TAR_NAME" *
|
||||
sudo tar -czf "$PUBLISH_DIR/$TAR_NAME" *
|
||||
cd - > /dev/null
|
||||
|
||||
if [[ "$NEED_CHOWN" == true ]]; then
|
||||
log_info "Setting file owner to: $OWNER"
|
||||
chown "$OWNER" "$PUBLISH_DIR/$TAR_NAME"
|
||||
fi
|
||||
# Set the file owner
|
||||
log_info "Setting file owner to: $OWNER"
|
||||
sudo chown "$OWNER" "$PUBLISH_DIR/$TAR_NAME"
|
||||
|
||||
# Clean up the temporary directory
|
||||
rm -rf "$TEMP_PACKAGE_DIR"
|
||||
|
||||
# Update the LATEST_VERSION file
|
||||
log_info "Updating the LATEST_VERSION file..."
|
||||
echo "$VERSION" > "$PUBLISH_DIR/LATEST_VERSION"
|
||||
if [[ "$NEED_CHOWN" == true ]]; then
|
||||
chown "$OWNER" "$PUBLISH_DIR/LATEST_VERSION"
|
||||
fi
|
||||
echo "$VERSION" | sudo tee "$PUBLISH_DIR/LATEST_VERSION" > /dev/null
|
||||
sudo chown "$OWNER" "$PUBLISH_DIR/LATEST_VERSION"
|
||||
|
||||
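# Copy the DNS config file to the root of the publish directory (taken directly from the config directory)
|
||||
if [[ -f "config/dns.conf" ]]; then
|
||||
log_info "Copying the DNS config file to the root of the publish directory..."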
|
||||
cp "config/dns.conf" "$PUBLISH_DIR/"
|
||||
if [[ "$NEED_CHOWN" == true ]]; then
|
||||
chown "$OWNER" "$PUBLISH_DIR/dns.conf"
|
||||
fi
|
||||
sudo cp "config/dns.conf" "$PUBLISH_DIR/"
|
||||
sudo chown "$OWNER" "$PUBLISH_DIR/dns.conf"
|
||||
log_success "DNS config file copied: $PUBLISH_DIR/dns.conf"
|
||||
else
|
||||
log_warning "config/dns.conf not found; skipping the DNS config file copy"
|
||||
@ -260,10 +236,8 @@ fi
|
||||
# Copy setup.sh to the publish directory
|
||||
if [[ -f "scripts/setup.sh" ]]; then
|
||||
log_info "Copying setup.sh to the publish directory..."
|
||||
cp "scripts/setup.sh" "$PUBLISH_DIR/"
|
||||
if [[ "$NEED_CHOWN" == true ]]; then
|
||||
chown "$OWNER" "$PUBLISH_DIR/setup.sh"
|
||||
fi
|
||||
sudo cp "scripts/setup.sh" "$PUBLISH_DIR/"
|
||||
sudo chown "$OWNER" "$PUBLISH_DIR/setup.sh"
|
||||
fi
|
||||
|
||||
# Show the publish result
|
||||
|
||||
@ -1,59 +0,0 @@
|
||||
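# Build and release workflow for client-side component installation packages
|
||||
|
||||
## Step 1: Configure the version and components
|
||||
|
||||
Start by setting up the configuration files:
|
||||
|
||||
1. Rename `.checklist.example` to `checklist`
|
||||
2. Rename `.VERSION.example` to `VERSION`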
|
||||
|
||||
### checklist file format
|
||||
```
|
||||
# component-name directory-path version [dependencies] [install-order]
|
||||
dcgm-exporter-installer /path/to/dcgm-exporter-installer 1.1.0
|
||||
node-exporter-installer /path/to/node-exporter-installer 1.1.0
|
||||
```
|
||||
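As a rough illustration of the checklist format above, the following Python sketch reads the three required fields and keeps any trailing optional fields unparsed. It assumes fields are separated by whitespace and that paths contain no spaces; neither is stated in the README. `ChecklistEntry` and `parse_checklist` are hypothetical names, not part of the packaging scripts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChecklistEntry:
    name: str
    path: str
    version: str
    # Optional trailing fields (dependencies, install order), kept as raw strings
    extra: List[str] = field(default_factory=list)


def parse_checklist(text: str) -> List[ChecklistEntry]:
    """Parse checklist text, skipping blank lines and '#' comments."""
    entries: List[ChecklistEntry] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 3:
            raise ValueError(f"expected at least 3 fields, got: {line!r}")
        entries.append(ChecklistEntry(parts[0], parts[1], parts[2], parts[3:]))
    return entries


SAMPLE = """\
# component-name directory-path version [dependencies] [install-order]
dcgm-exporter-installer /path/to/dcgm-exporter-installer 1.1.0
node-exporter-installer /path/to/node-exporter-installer 1.1.0
"""

entries = parse_checklist(SAMPLE)
print([(e.name, e.version) for e in entries])
```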
|
||||
### VERSION file
|
||||
Set the version number to release, for example `1.29.0`
|
||||
|
||||
> Using `version-manager.sh` to manage versions is recommended
|
||||
|
||||
## Step 2: Build the installation package
|
||||
|
||||
Run the script directly:
|
||||
```bash
|
||||
./package_artifact.sh
|
||||
```
|
||||
|
||||
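Build output is placed under the `artifact/` directory, in one folder per version.
|
||||
|
||||
If the version already exists and you want to overwrite it and rebuild: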
|
||||
```bash
|
||||
./package_artifact.sh --force
|
||||
```
|
||||
|
||||
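After the build, the installation package can be tested manually.
|
||||
|
||||
## Step 3: Publish the installation package
|
||||
|
||||
Publish with this script: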
|
||||
```bash
|
||||
./publish_artifact.sh
|
||||
```
|
||||
|
||||
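The published content is in the `publish/` directory and includes:
|
||||
- The compressed installation package
|
||||
- A one-step bash install script
|
||||
|
||||
## Step 4: Deploy to the FTP server
|
||||
|
||||
Upload the published content to the FTP server; clients can then install with a single command: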
|
||||
|
||||
```bash
|
||||
curl -fsSL http://your-ftp-server/install.sh | sh -
|
||||
|
||||
curl -fsSL "ftp://ftpuser:{PASSWD}!@10.211.55.4/share/setup.sh" | sudo bash -s -- --server 10.211.55.4 --user ftpuser --password {PASSWD}
|
||||
```
|
||||
|
||||
Customers can then download and install the components directly from the FTP server.
|
||||
@ -1 +0,0 @@
|
||||
1.29.0
|
||||
@ -1,3 +0,0 @@
|
||||
# component-name directory-path version [dependencies] [install-order]
|
||||
dcgm-exporter-installer /Users/sundapeng/Project/nlp/aiops/client-plugins/dcgm-exporter-installer 1.1.0
|
||||
node-exporter-installer /Users/sundapeng/Project/nlp/aiops/client-plugins/node-exporter-installer 1.1.0
|
||||
@ -1 +0,0 @@
|
||||
1.44.0
|
||||
@ -1,5 +0,0 @@
|
||||
# component-name directory-path version [dependencies] [install-order]
|
||||
argus-agent plugins/argus-agent 1.0.0
|
||||
node-exporter plugins/node-exporter 1.0.0
|
||||
dcgm-exporter plugins/dcgm-exporter 1.0.0
|
||||
fluent-bit plugins/fluent-bit 1.0.0
|
||||
@ -1,14 +0,0 @@
|
||||
# Elasticsearch
|
||||
ES_HOST=es.log.argus.com
|
||||
ES_PORT=9200
|
||||
|
||||
# Argus-Agent
|
||||
# Master service endpoint
|
||||
MASTER_ENDPOINT=master.argus.com:3000
|
||||
# Status report interval (seconds)
|
||||
REPORT_INTERVAL_SECONDS=5
|
||||
|
||||
# FTP
|
||||
FTP_SERVER=172.31.0.40
|
||||
FTP_USER=ftpuser
|
||||
FTP_PASSWORD=ZGClab1234!
|
||||
@ -1,8 +0,0 @@
|
||||
# Argus Metric example configuration file
|
||||
# Copy this file to config.env and adjust the settings as needed
|
||||
|
||||
# Master service endpoint
|
||||
MASTER_ENDPOINT=master.argus.com:3000
|
||||
|
||||
# Status report interval (seconds)
|
||||
REPORT_INTERVAL_SECONDS=60
|
||||
@ -1 +0,0 @@
|
||||
172.31.0.2
|
||||
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@ -1 +0,0 @@
|
||||
bin/
|
||||
@ -1,94 +0,0 @@
|
||||
# Argus Agent plugin
|
||||
|
||||
This plugin installs and manages Argus Agent. It covers installation, removal, and health checks.
|
||||
|
||||
## File layout
|
||||
|
||||
```
|
||||
argus-agent/
|
||||
├── bin/
|
||||
│ └── argus-agent # Argus Agent binary
|
||||
├── config/ # Configuration directory
|
||||
├── install.sh # Install script
|
||||
├── uninstall.sh # Uninstall script
|
||||
├── check_health.sh # Health check script
|
||||
├── package.sh # Packaging script
|
||||
└── README.md # Documentation
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
### Install
|
||||
|
||||
```bash
|
||||
sudo ./install.sh
|
||||
```
|
||||
|
||||
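The install script will:
|
||||
- Check the system requirements
|
||||
- Stop any service that may already be running
|
||||
- Install the binary to `/usr/local/bin/argus-agent`
|
||||
- Create the `argus-agent` user
|
||||
- Create the configuration and data directories
|
||||
- Start the service and record its PID
|
||||
|
||||
### Uninstall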
|
||||
|
||||
```bash
|
||||
sudo ./uninstall.sh
|
||||
```
|
||||
|
||||
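The uninstall script will:
|
||||
- Stop all argus-agent processes
|
||||
- Delete the binary
|
||||
- Delete the configuration and data directories
|
||||
- Clean up the log files
|
||||
- Update the install record
|
||||
|
||||
### Health check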
|
||||
|
||||
```bash
|
||||
./check_health.sh
|
||||
```
|
||||
|
||||
The health check script will:
|
||||
- Read the PID from the install record
|
||||
- Verify that the process is running
|
||||
- Print the health status as JSON
|
||||
|
||||
### Package
|
||||
|
||||
```bash
|
||||
./package.sh
|
||||
```
|
||||
|
||||
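The packaging script will:
|
||||
- Check that all required files are present
|
||||
- Create an archive named with a timestamp
|
||||
- Print the package information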
|
||||
|
||||
## File locations after installation
|
||||
|
||||
- Binary: `/usr/local/bin/argus-agent`
|
||||
- Configuration directory: `/etc/argus-agent/`
|
||||
- Data directory: `/var/lib/argus-agent/`
|
||||
- Log file: `/var/log/argus-agent.log`
|
||||
- PID file: `/var/run/argus-agent.pid`
|
||||
- Install record: `/opt/argus-metric/current/.install_record`
|
||||
|
||||
## Health check output format
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "argus-agent",
|
||||
"status": "health|unhealth",
|
||||
"reason": "status description"
|
||||
}
|
||||
```
|
||||
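A consumer of this output only needs to parse one JSON object and compare `status` against the literal `health`. The sketch below shows one way to do that in Python. The function names are hypothetical, and the sample string is made up for illustration rather than captured from a real run.

```python
import json
from typing import Dict

REQUIRED_FIELDS = {"name", "status", "reason"}


def parse_health(output: str) -> Dict[str, str]:
    """Parse one health-check JSON line and verify that the documented fields exist."""
    record = json.loads(output)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record


def is_healthy(output: str) -> bool:
    """True only when status is exactly 'health' (the script uses 'unhealth' otherwise)."""
    return parse_health(output)["status"] == "health"


sample = '{"name": "argus-agent", "status": "health", "reason": "process running (PID: 1234)"}'
print(is_healthy(sample))  # True
```

Since `check_health.sh` also sets its exit code (0 for healthy, 1 otherwise), callers that only need a yes/no answer can rely on the exit status and skip JSON parsing altogether.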
|
||||
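## Notes
|
||||
|
||||
1. The install and uninstall scripts require root privileges
|
||||
2. The health check script uses the PID in the install record to verify the process state
|
||||
3. If the jq command is unavailable, the health check falls back to simple text parsing
|
||||
4. Uninstalling keeps the `argus-agent` user so that other services are not affected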
|
||||
@ -1,69 +0,0 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Argus Agent health check script
|
||||
# Prints the result as JSON
|
||||
|
||||
set -e
|
||||
|
||||
# Check the health of Argus Agent
|
||||
check_health() {
|
||||
local name="argus-agent"
|
||||
local status="unhealth"
|
||||
local reason=""
|
||||
local install_record="/opt/argus-metric/current/.install_record"
|
||||
|
||||
# First try to check the process through the install record file
|
||||
if [[ -f "$install_record" ]]; then
|
||||
# Try to parse the JSON install record with jq
|
||||
local pid=""
|
||||
if command -v jq &> /dev/null; then
|
||||
pid=$(jq -r '.components."argus-agent".pid // empty' "$install_record" 2>/dev/null || echo "")
|
||||
else
|
||||
# Without jq, fall back to simple text parsing
|
||||
pid=$(grep -A 10 '"argus-agent"' "$install_record" | grep '"pid"' | cut -d'"' -f4 | head -1)
|
||||
fi
|
||||
|
||||
if [[ -n "$pid" && "$pid" =~ ^[0-9]+$ ]]; then
|
||||
if kill -0 "$pid" 2>/dev/null; then
|
||||
# The process exists and is running normally
|
||||
status="health"
|
||||
reason="Process is running normally (PID: $pid)"
|
||||
echo "{\"name\": \"$name\", \"status\": \"$status\", \"reason\": \"$reason\"}"
|
||||
exit 0
|
||||
else
|
||||
reason="Process with PID $pid from the install record does not exist"
|
||||
echo "{\"name\": \"$name\", \"status\": \"$status\", \"reason\": \"$reason\"}"
|
||||
exit 1
|
||||
fi
|
||||
else
|
||||
reason="No valid argus-agent PID found in the install record"
|
||||
echo "{\"name\": \"$name\", \"status\": \"$status\", \"reason\": \"$reason\"}"
|
||||
exit 1
|
||||
fi
|
||||
else
|
||||
# If the install record file does not exist, look for an argus-agent process
|
||||
local pids=$(pgrep -f "argus-agent" 2>/dev/null || true)
|
||||
if [[ -n "$pids" ]]; then
|
||||
# Take the first PID found
|
||||
local pid=$(echo "$pids" | head -1)
|
||||
status="health"
|
||||
reason="argus-agent process is running (PID: $pid), but no install record was found"
|
||||
echo "{\"name\": \"$name\", \"status\": \"$status\", \"reason\": \"$reason\"}"
|
||||
exit 0
|
||||
else
|
||||
reason="No argus-agent process found and the install record file does not exist"
|
||||
echo "{\"name\": \"$name\", \"status\": \"$status\", \"reason\": \"$reason\"}"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
}
|
||||
|
||||
# Main function
|
||||
main() {
|
||||
check_health
|
||||
}
|
||||
|
||||
# Script entry point
|
||||
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
|
||||
main "$@"
|
||||
fi
|
||||
Some files were not shown because too many files have changed in this diff