Compare commits
22 commits (main...dev_1.0.0_)

| SHA1 |
|---|
| b6d49ea562 |
| 5211769ba8 |
| 28ef5df6e4 |
| 687d29993a |
| 54185101f0 |
| 018cdb40cc |
| 36290bb08f |
| d514283cda |
| bfa1d71c49 |
| cb659f581b |
| 55f9317a1b |
| f305a65abf |
| 46c34f3de6 |
| 8e2d751c2c |
| ab44edc22f |
| 586740a17a |
| 5fa6e80578 |
| e941537dd5 |
| 4fb8652177 |
| 1e5e91b193 |
| 8a38d3d0b2 |
| 26e1c964ed |
1 .gitattributes (vendored)
@@ -1 +0,0 @@
src/metric/client-plugins/all-in-one-full/plugins/*/bin/* filter=lfs diff=lfs merge=lfs -text
150 build/README.md
@@ -1,150 +0,0 @@
# ARGUS Unified Build Script Guide (build/build_images.sh)

This directory provides a single entry-point script, `build/build_images.sh`, covering three common scenarios:
- System integration tests (src/sys/tests)
- Swarm system integration tests (src/sys/swarm_tests)
- Building offline installation packages (deployment_new: Server / Client-GPU)

This document also covers the UID/GID resolution rules, the image tag policy, common options, and the retry mechanism.

## Prerequisites
- Docker Engine ≥ 20.10 (≥ 23.x/24.x recommended)
- Docker Compose v2 (the `docker compose` subcommand)
- Optional: intranet build mirror (`--intranet`)

## UID/GID Rules (container user / volume ownership)
- Non-pkg builds (core/master/metric/web/alert/sys/gpu_bundle/cpu_bundle):
  - read `configs/build_user.local.conf` → `configs/build_user.conf`;
  - overridable via the `ARGUS_BUILD_UID` and `ARGUS_BUILD_GID` environment variables;
- pkg builds (`--only server_pkg`, `--only client_pkg`):
  - read `configs/build_user.pkg.conf` (preferred) → `build_user.local.conf` → `build_user.conf`;
  - likewise overridable via environment variables;
- The CPU bundle explicitly follows the non-pkg chain (it does not read `build_user.pkg.conf`).
- Note: only Docker layers that depend on UID/GID are rebuilt when these values change, so different build profiles cannot produce mismatched packages; see the sketch below.
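
A minimal sketch of the resolution chain, assuming each conf file carries `UID=`/`GID=` lines as in `configs/build_user.conf` (the authoritative logic lives in `scripts/common/build_user.sh`):

```bash
#!/usr/bin/env bash
# Sketch: resolve the build UID/GID; the pkg profile consults build_user.pkg.conf first.
resolve_build_user() {
  local profile="${ARGUS_BUILD_PROFILE:-default}" f uid="" gid=""
  local candidates=(configs/build_user.local.conf configs/build_user.conf)
  [[ "$profile" == "pkg" ]] && candidates=(configs/build_user.pkg.conf "${candidates[@]}")
  for f in "${candidates[@]}"; do
    [[ -f "$f" ]] || continue
    uid=$(sed -n 's/^UID=\([0-9][0-9]*\)$/\1/p' "$f" | head -n1)
    gid=$(sed -n 's/^GID=\([0-9][0-9]*\)$/\1/p' "$f" | head -n1)
    [[ -n "$uid" && -n "$gid" ]] && break
  done
  # Environment variables always win over config files; 2133/2015 are the shipped defaults.
  echo "${ARGUS_BUILD_UID:-${uid:-2133}}:${ARGUS_BUILD_GID:-${gid:-2015}}"
}

resolve_build_user   # e.g. prints "2133:2015"
```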

## Image Tag Policy
- Non-pkg builds: output `:latest` by default.
- `--only server_pkg`: all images are produced directly as `:<VERSION>` (`:latest` is not overwritten).
- `--only client_pkg`: the GPU bundle is produced only as `:<VERSION>` (`:latest` is not overwritten).
- `--only cpu_bundle`: outputs only `:<VERSION>` by default; add `--tag-latest` to also tag `:latest` for compatibility with the default swarm_tests compose.
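
To confirm which tags a build actually produced locally (plain `docker images`, no assumptions beyond the image names above):

```bash
# List all locally built ARGUS images with their tags
docker images --format '{{.Repository}}:{{.Tag}}' | grep '^argus-' | sort
```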

## Default Build Targets (without --only)
When `--only` is not given, the script builds the base image set (bundles and installer packages excluded):
- core: `argus-elasticsearch:latest`, `argus-kibana:latest`, `argus-bind9:latest`
- master: `argus-master:latest` (non-offline)
- metric: `argus-metric-ftp:latest`, `argus-metric-prometheus:latest`, `argus-metric-grafana:latest`
- web: `argus-web-frontend:latest`, `argus-web-proxy:latest`
- alert: `argus-alertmanager:latest`
- sys: `argus-sys-node:latest`, `argus-sys-metric-test-node:latest`, `argus-sys-metric-test-gpu-node:latest`

Note: the default tag is `:latest`; UID/GID follows the non-pkg chain (`build_user.local.conf → build_user.conf`, overridable via environment variables).
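
For example, the plain invocation builds exactly this base set:

```bash
# Default build: base image set, tagged :latest
./build/build_images.sh
```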

## Common Options
- `--intranet`: use intranet build arguments (enabled per Dockerfile as needed).
- `--no-cache`: disable the Docker layer cache.
- `--only <list>`: comma-separated targets, e.g. `--only core,master,metric,web,alert`.
- `--version YYYYMMDD`: date tag for bundles/packages (required for cpu_bundle/gpu_bundle/server_pkg/client_pkg).
- `--client-semver X.Y.Z`: semantic version of the all-in-one-full client (optional).
- `--cuda VER`: CUDA base image version for the GPU bundle (default 12.2.2).
- `--tag-latest`: also tag `:latest` when building the CPU bundle.

## Automatic Retry
- A failed single-image build is retried automatically (3 attempts by default, 5 s apart).
- The final attempt automatically falls back to `DOCKER_BUILDKIT=0` to mitigate "failed to receive status: context canceled".
- Tunable via the `ARGUS_BUILD_RETRIES` and `ARGUS_BUILD_RETRY_DELAY` environment variables, as shown below.
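
For example, on a flaky network the retry budget can be raised per invocation:

```bash
# Five attempts, 10 s apart, for the core/master images
ARGUS_BUILD_RETRIES=5 ARGUS_BUILD_RETRY_DELAY=10 \
  ./build/build_images.sh --only core,master
```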

---

## Scenario 1: System Integration Tests (src/sys/tests)
Builds the images used for system-level end-to-end tests (default `:latest`).

Example:
```
# build core and companion images
./build/build_images.sh --only core,master,metric,web,alert,sys
```
Output:
- Local images: `argus-elasticsearch:latest`, `argus-kibana:latest`, `argus-master:latest`, `argus-metric-ftp:latest`, `argus-metric-prometheus:latest`, `argus-metric-grafana:latest`, `argus-alertmanager:latest`, `argus-web-frontend:latest`, `argus-web-proxy:latest`, `argus-sys-node:latest`, etc.

Notes:
- UID/GID is read from `build_user.local.conf → build_user.conf` (or overridden via environment variables).
- See `src/sys/tests/README.md` for running sys/tests.

---

## Scenario 2: Swarm System Integration Tests (src/sys/swarm_tests)
Requires the server images plus the CPU node bundle image.

Steps:
1) Build the server images (default `:latest`)
```
./build/build_images.sh --only core,master,metric,web,alert
```
2) Build the CPU bundle (directly FROM ubuntu:22.04)
```
# version tag only
./build/build_images.sh --only cpu_bundle --version 20251114
# to stay compatible with the swarm_tests default of :latest:
./build/build_images.sh --only cpu_bundle --version 20251114 --tag-latest
```
3) Run the Swarm tests
```
cd src/sys/swarm_tests
# if :latest was not tagged, point at the versioned image first:
export NODE_BUNDLE_IMAGE_TAG=argus-sys-metric-test-node-bundle:20251114
./scripts/01_server_up.sh
./scripts/02_wait_ready.sh
./scripts/03_nodes_up.sh
./scripts/04_metric_verify.sh   # verifies Prometheus/Grafana/nodes.json and the log path
./scripts/99_down.sh            # tear down
```
Output:
- Local images: `argus-*:latest` and `argus-sys-metric-test-node-bundle:20251114` (or latest).
- `swarm_tests/private-*`: runtime persistence files.

Notes:
- The CPU bundle build user follows the non-pkg chain (`local.conf → conf`).
- `04_metric_verify.sh` has built-in Fluent Bit startup and config-repair logic; if something is occasionally not ready, rerunning it once usually passes.

---

## Scenario 3: Building Offline Installation Packages (deployment_new)
Both the Server and Client-GPU packages are built straight to versioned tags: only `:<VERSION>` is produced and `:latest` is left untouched.

1) Server package
```
./build/build_images.sh --only server_pkg --version 20251114
```
Output:
- Local images: `argus-<module>:20251114` (latest untouched).
- Package: `deployment_new/artifact/server/20251114/` and `server_20251114.tar.gz`
- Contents: per-image tar.gz files, compose/.env.example, scripts (config/install/selfcheck/diagnose, etc.), docs, manifest/checksums.

2) Client-GPU package
```
# also builds the GPU bundle (only :<VERSION>, latest untouched), then generates the client package
./build/build_images.sh --only client_pkg --version 20251114 \
  --client-semver 1.44.0 --cuda 12.2.2
```
Output:
- Local image: `argus-sys-metric-test-node-bundle-gpu:20251114`
- Package: `deployment_new/artifact/client_gpu/20251114/` and `client_gpu_20251114.tar.gz`
- Contents: GPU bundle image tar.gz, busybox.tar, compose/.env.example, scripts (config/install/uninstall), docs, manifest/checksums.

Notes:
- pkg builds use the UID/GID from `configs/build_user.pkg.conf` (environment overrides still apply).
- `PKG_VERSION=<VERSION>` in the package's `.env.example` matches the image tag exactly; a quick check is shown below.
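
A minimal consistency check, assuming the server package layout described above:

```bash
# Verify that the packaged PKG_VERSION and the local image tag agree
VER=20251114
grep -q "^PKG_VERSION=${VER}$" "deployment_new/artifact/server/${VER}/compose/.env.example" \
  && docker image inspect "argus-master:${VER}" >/dev/null \
  && echo "package and image versions agree"
```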

---

## FAQ
- Build fails with `failed to receive status: context canceled`?
  - Per-image retries are built in, with BuildKit disabled on the final attempt; retry with `--intranet` and `--no-cache`, or run `docker builder prune -f` first.
- Does running a non-pkg build (latest) first and a pkg build (version) afterwards risk producing a mismatched package?
  - No. Layers that involve UID/GID are rebuilt when the values change, other layers are reused from cache, and the final pkg artifacts take their ownership and runtime account from `build_user.pkg.conf`.
- swarm_tests pulls `:latest` by default, but I only built the CPU bundle as `:<VERSION>`. What now?
  - Before running, `export NODE_BUNDLE_IMAGE_TAG=argus-sys-metric-test-node-bundle:<VERSION>`, or add `--tag-latest` at build time.

---

Further automation (e.g. a BUILD_SUMMARY.txt aggregating image digests and build arguments) can be appended at the pkg output stage; I can add it on request.
@@ -10,45 +10,19 @@ Usage: $0 [OPTIONS]

Options:
  --intranet              Use intranet mirror for log/bind builds
  --master-offline        Build master offline image (requires src/master/offline_wheels.tar.gz)
  --metric                Build metric module images (ftp, prometheus, grafana, test nodes)
  --no-cache              Build all images without using Docker layer cache
  --only LIST             Comma-separated targets to build: core,master,metric,web,alert,sys,gpu_bundle,cpu_bundle,server_pkg,client_pkg,all
  --version DATE          Date tag used by gpu_bundle/cpu_bundle/server_pkg/client_pkg (e.g. 20251112)
  --client-semver X.Y.Z   Override client semver used in all-in-one-full artifact (optional)
  --cuda VER              CUDA runtime version for NVIDIA base (default: 12.2.2)
  --tag-latest            Also tag bundle image as :latest (cpu_bundle only; default off)
  -h, --help              Show this help message

Examples:
  $0                      # Build with default sources
  $0 --intranet           # Build with intranet mirror
  $0 --master-offline     # Additionally build argus-master:offline
  $0 --metric             # Additionally build metric module images
  $0 --intranet --master-offline --metric
  $0 --intranet --master-offline
EOF
}

use_intranet=false
build_core=true
build_master=true
build_master_offline=false
build_metric=true
build_web=true
build_alert=true
build_sys=true
build_gpu_bundle=false
build_cpu_bundle=false
build_server_pkg=false
build_client_pkg=false
need_bind_image=true
need_metric_ftp=true
no_cache=false

bundle_date=""
client_semver=""
cuda_ver="12.2.2"
DEFAULT_IMAGE_TAG="latest"
tag_latest=false

while [[ $# -gt 0 ]]; do
  case $1 in
@@ -65,55 +39,6 @@ while [[ $# -gt 0 ]]; do
      build_master_offline=true
      shift
      ;;
    --metric)
      build_metric=true
      shift
      ;;
    --no-cache)
      no_cache=true
      shift
      ;;
    --only)
      if [[ -z ${2:-} ]]; then
        echo "--only requires a target list" >&2; exit 1
      fi
      sel="$2"; shift 2
      # reset all, then enable selected
      build_core=false; build_master=false; build_metric=false; build_web=false; build_alert=false; build_sys=false; build_gpu_bundle=false; build_cpu_bundle=false; build_server_pkg=false; build_client_pkg=false
      IFS=',' read -ra parts <<< "$sel"
      for p in "${parts[@]}"; do
        case "$p" in
          core) build_core=true ;;
          master) build_master=true ;;
          metric) build_metric=true ;;
          web) build_web=true ;;
          alert) build_alert=true ;;
          sys) build_sys=true ;;
          gpu_bundle) build_gpu_bundle=true ;;
          cpu_bundle) build_cpu_bundle=true ;;
          server_pkg) build_server_pkg=true; build_core=true; build_master=true; build_metric=true; build_web=true; build_alert=true ;;
          client_pkg) build_client_pkg=true ;;
          all) build_core=true; build_master=true; build_metric=true; build_web=true; build_alert=true; build_sys=true ;;
          *) echo "Unknown --only target: $p" >&2; exit 1 ;;
        esac
      done
      ;;
    --version)
      if [[ -z ${2:-} ]]; then echo "--version requires a value like 20251112" >&2; exit 1; fi
      bundle_date="$2"; shift 2
      ;;
    --client-semver)
      if [[ -z ${2:-} ]]; then echo "--client-semver requires a value like 1.43.0" >&2; exit 1; fi
      client_semver="$2"; shift 2
      ;;
    --cuda)
      if [[ -z ${2:-} ]]; then echo "--cuda requires a value like 12.2.2" >&2; exit 1; fi
      cuda_ver="$2"; shift 2
      ;;
    --tag-latest)
      tag_latest=true
      shift
      ;;
    -h|--help)
      show_help
      exit 0
@@ -126,11 +51,6 @@ while [[ $# -gt 0 ]]; do
  esac
done

if [[ "$build_server_pkg" == true ]]; then
  need_bind_image=false
  need_metric_ftp=false
fi

root="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
. "$root/scripts/common/build_user.sh"

@@ -142,23 +62,9 @@ fi

cd "$root"

# Set default image tag policy before building
if [[ "$build_server_pkg" == true ]]; then
  DEFAULT_IMAGE_TAG="${bundle_date:-latest}"
fi

# Select build user profile for pkg vs default
if [[ "$build_server_pkg" == true || "$build_client_pkg" == true ]]; then
  export ARGUS_BUILD_PROFILE=pkg
fi

load_build_user
build_args+=("--build-arg" "ARGUS_BUILD_UID=${ARGUS_BUILD_UID}" "--build-arg" "ARGUS_BUILD_GID=${ARGUS_BUILD_GID}")

if [[ "$no_cache" == true ]]; then
  build_args+=("--no-cache")
fi

master_root="$root/src/master"
master_offline_tar="$master_root/offline_wheels.tar.gz"
master_offline_dir="$master_root/offline_wheels"
@@ -199,459 +105,45 @@ build_image() {
  local image_name=$1
  local dockerfile_path=$2
  local tag=$3
  local context="."
  shift 3

  if [[ $# -gt 0 ]]; then
    context=$1
    shift
  fi

  local extra_args=("$@")

  echo "🔄 Building $image_name image..."
  echo "   Dockerfile: $dockerfile_path"
  echo "   Tag: $tag"
  echo "   Context: $context"

  local tries=${ARGUS_BUILD_RETRIES:-3}
  local delay=${ARGUS_BUILD_RETRY_DELAY:-5}
  local attempt=1
  while (( attempt <= tries )); do
    local prefix=""
    if (( attempt == tries )); then
      # final attempt: disable BuildKit to avoid docker/dockerfile front-end pulls
      prefix="DOCKER_BUILDKIT=0"
      echo "   Attempt ${attempt}/${tries} (fallback: DOCKER_BUILDKIT=0)"
    else
      echo "   Attempt ${attempt}/${tries}"
    fi
    if eval $prefix docker build "${build_args[@]}" "${extra_args[@]}" -f "$dockerfile_path" -t "$tag" "$context"; then
      echo "✅ $image_name image built successfully"
      return 0
    fi
    echo "⚠️  Build failed for $image_name (attempt ${attempt}/${tries})."
    if (( attempt < tries )); then
      echo "   Retrying in ${delay}s..."
      sleep "$delay"
    fi
    attempt=$((attempt+1))
  done
  echo "❌ Failed to build $image_name image after ${tries} attempts"
  return 1
}

pull_base_image() {
  local image_ref=$1
  local attempts=${2:-3}
  local delay=${3:-5}

  # If the image already exists locally, skip pulling.
  if docker image inspect "$image_ref" >/dev/null 2>&1; then
    echo "   Local image present; skip pull: $image_ref"
    return 0
  fi

  for ((i=1; i<=attempts; i++)); do
    echo "   Pulling base image ($i/$attempts): $image_ref"
    if docker pull "$image_ref" >/dev/null; then
      echo "   Base image ready: $image_ref"
      return 0
    fi
    echo "   Pull failed: $image_ref"
    if (( i < attempts )); then
      echo "   Retrying in ${delay}s..."
      sleep "$delay"
    fi
  done

  echo "❌ Unable to pull base image after ${attempts} attempts: $image_ref"
  return 1
}

images_built=()
build_failed=false

build_gpu_bundle_image() {
  local date_tag="$1"        # e.g. 20251112
  local cuda_ver_local="$2"  # e.g. 12.2.2
  local client_ver="$3"      # semver like 1.43.0

  if [[ -z "$date_tag" ]]; then
    echo "❌ gpu_bundle requires --version YYYYMMDD (e.g. 20251112)" >&2
    return 1
  fi

  # sanitize cuda version (trim trailing dots like '12.2.')
  while [[ "$cuda_ver_local" == *"." ]]; do cuda_ver_local="${cuda_ver_local%.}"; done

  # Resolve effective CUDA base tag
  local resolve_cuda_base_tag
  resolve_cuda_base_tag() {
    local want="$1"  # can be 12, 12.2 or 12.2.2
    local major minor patch
    if [[ "$want" =~ ^([0-9]+)\.([0-9]+)\.([0-9]+)$ ]]; then
      major="${BASH_REMATCH[1]}"; minor="${BASH_REMATCH[2]}"; patch="${BASH_REMATCH[3]}"
      echo "nvidia/cuda:${major}.${minor}.${patch}-runtime-ubuntu22.04"; return 0
    elif [[ "$want" =~ ^([0-9]+)\.([0-9]+)$ ]]; then
      major="${BASH_REMATCH[1]}"; minor="${BASH_REMATCH[2]}"
      # try to find the best local patch for major.minor
      local best
      best=$(docker images --format '{{.Repository}}:{{.Tag}}' nvidia/cuda 2>/dev/null | \
        grep -E "^nvidia/cuda:${major}\.${minor}\.[0-9]+-runtime-ubuntu22\.04$" | \
        sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.)([0-9]+)-runtime-ubuntu22\.04$#\1\2#g' | \
        sort -V | tail -n1 || true)
      if [[ -n "$best" ]]; then
        echo "nvidia/cuda:${best}-runtime-ubuntu22.04"; return 0
      fi
      # fallback patch if none local
      echo "nvidia/cuda:${major}.${minor}.2-runtime-ubuntu22.04"; return 0
    elif [[ "$want" =~ ^([0-9]+)$ ]]; then
      major="${BASH_REMATCH[1]}"
      # try to find the best local tag for this major
      local best
      best=$(docker images --format '{{.Repository}}:{{.Tag}}' nvidia/cuda 2>/dev/null | \
        grep -E "^nvidia/cuda:${major}\.[0-9]+\.[0-9]+-runtime-ubuntu22\.04$" | \
        sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.[0-9]+)-runtime-ubuntu22\.04$#\1#g' | \
        sort -V | tail -n1 || true)
      if [[ -n "$best" ]]; then
        echo "nvidia/cuda:${best}-runtime-ubuntu22.04"; return 0
      fi
      echo "nvidia/cuda:${major}.2.2-runtime-ubuntu22.04"; return 0
    else
      # invalid format, fall back to the default
      echo "nvidia/cuda:12.2.2-runtime-ubuntu22.04"; return 0
    fi
  }

  local base_image
  base_image=$(resolve_cuda_base_tag "$cuda_ver_local")

  echo
  echo "🔧 Preparing one-click GPU bundle build"
  echo "   CUDA runtime base: ${base_image}"
  echo "   Bundle tag       : ${date_tag}"

  # 1) Ensure NVIDIA base image (skip pull if local)
  if ! pull_base_image "$base_image"; then
    # try once more with the default if resolution failed
    if ! pull_base_image "nvidia/cuda:12.2.2-runtime-ubuntu22.04"; then
      return 1
    else
      base_image="nvidia/cuda:12.2.2-runtime-ubuntu22.04"
    fi
  fi

  # 2) Build latest argus-agent from source
  echo "\n🛠 Building argus-agent from src/agent"
  pushd "$root/src/agent" >/dev/null
  if ! bash scripts/build_binary.sh; then
    echo "❌ argus-agent build failed" >&2
    popd >/dev/null
    return 1
  fi
  if [[ ! -f "dist/argus-agent" ]]; then
    echo "❌ argus-agent binary missing after build" >&2
    popd >/dev/null
    return 1
  fi
  popd >/dev/null

  # 3) Inject agent into the all-in-one-full plugin and package the artifact
  local aio_root="$root/src/metric/client-plugins/all-in-one-full"
  local agent_bin_src="$root/src/agent/dist/argus-agent"
  local agent_bin_dst="$aio_root/plugins/argus-agent/bin/argus-agent"
  echo "\n📦 Updating all-in-one-full agent binary → $agent_bin_dst"
  cp -f "$agent_bin_src" "$agent_bin_dst"
  chmod +x "$agent_bin_dst" || true

  pushd "$aio_root" >/dev/null
  local prev_version
  prev_version="$(cat config/VERSION 2>/dev/null || echo "1.0.0")"
  local use_version="$prev_version"
  if [[ -n "$client_semver" ]]; then
    echo "${client_semver}" > config/VERSION
    use_version="$client_semver"
  fi
  echo "   Packaging all-in-one-full artifact version: $use_version"
  if ! bash scripts/package_artifact.sh --force; then
    echo "❌ package_artifact.sh failed" >&2
    # restore VERSION if changed
    if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
    popd >/dev/null
    return 1
  fi

  local artifact_dir="$aio_root/artifact/$use_version"
  local artifact_tar
  artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  if [[ -z "$artifact_tar" ]]; then
    echo "   No argus-metric_*.tar.gz found; invoking publish_artifact.sh to assemble..."
    local owner="$(id -u):$(id -g)"
    if ! bash scripts/publish_artifact.sh "$use_version" --output-dir "$artifact_dir" --owner "$owner"; then
      echo "❌ publish_artifact.sh failed" >&2
      if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
      popd >/dev/null
      return 1
    fi
    artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  fi
  if [[ -z "$artifact_tar" ]]; then
    echo "❌ artifact tar not found under $artifact_dir" >&2
    if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
    popd >/dev/null
    return 1
  fi
  # restore VERSION if changed (keep the working tree clean)
  if [[ -n "$client_semver" ]]; then echo "$prev_version" > config/VERSION; fi
  popd >/dev/null

  # 4) Stage docker build context
  local bundle_ctx="$root/src/bundle/gpu-node-bundle/.build-$date_tag"
  echo "\n🧰 Staging docker build context: $bundle_ctx"
  rm -rf "$bundle_ctx"
  mkdir -p "$bundle_ctx/bundle" "$bundle_ctx/private"
  cp "$root/src/bundle/gpu-node-bundle/Dockerfile" "$bundle_ctx/"
  cp "$root/src/bundle/gpu-node-bundle/node-bootstrap.sh" "$bundle_ctx/"
  cp "$root/src/bundle/gpu-node-bundle/health-watcher.sh" "$bundle_ctx/"
  # bundle tar
  cp "$artifact_tar" "$bundle_ctx/bundle/"
  # offline fluent-bit assets (optional but useful)
  if [[ -d "$root/src/log/fluent-bit/build/etc" ]]; then
    cp -r "$root/src/log/fluent-bit/build/etc" "$bundle_ctx/private/"
  fi
  if [[ -d "$root/src/log/fluent-bit/build/packages" ]]; then
    cp -r "$root/src/log/fluent-bit/build/packages" "$bundle_ctx/private/"
  fi
  if [[ -f "$root/src/log/fluent-bit/build/start-fluent-bit.sh" ]]; then
    cp "$root/src/log/fluent-bit/build/start-fluent-bit.sh" "$bundle_ctx/private/"
  fi

  # 5) Build the final bundle image (directly from the NVIDIA base)
  local image_tag="argus-sys-metric-test-node-bundle-gpu:${date_tag}"
  echo "\n🔄 Building GPU Bundle image"
  if build_image "GPU Bundle" "$bundle_ctx/Dockerfile" "$image_tag" "$bundle_ctx" \
    --build-arg CUDA_VER="$(echo "$base_image" | sed -E 's#^nvidia/cuda:([0-9]+\.[0-9]+\.[0-9]+)-runtime-ubuntu22\.04$#\1#')" \
    --build-arg CLIENT_VER="$use_version" \
    --build-arg BUNDLE_DATE="$date_tag"; then
    images_built+=("$image_tag")
    # In non-pkg mode, also tag :latest for convenience
    if [[ "${ARGUS_PKG_BUILD:-0}" != "1" ]]; then
      docker tag "$image_tag" argus-sys-metric-test-node-bundle-gpu:latest >/dev/null 2>&1 || true
    fi
    return 0
  else
    return 1
  fi
}

# Tag helper: ensure :<date_tag> exists for a list of repos
ensure_version_tags() {
  local date_tag="$1"; shift
  local repos=("$@")
  for repo in "${repos[@]}"; do
    if docker image inspect "$repo:$date_tag" >/dev/null 2>&1; then
      :
    elif docker image inspect "$repo:latest" >/dev/null 2>&1; then
      docker tag "$repo:latest" "$repo:$date_tag" || true
    else
      echo "❌ missing image for tagging: $repo (need :latest or :$date_tag)" >&2
      return 1
    fi
  done
  return 0
}

# Build server package after images are built
build_server_pkg_bundle() {
  local date_tag="$1"
  if [[ -z "$date_tag" ]]; then
    echo "❌ server_pkg requires --version YYYYMMDD" >&2
    return 1
  fi
  local repos=(
    argus-master argus-elasticsearch argus-kibana \
    argus-metric-prometheus argus-metric-grafana \
    argus-alertmanager argus-web-frontend argus-web-proxy
  )
  echo "\n🔖 Verifying server images with :$date_tag and collecting digests (Bind/FTP excluded; relying on Docker DNS aliases)"
  for repo in "${repos[@]}"; do
    if ! docker image inspect "$repo:$date_tag" >/dev/null 2>&1; then
      echo "❌ required image missing: $repo:$date_tag (build phase should have produced it)" >&2
      return 1
    fi
  done
  # Optional: show digests
  for repo in "${repos[@]}"; do
    local digest
    digest=$(docker images --digests --format '{{.Repository}}:{{.Tag}} {{.Digest}}' | awk -v r="$repo:$date_tag" '$1==r{print $2}' | head -n1)
    printf '   • %s@%s\n' "$repo:$date_tag" "${digest:-<none>}"
  done
  echo "\n📦 Building server package via deployment_new/build/make_server_package.sh --version $date_tag"
  if ! "$root/deployment_new/build/make_server_package.sh" --version "$date_tag"; then
    echo "❌ make_server_package.sh failed" >&2
    return 1
  fi
  return 0
}

# Build client package: ensure the gpu bundle image exists, then package client_gpu
build_client_pkg_bundle() {
  local date_tag="$1"
  local semver="$2"
  local cuda="$3"
  if [[ -z "$date_tag" ]]; then
    echo "❌ client_pkg requires --version YYYYMMDD" >&2
    return 1
  fi
  local bundle_tag="argus-sys-metric-test-node-bundle-gpu:${date_tag}"
  if ! docker image inspect "$bundle_tag" >/dev/null 2>&1; then
    echo "\n🧩 GPU bundle image $bundle_tag missing; building it first..."
    ARGUS_PKG_BUILD=1
    export ARGUS_PKG_BUILD
    if ! build_gpu_bundle_image "$date_tag" "$cuda" "$semver"; then
      return 1
    fi
  else
    echo "\n✅ Using existing GPU bundle image: $bundle_tag"
  fi
  echo "\n📦 Building client GPU package via deployment_new/build/make_client_gpu_package.sh --version $date_tag --image $bundle_tag"
  if ! "$root/deployment_new/build/make_client_gpu_package.sh" --version "$date_tag" --image "$bundle_tag"; then
    echo "❌ make_client_gpu_package.sh failed" >&2
    return 1
  fi
  return 0
}

# Build CPU bundle image directly FROM ubuntu:22.04 (no intermediate base)
build_cpu_bundle_image() {
  local date_tag="$1"        # e.g. 20251113
  local client_ver_in="$2"   # semver like 1.43.0 (optional)
  local want_tag_latest="$3" # true/false

  if [[ -z "$date_tag" ]]; then
    echo "❌ cpu_bundle requires --version YYYYMMDD" >&2
    return 1
  fi

  echo "\n🔧 Preparing one-click CPU bundle build"
  echo "   Base: ubuntu:22.04"
  echo "   Bundle tag: ${date_tag}"

  # 1) Build latest argus-agent from source
  echo "\n🛠 Building argus-agent from src/agent"
  pushd "$root/src/agent" >/dev/null
  if ! bash scripts/build_binary.sh; then
    echo "❌ argus-agent build failed" >&2
    popd >/dev/null
    return 1
  fi
  if [[ ! -f "dist/argus-agent" ]]; then
    echo "❌ argus-agent binary missing after build" >&2
    popd >/dev/null
    return 1
  fi
  popd >/dev/null

  # 2) Inject agent into the all-in-one-full plugin and package the artifact
  local aio_root="$root/src/metric/client-plugins/all-in-one-full"
  local agent_bin_src="$root/src/agent/dist/argus-agent"
  local agent_bin_dst="$aio_root/plugins/argus-agent/bin/argus-agent"
  echo "\n📦 Updating all-in-one-full agent binary → $agent_bin_dst"
  cp -f "$agent_bin_src" "$agent_bin_dst"
  chmod +x "$agent_bin_dst" || true

  pushd "$aio_root" >/dev/null
  local prev_version use_version
  prev_version="$(cat config/VERSION 2>/dev/null || echo "1.0.0")"
  use_version="$prev_version"
  if [[ -n "$client_ver_in" ]]; then
    echo "$client_ver_in" > config/VERSION
    use_version="$client_ver_in"
  fi
  echo "   Packaging all-in-one-full artifact: version=$use_version"
  if ! bash scripts/package_artifact.sh --force; then
    echo "❌ package_artifact.sh failed" >&2
    [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
    popd >/dev/null
    return 1
  fi
  local artifact_dir="$aio_root/artifact/$use_version"
  local artifact_tar
  artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  if [[ -z "$artifact_tar" ]]; then
    echo "   No argus-metric_*.tar.gz found; invoking publish_artifact.sh ..."
    local owner="$(id -u):$(id -g)"
    if ! bash scripts/publish_artifact.sh "$use_version" --output-dir "$artifact_dir" --owner "$owner"; then
      echo "❌ publish_artifact.sh failed" >&2
      [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
      popd >/dev/null
      return 1
    fi
    artifact_tar="$(ls -1 "$artifact_dir"/argus-metric_*.tar.gz 2>/dev/null | head -n1 || true)"
  fi
  [[ -n "$client_ver_in" ]] && echo "$prev_version" > config/VERSION
  popd >/dev/null

  # 3) Stage docker build context
  local bundle_ctx="$root/src/bundle/cpu-node-bundle/.build-$date_tag"
  echo "\n🧰 Staging docker build context: $bundle_ctx"
  rm -rf "$bundle_ctx"
  mkdir -p "$bundle_ctx/bundle" "$bundle_ctx/private"
  cp "$root/src/bundle/cpu-node-bundle/Dockerfile" "$bundle_ctx/"
  cp "$root/src/bundle/cpu-node-bundle/node-bootstrap.sh" "$bundle_ctx/"
  cp "$root/src/bundle/cpu-node-bundle/health-watcher.sh" "$bundle_ctx/"
  # bundle tar
  cp "$artifact_tar" "$bundle_ctx/bundle/"
  # offline fluent-bit assets
  if [[ -d "$root/src/log/fluent-bit/build/etc" ]]; then
    cp -r "$root/src/log/fluent-bit/build/etc" "$bundle_ctx/private/"
  fi
  if [[ -d "$root/src/log/fluent-bit/build/packages" ]]; then
    cp -r "$root/src/log/fluent-bit/build/packages" "$bundle_ctx/private/"
  fi
  if [[ -f "$root/src/log/fluent-bit/build/start-fluent-bit.sh" ]]; then
    cp "$root/src/log/fluent-bit/build/start-fluent-bit.sh" "$bundle_ctx/private/"
  fi

  # 4) Build the final bundle image
  local image_tag="argus-sys-metric-test-node-bundle:${date_tag}"
  echo "\n🔄 Building CPU Bundle image"
  if build_image "CPU Bundle" "$bundle_ctx/Dockerfile" "$image_tag" "$bundle_ctx"; then
    images_built+=("$image_tag")
    if [[ "$want_tag_latest" == "true" ]]; then
      docker tag "$image_tag" argus-sys-metric-test-node-bundle:latest >/dev/null 2>&1 || true
    fi
    return 0
  else
    return 1
  fi
}

if [[ "$build_core" == true ]]; then
|
||||
if build_image "Elasticsearch" "src/log/elasticsearch/build/Dockerfile" "argus-elasticsearch:${DEFAULT_IMAGE_TAG}"; then
|
||||
images_built+=("argus-elasticsearch:${DEFAULT_IMAGE_TAG}")
|
||||
else
|
||||
build_failed=true
|
||||
fi
|
||||
|
||||
echo ""
|
||||
|
||||
if build_image "Kibana" "src/log/kibana/build/Dockerfile" "argus-kibana:${DEFAULT_IMAGE_TAG}"; then
|
||||
images_built+=("argus-kibana:${DEFAULT_IMAGE_TAG}")
|
||||
else
|
||||
build_failed=true
|
||||
fi
|
||||
|
||||
echo ""
|
||||
|
||||
if [[ "$need_bind_image" == true ]]; then
|
||||
if build_image "BIND9" "src/bind/build/Dockerfile" "argus-bind9:${DEFAULT_IMAGE_TAG}"; then
|
||||
images_built+=("argus-bind9:${DEFAULT_IMAGE_TAG}")
|
||||
else
|
||||
build_failed=true
|
||||
fi
|
||||
fi
|
||||
if build_image "BIND9" "src/bind/build/Dockerfile" "argus-bind9:latest"; then
|
||||
images_built+=("argus-bind9:latest")
|
||||
else
|
||||
build_failed=true
|
||||
fi
|
||||
|
||||
echo ""
|
||||
@@ -660,21 +152,18 @@ if [[ "$build_master" == true ]]; then
  echo ""
  echo "🔄 Building Master image..."
  pushd "$master_root" >/dev/null
  master_args=("--tag" "argus-master:${DEFAULT_IMAGE_TAG}")
  if [[ "$use_intranet" == true ]]; then
    master_args+=("--intranet")
  fi
  if [[ "$build_master_offline" == true ]]; then
    master_args+=("--offline")
  fi
  if [[ "$no_cache" == true ]]; then
    master_args+=("--no-cache")
  fi
  if ./scripts/build_images.sh "${master_args[@]}"; then
    if [[ "$build_master_offline" == true ]]; then
      images_built+=("argus-master:offline")
    else
      images_built+=("argus-master:${DEFAULT_IMAGE_TAG}")
    fi
  else
    build_failed=true
@@ -682,176 +171,6 @@ if [[ "$build_master" == true ]]; then
  popd >/dev/null
fi

if [[ "$build_metric" == true ]]; then
|
||||
echo ""
|
||||
echo "Building Metric module images..."
|
||||
|
||||
metric_base_images=(
|
||||
"ubuntu/prometheus:3-24.04_stable"
|
||||
"grafana/grafana:11.1.0"
|
||||
)
|
||||
|
||||
if [[ "$need_metric_ftp" == true ]]; then
|
||||
metric_base_images+=("ubuntu:22.04")
|
||||
fi
|
||||
|
||||
for base_image in "${metric_base_images[@]}"; do
|
||||
if ! pull_base_image "$base_image"; then
|
||||
build_failed=true
|
||||
fi
|
||||
done
|
||||
|
||||
metric_builds=()
|
||||
if [[ "$need_metric_ftp" == true ]]; then
|
||||
metric_builds+=("Metric FTP|src/metric/ftp/build/Dockerfile|argus-metric-ftp:${DEFAULT_IMAGE_TAG}|src/metric/ftp/build")
|
||||
fi
|
||||
metric_builds+=(
|
||||
"Metric Prometheus|src/metric/prometheus/build/Dockerfile|argus-metric-prometheus:${DEFAULT_IMAGE_TAG}|src/metric/prometheus/build"
|
||||
"Metric Grafana|src/metric/grafana/build/Dockerfile|argus-metric-grafana:${DEFAULT_IMAGE_TAG}|src/metric/grafana/build"
|
||||
)
|
||||
|
||||
for build_spec in "${metric_builds[@]}"; do
|
||||
IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
|
||||
if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
|
||||
images_built+=("$image_tag")
|
||||
else
|
||||
build_failed=true
|
||||
fi
|
||||
echo ""
|
||||
done
|
||||
fi
|
||||
|
||||
# =======================================
|
||||
# Sys (system tests) node images
|
||||
# =======================================
|
||||
|
||||
if [[ "$build_sys" == true ]]; then
|
||||
echo ""
|
||||
echo "Building Sys node images..."
|
||||
|
||||
sys_base_images=(
|
||||
"ubuntu:22.04"
|
||||
"nvidia/cuda:12.2.2-runtime-ubuntu22.04"
|
||||
)
|
||||
|
||||
for base_image in "${sys_base_images[@]}"; do
|
||||
if ! pull_base_image "$base_image"; then
|
||||
build_failed=true
|
||||
fi
|
||||
done
|
||||
|
||||
sys_builds=(
|
||||
"Sys Node|src/sys/build/node/Dockerfile|argus-sys-node:latest|."
|
||||
"Sys Metric Test Node|src/sys/build/test-node/Dockerfile|argus-sys-metric-test-node:latest|."
|
||||
"Sys Metric Test GPU Node|src/sys/build/test-gpu-node/Dockerfile|argus-sys-metric-test-gpu-node:latest|."
|
||||
)
|
||||
|
||||
for build_spec in "${sys_builds[@]}"; do
|
||||
IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
|
||||
if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
|
||||
images_built+=("$image_tag")
|
||||
else
|
||||
build_failed=true
|
||||
fi
|
||||
echo ""
|
||||
done
|
||||
fi
|
||||
|
||||
# =======================================
# Web & Alert module images
# =======================================

if [[ "$build_web" == true || "$build_alert" == true ]]; then
  echo ""
  echo "Building Web and Alert module images..."

  # Pre-pull commonly used base images for stability
  web_alert_base_images=(
    "node:20"
    "ubuntu:24.04"
  )

  for base_image in "${web_alert_base_images[@]}"; do
    if ! pull_base_image "$base_image"; then
      build_failed=true
    fi
  done

  if [[ "$build_web" == true ]]; then
    web_builds=(
      "Web Frontend|src/web/build_tools/frontend/Dockerfile|argus-web-frontend:${DEFAULT_IMAGE_TAG}|."
      "Web Proxy|src/web/build_tools/proxy/Dockerfile|argus-web-proxy:${DEFAULT_IMAGE_TAG}|."
    )
    for build_spec in "${web_builds[@]}"; do
      IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
      if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
        images_built+=("$image_tag")
      else
        build_failed=true
      fi
      echo ""
    done
  fi

  if [[ "$build_alert" == true ]]; then
    alert_builds=(
      "Alertmanager|src/alert/alertmanager/build/Dockerfile|argus-alertmanager:${DEFAULT_IMAGE_TAG}|."
    )
    for build_spec in "${alert_builds[@]}"; do
      IFS='|' read -r image_label dockerfile_path image_tag build_context <<< "$build_spec"
      if build_image "$image_label" "$dockerfile_path" "$image_tag" "$build_context"; then
        images_built+=("$image_tag")
      else
        build_failed=true
      fi
      echo ""
    done
  fi
fi

# =======================================
# One-click GPU bundle (direct NVIDIA base)
# =======================================

if [[ "$build_gpu_bundle" == true ]]; then
  echo ""
  echo "Building one-click GPU bundle image..."
  if ! build_gpu_bundle_image "$bundle_date" "$cuda_ver" "$client_semver"; then
    build_failed=true
  fi
fi

# =======================================
# One-click CPU bundle (from ubuntu:22.04)
# =======================================
if [[ "$build_cpu_bundle" == true ]]; then
  echo ""
  echo "Building one-click CPU bundle image..."
  if ! build_cpu_bundle_image "${bundle_date}" "${client_semver}" "${tag_latest}"; then
    build_failed=true
  fi
fi

# =======================================
# One-click Server/Client packaging
# =======================================

if [[ "$build_server_pkg" == true ]]; then
  echo ""
  echo "🧳 Building one-click Server package..."
  if ! build_server_pkg_bundle "${bundle_date}"; then
    build_failed=true
  fi
fi

if [[ "$build_client_pkg" == true ]]; then
  echo ""
  echo "🧳 Building one-click Client-GPU package..."
  if ! build_client_pkg_bundle "${bundle_date}" "${client_semver}" "${cuda_ver}"; then
    build_failed=true
  fi
fi

echo "======================================="
echo "📦 Build Summary"
echo "======================================="
@@ -878,6 +197,7 @@ if [[ "$build_master_offline" == true ]]; then
  echo ""
  echo "🧳 Master offline wheels extracted to $master_offline_dir"
fi

echo ""
echo "🚀 Next steps:"
echo "   ./build/save_images.sh --compress   # export images"

@@ -68,12 +68,6 @@ declare -A images=(
  ["argus-kibana:latest"]="argus-kibana-latest.tar"
  ["argus-bind9:latest"]="argus-bind9-latest.tar"
  ["argus-master:offline"]="argus-master-offline.tar"
  ["argus-metric-ftp:latest"]="argus-metric-ftp-latest.tar"
  ["argus-metric-prometheus:latest"]="argus-metric-prometheus-latest.tar"
  ["argus-metric-grafana:latest"]="argus-metric-grafana-latest.tar"
  ["argus-web-frontend:latest"]="argus-web-frontend-latest.tar"
  ["argus-web-proxy:latest"]="argus-web-proxy-latest.tar"
  ["argus-alertmanager:latest"]="argus-alertmanager-latest.tar"
)

# Function: check whether an image exists
@@ -226,4 +220,4 @@ fi

echo ""
echo "✅ Image export completed successfully!"
echo ""
@@ -1,6 +0,0 @@
# Default build-time UID/GID for Argus images
# Override by creating configs/build_user.local.conf with the same format.
# Syntax: KEY=VALUE, supports UID/GID only. Whitespace and lines starting with # are ignored.

UID=2133
GID=2015
1 deployment_new/.gitignore (vendored)
@@ -1 +0,0 @@
artifact/
@@ -1,14 +0,0 @@
# deployment_new

This directory holds the new deployment packaging and delivery implementation (it does not affect the existing `deployment/`).

Milestone M1 (current implementation)
- `build/make_server_package.sh`: generates the Server package (per-service image tar.gz files, compose, .env.example, docs, private skeleton, manifest/checksums, packed tar.gz).
- `build/make_client_gpu_package.sh`: generates the Client-GPU package (GPU bundle image tar.gz, busybox.tar, compose, .env.example, docs, private skeleton, manifest/checksums, packed tar.gz).

Templates
- `templates/server/compose/docker-compose.yml`: deployment-only; images default to the `:${PKG_VERSION}` version tag, overridable via `.env` (example below).
- `templates/client_gpu/compose/docker-compose.yml`: for GPU nodes only; uses the `:${PKG_VERSION}` version tag.
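
A hypothetical `.env` override, using variable names that `make_server_package.sh` writes into the package's `.env.example` (values here are placeholders):

```bash
# compose/.env (hypothetical values)
PKG_VERSION=20251114
# Pin one service to a different tag while the rest follow PKG_VERSION
MASTER_IMAGE_TAG=argus-master:20251115
```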

Note: M1 only produces the installation packages and does not include the installation scripts themselves; the install/ops scripts land in M2 and will then be included in the packages.

@@ -1,33 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

log()  { echo -e "\033[0;34m[INFO]\033[0m $*"; }
warn() { echo -e "\033[1;33m[WARN]\033[0m $*"; }
err()  { echo -e "\033[0;31m[ERR ]\033[0m $*" >&2; }

require_cmd() {
  local miss=0
  for c in "$@"; do
    if ! command -v "$c" >/dev/null 2>&1; then err "missing command: $c"; miss=1; fi
  done
  [[ $miss -eq 0 ]]
}

today_version() { date +%Y%m%d; }

checksum_dir() {
  local dir="$1"; local out="$2"; : > "$out"
  (cd "$dir" && find . -type f -print0 | sort -z | xargs -0 sha256sum) >> "$out"
}

make_dir() { mkdir -p "$1"; }

copy_tree() {
  local src="$1" dst="$2"
  rsync -a --delete "$src/" "$dst/" 2>/dev/null || cp -r "$src/." "$dst/"
}

gen_manifest() {
  local root="$1"; local out="$2"; : > "$out"
  (cd "$root" && find . -maxdepth 4 -type f -printf "%p\n" | sort) >> "$out"
}

@@ -1,131 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

# Make client GPU package (versioned gpu bundle image, compose, env, docs, busybox)

ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
TEMPL_DIR="$ROOT_DIR/deployment_new/templates/client_gpu"
ART_ROOT="$ROOT_DIR/deployment_new/artifact/client_gpu"

# Use deployment_new local common helpers
COMMON_SH="$ROOT_DIR/deployment_new/build/common.sh"
. "$COMMON_SH"

usage(){ cat <<EOF
Build Client-GPU Package (deployment_new)

Usage: $(basename "$0") --version YYYYMMDD [--image IMAGE[:TAG]]

Defaults:
  image = argus-sys-metric-test-node-bundle-gpu:latest

Outputs: deployment_new/artifact/client_gpu/<YYYYMMDD>/ and client_gpu_YYYYMMDD.tar.gz
EOF
}

VERSION=""
IMAGE="argus-sys-metric-test-node-bundle-gpu:latest"
while [[ $# -gt 0 ]]; do
  case "$1" in
    --version) VERSION="$2"; shift 2;;
    --image) IMAGE="$2"; shift 2;;
    -h|--help) usage; exit 0;;
    *) err "unknown arg: $1"; usage; exit 1;;
  esac
done
if [[ -z "$VERSION" ]]; then VERSION="$(today_version)"; fi

require_cmd docker tar gzip

STAGE="$(mktemp -d)"; trap 'rm -rf "$STAGE"' EXIT
PKG_DIR="$ART_ROOT/$VERSION"
mkdir -p "$PKG_DIR" "$STAGE/images" "$STAGE/compose" "$STAGE/docs" "$STAGE/scripts" "$STAGE/private/argus"

# 1) Save the GPU bundle image with a version tag
if ! docker image inspect "$IMAGE" >/dev/null 2>&1; then
  err "missing image: $IMAGE"; exit 1; fi

REPO="${IMAGE%%:*}"; TAG_VER="$REPO:$VERSION"
docker tag "$IMAGE" "$TAG_VER"
out_tar="$STAGE/images/${REPO//\//-}-$VERSION.tar"
docker save -o "$out_tar" "$TAG_VER"
gzip -f "$out_tar"

# 2) Busybox tar for connectivity/overlay warmup (prefer local template; fall back to docker save)
BB_SRC="$TEMPL_DIR/images/busybox.tar"
if [[ -f "$BB_SRC" ]]; then
  cp "$BB_SRC" "$STAGE/images/busybox.tar"
else
  if docker image inspect busybox:latest >/dev/null 2>&1 || docker pull busybox:latest >/dev/null 2>&1; then
    docker save -o "$STAGE/images/busybox.tar" busybox:latest
    log "Included busybox from local docker daemon"
  else
    warn "busybox image not found and cannot pull; skipping busybox.tar"
  fi
fi

# 3) Compose + env template and docs/scripts from templates
cp "$TEMPL_DIR/compose/docker-compose.yml" "$STAGE/compose/docker-compose.yml"
ENV_EX="$STAGE/compose/.env.example"
cat >"$ENV_EX" <<EOF
# Generated by make_client_gpu_package.sh
PKG_VERSION=$VERSION

NODE_GPU_BUNDLE_IMAGE_TAG=${REPO}:${VERSION}

# Compose project name (isolation from the server stack)
COMPOSE_PROJECT_NAME=argus-client

# Required (no defaults). Must be filled in before install.
AGENT_ENV=
AGENT_USER=
AGENT_INSTANCE=
GPU_NODE_HOSTNAME=

# Overlay network (should match the server package overlay)
ARGUS_OVERLAY_NET=argus-sys-net

# From cluster-info.env (server package output)
SWARM_MANAGER_ADDR=
SWARM_JOIN_TOKEN_WORKER=
SWARM_JOIN_TOKEN_MANAGER=
EOF

# 4) Docs from deployment_new templates
CLIENT_DOC_SRC="$TEMPL_DIR/docs"
if [[ -d "$CLIENT_DOC_SRC" ]]; then
  rsync -a "$CLIENT_DOC_SRC/" "$STAGE/docs/" >/dev/null 2>&1 || cp -r "$CLIENT_DOC_SRC/." "$STAGE/docs/"
fi

# Placeholder scripts (will be implemented in M2)
cat >"$STAGE/scripts/README.md" <<'EOF'
# Client-GPU Scripts (Placeholder)

This directory will be populated in M2:
- config.sh / install.sh

Currently a placeholder so the package structure can be reviewed.
EOF

# 5) Scripts (from deployment_new templates) and private skeleton
SCRIPTS_SRC="$TEMPL_DIR/scripts"
if [[ -d "$SCRIPTS_SRC" ]]; then
  rsync -a "$SCRIPTS_SRC/" "$STAGE/scripts/" >/dev/null 2>&1 || cp -r "$SCRIPTS_SRC/." "$STAGE/scripts/"
  find "$STAGE/scripts" -type f -name '*.sh' -exec chmod +x {} + 2>/dev/null || true
fi
mkdir -p "$STAGE/private/argus/agent"

# 6) Manifest & checksums
gen_manifest "$STAGE" "$STAGE/manifest.txt"
checksum_dir "$STAGE" "$STAGE/checksums.txt"

# 7) Move to the artifact dir and pack
mkdir -p "$PKG_DIR"
rsync -a "$STAGE/" "$PKG_DIR/" >/dev/null 2>&1 || cp -r "$STAGE/." "$PKG_DIR/"

OUT_TAR_DIR="$(dirname "$PKG_DIR")"
OUT_TAR="$OUT_TAR_DIR/client_gpu_${VERSION}.tar.gz"
log "Creating tarball: $OUT_TAR"
(cd "$PKG_DIR/.." && tar -czf "$OUT_TAR" "$(basename "$PKG_DIR")")
log "Client-GPU package ready: $PKG_DIR"
echo "$OUT_TAR"
@@ -1,160 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

# Make server deployment package (versioned, per-image tars, full compose, docs, skeleton)

ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
TEMPL_DIR="$ROOT_DIR/deployment_new/templates/server"
ART_ROOT="$ROOT_DIR/deployment_new/artifact/server"

# Use deployment_new local common helpers
COMMON_SH="$ROOT_DIR/deployment_new/build/common.sh"
. "$COMMON_SH"

usage(){ cat <<EOF
Build Server Deployment Package (deployment_new)

Usage: $(basename "$0") --version YYYYMMDD

Outputs: deployment_new/artifact/server/<YYYYMMDD>/ and server_YYYYMMDD.tar.gz
EOF
}

VERSION=""
while [[ $# -gt 0 ]]; do
  case "$1" in
    --version) VERSION="$2"; shift 2;;
    -h|--help) usage; exit 0;;
    *) err "unknown arg: $1"; usage; exit 1;;
  esac
done
if [[ -z "$VERSION" ]]; then VERSION="$(today_version)"; fi

require_cmd docker tar gzip awk sed

IMAGES=(
  argus-master
  argus-elasticsearch
  argus-kibana
  argus-metric-prometheus
  argus-metric-grafana
  argus-alertmanager
  argus-web-frontend
  argus-web-proxy
)

STAGE="$(mktemp -d)"; trap 'rm -rf "$STAGE"' EXIT
PKG_DIR="$ART_ROOT/$VERSION"
mkdir -p "$PKG_DIR" "$STAGE/images" "$STAGE/compose" "$STAGE/docs" "$STAGE/scripts" "$STAGE/private/argus"

# 1) Save per-image tars with the version tag
log "Tagging and saving images (version=$VERSION)"
for repo in "${IMAGES[@]}"; do
  if ! docker image inspect "$repo:latest" >/dev/null 2>&1 && ! docker image inspect "$repo:$VERSION" >/dev/null 2>&1; then
    err "missing image: $repo (need :latest or :$VERSION)"; exit 1; fi
  if docker image inspect "$repo:$VERSION" >/dev/null 2>&1; then
    tag="$repo:$VERSION"
  else
    docker tag "$repo:latest" "$repo:$VERSION"
    tag="$repo:$VERSION"
  fi
  out_tar="$STAGE/images/${repo//\//-}-$VERSION.tar"
  docker save -o "$out_tar" "$tag"
  gzip -f "$out_tar"
done

# 2) Compose + env template
cp "$TEMPL_DIR/compose/docker-compose.yml" "$STAGE/compose/docker-compose.yml"
ENV_EX="$STAGE/compose/.env.example"
cat >"$ENV_EX" <<EOF
# Generated by make_server_package.sh
PKG_VERSION=$VERSION

# Image tags (can be overridden). Default to versioned tags
MASTER_IMAGE_TAG=argus-master:
ES_IMAGE_TAG=argus-elasticsearch:
KIBANA_IMAGE_TAG=argus-kibana:
PROM_IMAGE_TAG=argus-metric-prometheus:
GRAFANA_IMAGE_TAG=argus-metric-grafana:
ALERT_IMAGE_TAG=argus-alertmanager:
FRONT_IMAGE_TAG=argus-web-frontend:
WEB_PROXY_IMAGE_TAG=argus-web-proxy:
EOF
sed -i "s#:\$#:${VERSION}#g" "$ENV_EX"

# Ports and defaults (based on swarm_tests .env.example)
cat >>"$ENV_EX" <<'EOF'

# Host ports for server compose
MASTER_PORT=32300
ES_HTTP_PORT=9200
KIBANA_PORT=5601
PROMETHEUS_PORT=9090
GRAFANA_PORT=3000
ALERTMANAGER_PORT=9093
WEB_PROXY_PORT_8080=8080
WEB_PROXY_PORT_8081=8081
WEB_PROXY_PORT_8082=8082
WEB_PROXY_PORT_8083=8083
WEB_PROXY_PORT_8084=8084
WEB_PROXY_PORT_8085=8085

# Overlay network name
ARGUS_OVERLAY_NET=argus-sys-net

# UID/GID for volume ownership
ARGUS_BUILD_UID=2133
ARGUS_BUILD_GID=2015

# Compose project name (isolation from other stacks on the same host)
COMPOSE_PROJECT_NAME=argus-server
EOF

# 3) Docs (from deployment_new templates)
DOCS_SRC="$TEMPL_DIR/docs"
if [[ -d "$DOCS_SRC" ]]; then
  rsync -a "$DOCS_SRC/" "$STAGE/docs/" >/dev/null 2>&1 || cp -r "$DOCS_SRC/." "$STAGE/docs/"
fi

# 4) Scripts (from deployment_new templates)
SCRIPTS_SRC="$TEMPL_DIR/scripts"
if [[ -d "$SCRIPTS_SRC" ]]; then
  rsync -a "$SCRIPTS_SRC/" "$STAGE/scripts/" >/dev/null 2>&1 || cp -r "$SCRIPTS_SRC/." "$STAGE/scripts/"
  find "$STAGE/scripts" -type f -name '*.sh' -exec chmod +x {} + 2>/dev/null || true
fi

# 5) Private skeleton (minimum)
mkdir -p \
  "$STAGE/private/argus/etc" \
  "$STAGE/private/argus/master" \
  "$STAGE/private/argus/metric/prometheus" \
  "$STAGE/private/argus/metric/prometheus/data" \
  "$STAGE/private/argus/metric/prometheus/rules" \
  "$STAGE/private/argus/metric/prometheus/targets" \
  "$STAGE/private/argus/metric/grafana" \
  "$STAGE/private/argus/metric/grafana/data" \
  "$STAGE/private/argus/metric/grafana/logs" \
  "$STAGE/private/argus/metric/grafana/plugins" \
  "$STAGE/private/argus/metric/grafana/provisioning/datasources" \
  "$STAGE/private/argus/metric/grafana/provisioning/dashboards" \
  "$STAGE/private/argus/metric/grafana/data/sessions" \
  "$STAGE/private/argus/metric/grafana/data/dashboards" \
  "$STAGE/private/argus/metric/grafana/config" \
  "$STAGE/private/argus/alert/alertmanager" \
  "$STAGE/private/argus/log/elasticsearch" \
  "$STAGE/private/argus/log/kibana"

# 6) Manifest & checksums
gen_manifest "$STAGE" "$STAGE/manifest.txt"
checksum_dir "$STAGE" "$STAGE/checksums.txt"

# 7) Move to the artifact dir and pack
mkdir -p "$PKG_DIR"
rsync -a "$STAGE/" "$PKG_DIR/" >/dev/null 2>&1 || cp -r "$STAGE/." "$PKG_DIR/"

OUT_TAR_DIR="$(dirname "$PKG_DIR")"
OUT_TAR="$OUT_TAR_DIR/server_${VERSION}.tar.gz"
log "Creating tarball: $OUT_TAR"
(cd "$PKG_DIR/.." && tar -czf "$OUT_TAR" "$(basename "$PKG_DIR")")
log "Server package ready: $PKG_DIR"
echo "$OUT_TAR"
@@ -1,38 +0,0 @@
version: "3.8"

networks:
  argus-sys-net:
    external: true

services:
  metric-gpu-node:
    image: ${NODE_GPU_BUNDLE_IMAGE_TAG:-argus-sys-metric-test-node-bundle-gpu:${PKG_VERSION}}
    container_name: argus-metric-gpu-node-swarm
    hostname: ${GPU_NODE_HOSTNAME}
    restart: unless-stopped
    privileged: true
    runtime: nvidia
    environment:
      - TZ=Asia/Shanghai
      - DEBIAN_FRONTEND=noninteractive
      - MASTER_ENDPOINT=${MASTER_ENDPOINT:-http://master.argus.com:3000}
      # Fluent Bit / log shipping target (fixed domain names)
      - ES_HOST=es.log.argus.com
      - ES_PORT=9200
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
      - AGENT_ENV=${AGENT_ENV}
      - AGENT_USER=${AGENT_USER}
      - AGENT_INSTANCE=${AGENT_INSTANCE}
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - GPU_MODE=gpu
    networks:
      argus-sys-net:
        aliases:
          - ${AGENT_INSTANCE}.node.argus.com
    volumes:
      - ../private/argus/agent:/private/argus/agent
      - ../logs/infer:/logs/infer
      - ../logs/train:/logs/train
    command: ["sleep", "infinity"]
@@ -1,73 +0,0 @@
# Argus Client-GPU Installation Guide (deployment_new)

## 1. Prerequisites (confirm before starting)
- The GPU node has the NVIDIA driver installed and `nvidia-smi` works;
- Docker & Docker Compose v2 are installed;
- Use the shared account `argus` (UID=2133, GID=2015) for the installation, and add it to the `docker` group (skip if it already exists):
```bash
sudo groupadd --gid 2015 argus || true
sudo useradd --uid 2133 --gid 2015 --create-home --shell /bin/bash argus || true
sudo passwd argus
sudo usermod -aG docker argus
su - argus -c 'id; docker ps >/dev/null && echo OK || echo NO_DOCKER_PERMISSION'
```
All subsequent unpacking and execution (config/install/uninstall) is done as the `argus` account.
- Obtain `cluster-info.env` from whoever installed the Server (it contains `SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_*`; under the compose architecture, BINDIP/FTPIP are no longer used). A hypothetical example is shown below.
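
A hypothetical `cluster-info.env` for illustration only; the real file comes from the server package output:

```bash
# cluster-info.env (values are placeholders)
SWARM_MANAGER_ADDR=10.0.0.10
SWARM_JOIN_TOKEN_WORKER=SWMTKN-1-<worker-token>
SWARM_JOIN_TOKEN_MANAGER=SWMTKN-1-<manager-token>
```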

## 2. Unpack
- `tar -xzf client_gpu_YYYYMMDD.tar.gz`
- Enter the directory: `cd client_gpu_YYYYMMDD/`
- You should see: `images/` (GPU bundle, busybox), `compose/`, `scripts/`, `docs/`.

## 3. Configure (warm up the overlay + generate .env)
Commands:
```
cp /path/to/cluster-info.env ./   # or: export CLUSTER_INFO=/abs/path/cluster-info.env
./scripts/config.sh
```
What the script does:
- Reads `cluster-info.env` and runs `docker swarm join` (idempotent);
- Automatically warms up the external overlay `argus-sys-net` with busybox, waiting up to 60 s until it is visible on this machine;
- Generates/updates `compose/.env`: fills in `SWARM_*` while preserving any `AGENT_*` and `GPU_NODE_HOSTNAME` values you have already set (it does not overwrite them).

What success looks like:
- Terminal output similar to: `warmed up overlay=argus-sys-net and generated compose/.env; run scripts/install.sh next`;
- `compose/.env` contains at least:
  - `AGENT_ENV/AGENT_USER/AGENT_INSTANCE/GPU_NODE_HOSTNAME` (you must fill these in beforehand);
  - `SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_*`;
  - `NODE_GPU_BUNDLE_IMAGE_TAG=...:YYYYMMDD`.

### Log mapping (important)
- `/logs/infer` and `/logs/train` inside the container are mapped to `./logs/infer` and `./logs/train` at the package root:
  - You can watch inference/training logs directly on the host: `tail -f logs/infer/*.log`, `tail -f logs/train/*.log`;
  - The install script creates both directories automatically.

If required fields are reported missing:
- Open `compose/.env`, fill in `AGENT_*` and `GPU_NODE_HOSTNAME` as prompted, then run `./scripts/config.sh` again (the script never overwrites values you have already set).

## 4. Install (load images + start the container + follow logs)
Command:
```
./scripts/install.sh
```
What the script does:
- Warms up the overlay first if necessary;
- Loads `argus-sys-metric-test-node-bundle-gpu-*.tar.gz` from `images/` into local Docker;
- Starts the GPU node container with `docker compose up -d` and automatically runs `docker logs -f argus-metric-gpu-node-swarm` to follow the installation.

What success looks like:
- The log shows: `[BOOT] local bundle install OK: version=...` / `dcgm-exporter ... listening` / `node state present: /private/argus/agent/<hostname>/node.json`;
- `docker exec argus-metric-gpu-node-swarm nvidia-smi -L` lists the GPUs;
- In the server-side Prometheus `/api/v1/targets`, at least one of the GPU node's 9100 (node-exporter) and 9400 (dcgm-exporter) targets is up, which can be checked as sketched below.
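
A sketch of that check from the server side, assuming Prometheus is reachable on its default port (adjust host/port to your deployment):

```bash
# List GPU-node exporter targets and their health (9100 = node-exporter, 9400 = dcgm-exporter)
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.instance) \(.health)"' \
  | grep -E ':(9100|9400) '
```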

## 5. Uninstall
Command:
```
./scripts/uninstall.sh
```
Behavior: compose down (if a .env exists), and removal of the warmup container and the node container.

## 6. FAQ
- "Overlay not visible on this machine": the config/install steps warm it up automatically; if it still fails, check network connectivity to the manager and whether `argus-sys-net` has been created on the manager.
- "busybox missing": make sure `images/busybox.tar` exists at the package root, or that the host already has `busybox:latest`.
- "Joining the Swarm fails": confirm `SWARM_MANAGER_ADDR` and `SWARM_JOIN_TOKEN_WORKER` in `cluster-info.env` are correct, or regenerate on the manager with `docker swarm join-token -q worker` and update the file.

Binary file not shown.
@ -1,90 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_EX="$PKG_ROOT/compose/.env.example"
ENV_OUT="$PKG_ROOT/compose/.env"

info(){ echo -e "\033[34m[CONFIG-GPU]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "missing dependency: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
require_compose(){
  if docker compose version >/dev/null 2>&1; then return 0; fi
  if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
  err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
}
require docker curl jq awk sed tar gzip
require_compose

# Disk space check (MB)
check_disk(){ local p="$1"; local need=10240; local free
  free=$(df -Pm "$p" | awk 'NR==2{print $4+0}')
  if [[ -z "$free" || "$free" -lt "$need" ]]; then err "insufficient disk space: $p has ${free:-0}MB free (<${need}MB)"; return 1; fi
}
check_disk "$PKG_ROOT"; check_disk "/var/lib/docker" || true

# Import cluster-info.env (defaults to the package root; override with CLUSTER_INFO)
CI_IN="${CLUSTER_INFO:-$PKG_ROOT/cluster-info.env}"
info "reading cluster-info.env: $CI_IN"
[[ -f "$CI_IN" ]] || { err "cluster-info.env not found (package root by default, or set CLUSTER_INFO to an absolute path)"; exit 1; }
set -a; source "$CI_IN"; set +a
[[ -n "${SWARM_MANAGER_ADDR:-}" && -n "${SWARM_JOIN_TOKEN_WORKER:-}" ]] || { err "cluster-info.env is missing SWARM info (SWARM_MANAGER_ADDR/SWARM_JOIN_TOKEN_WORKER)"; exit 1; }

# Join the Swarm (idempotent)
info "joining Swarm (idempotent): $SWARM_MANAGER_ADDR"
docker swarm join --token "$SWARM_JOIN_TOKEN_WORKER" "$SWARM_MANAGER_ADDR":2377 >/dev/null 2>&1 || true

# Import busybox, warm up the overlay and check connectivity (always performed)
NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
# Prepare busybox
if ! docker image inspect busybox:latest >/dev/null 2>&1; then
  if [[ -f "$PKG_ROOT/images/busybox.tar" ]]; then
    info "loading busybox.tar to warm up the overlay"
    docker load -i "$PKG_ROOT/images/busybox.tar" >/dev/null
  else
    err "busybox image missing (images/busybox.tar in the package, or local busybox:latest); cannot warm up overlay $NET_NAME"; exit 1
  fi
fi
# Warmup container (attaches to the overlay on the worker so it becomes visible locally)
docker rm -f argus-net-warmup >/dev/null 2>&1 || true
info "starting warmup container on overlay: $NET_NAME"
docker run -d --rm --name argus-net-warmup --network "$NET_NAME" busybox:latest sleep 600 >/dev/null 2>&1 || true
for i in {1..60}; do docker network inspect "$NET_NAME" >/dev/null 2>&1 && { info "overlay visible (t=${i}s)"; break; }; sleep 1; done
docker network inspect "$NET_NAME" >/dev/null 2>&1 || { err "overlay still not visible after warmup: $NET_NAME; confirm the manager created it and the network is reachable"; exit 1; }

# Test the actual data path through the warmup container (alias -> master)
if ! docker exec argus-net-warmup sh -lc "ping -c 1 -W 2 master.argus.com >/dev/null 2>&1"; then
  err "cannot reach master.argus.com by alias from the warmup container; confirm the server compose stack is up and joined to overlay $NET_NAME"
  exit 1
fi
info "master.argus.com reachable from the warmup container (Docker DNS + alias OK)"

# Generate/update .env (preserve manually entered keys; never overwrite existing ones)
if [[ ! -f "$ENV_OUT" ]]; then
  cp "$ENV_EX" "$ENV_OUT"
fi

set_kv(){ local k="$1" v="$2"; if grep -q "^${k}=" "$ENV_OUT"; then sed -i -E "s#^${k}=.*#${k}=${v}#" "$ENV_OUT"; else echo "${k}=${v}" >> "$ENV_OUT"; fi }
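# Usage sketch (hypothetical key/value): `set_kv AGENT_ENV prod` rewrites an existing
# "AGENT_ENV=..." line in compose/.env in place, or appends the pair if the key is absent.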

set_kv SWARM_MANAGER_ADDR "${SWARM_MANAGER_ADDR:-}"
set_kv SWARM_JOIN_TOKEN_WORKER "${SWARM_JOIN_TOKEN_WORKER:-}"
set_kv SWARM_JOIN_TOKEN_MANAGER "${SWARM_JOIN_TOKEN_MANAGER:-}"

REQ_VARS=(AGENT_ENV AGENT_USER AGENT_INSTANCE GPU_NODE_HOSTNAME)
missing=()
for v in "${REQ_VARS[@]}"; do
  val=$(grep -E "^$v=" "$ENV_OUT" | head -1 | cut -d= -f2-)
  if [[ -z "$val" ]]; then missing+=("$v"); fi
done
if [[ ${#missing[@]} -gt 0 ]]; then
  err "the following variables must be filled in compose/.env: ${missing[*]} (your existing values are preserved, never overwritten)"; exit 1; fi

info "compose/.env generated; next run scripts/install.sh"

# Prepare and chmod the host log directories (idempotent; allows manual inspection/pre-creation before install)
mkdir -p "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer"
chmod 1777 "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer" || true
info "log directory permissions (expected 1777, with sticky bit):"
stat -c '%a %U:%G %n' "$PKG_ROOT/logs/train" "$PKG_ROOT/logs/infer" 2>/dev/null || true
@ -1,72 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"

info(){ echo -e "\033[34m[INSTALL-GPU]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "missing dependency: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
require_compose(){
  if docker compose version >/dev/null 2>&1; then return 0; fi
  if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
  err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
}
require docker nvidia-smi
require_compose

[[ -f "$ENV_FILE" ]] || { err "compose/.env missing; run scripts/config.sh first"; exit 1; }
info "using env file: $ENV_FILE"

# Warm up the overlay (the warmup container may be gone if config ran long ago or was cleaned up)
set -a; source "$ENV_FILE"; set +a
NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
info "checking overlay network visibility: $NET_NAME"
if ! docker network inspect "$NET_NAME" >/dev/null 2>&1; then
  # If the overlay is not visible, warm it up with busybox (only to ensure this worker has joined the overlay)
  if ! docker image inspect busybox:latest >/dev/null 2>&1; then
    if [[ -f "$PKG_ROOT/images/busybox.tar" ]]; then docker load -i "$PKG_ROOT/images/busybox.tar"; else err "busybox image missing (images/busybox.tar or local busybox:latest)"; exit 1; fi
  fi
  docker rm -f argus-net-warmup >/dev/null 2>&1 || true
  docker run -d --rm --name argus-net-warmup --network "$NET_NAME" busybox:latest sleep 600 >/dev/null 2>&1 || true
  for i in {1..60}; do docker network inspect "$NET_NAME" >/dev/null 2>&1 && break; sleep 1; done
  docker network inspect "$NET_NAME" >/dev/null 2>&1 || { err "overlay still not visible after warmup: $NET_NAME; confirm the manager created it and the network is reachable"; exit 1; }
  info "overlay visible (warmup=argus-net-warmup)"
fi

# If a warmup container was (re)created here, test the alias data path once more
if docker ps --format '{{.Names}}' | grep -q '^argus-net-warmup$'; then
  if ! docker exec argus-net-warmup sh -lc "ping -c 1 -W 2 master.argus.com >/dev/null 2>&1"; then
    err "GPU install phase: cannot reach master.argus.com by alias from the warmup container; check overlay $NET_NAME and server status"
    exit 1
  fi
  info "GPU install phase: master.argus.com reachable from the warmup container"
fi

# Import the GPU bundle image
IMG_TGZ=$(ls -1 "$PKG_ROOT"/images/argus-sys-metric-test-node-bundle-gpu-*.tar.gz 2>/dev/null | head -1 || true)
[[ -n "$IMG_TGZ" ]] || { err "GPU bundle image tar.gz not found"; exit 1; }
info "importing GPU bundle image: $(basename "$IMG_TGZ")"
tmp=$(mktemp); gunzip -c "$IMG_TGZ" > "$tmp"; docker load -i "$tmp" >/dev/null; rm -f "$tmp"

# Ensure the host log directories exist (mapped to /logs/infer and /logs/train) with mode 1777 (sticky bit)
mkdir -p "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train"
chmod 1777 "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" || true
info "log directories prepared with mode 1777: logs/infer logs/train"
stat -c '%a %U:%G %n' "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" 2>/dev/null || true

# Start compose and follow logs
PROJECT="${COMPOSE_PROJECT_NAME:-argus-client}"
info "starting GPU node (docker compose -p $PROJECT up -d)"
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" up -d
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps

# Re-assert host log directory permissions in case scripts inside the container reset the bind mount
chmod 1777 "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" || true
stat -c '%a %U:%G %n' "$PKG_ROOT/logs/infer" "$PKG_ROOT/logs/train" 2>/dev/null || true

info "following node container logs (Ctrl+C to exit)"
docker logs -f argus-metric-gpu-node-swarm || true
@ -1,36 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"

# load COMPOSE_PROJECT_NAME if provided in compose/.env
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
PROJECT="${COMPOSE_PROJECT_NAME:-argus-client}"

info(){ echo -e "\033[34m[UNINSTALL-GPU]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
require_compose(){
  if docker compose version >/dev/null 2>&1; then return 0; fi
  if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
  err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
}
require_compose

if [[ -f "$ENV_FILE" ]]; then
  info "stopping compose project (project=$PROJECT)"
  docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" down --remove-orphans || true
else
  info "compose/.env not found; attempting to remove container by name"
fi

# remove warmup container if still running
docker rm -f argus-net-warmup >/dev/null 2>&1 || true

# remove node container if present
docker rm -f argus-metric-gpu-node-swarm >/dev/null 2>&1 || true

info "uninstall completed"
@ -1,169 +0,0 @@
version: "3.8"

networks:
  argus-sys-net:
    external: true

services:
  master:
    image: ${MASTER_IMAGE_TAG:-argus-master:${PKG_VERSION}}
    container_name: argus-master-sys
    environment:
      - OFFLINE_THRESHOLD_SECONDS=6
      - ONLINE_THRESHOLD_SECONDS=2
      - SCHEDULER_INTERVAL_SECONDS=1
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    ports:
      - "${MASTER_PORT:-32300}:3000"
    volumes:
      - ../private/argus/master:/private/argus/master
      - ../private/argus/metric/prometheus:/private/argus/metric/prometheus
      - ../private/argus/etc:/private/argus/etc
    networks:
      argus-sys-net:
        aliases:
          - master.argus.com
    restart: unless-stopped

  es:
    image: ${ES_IMAGE_TAG:-argus-elasticsearch:${PKG_VERSION}}
    container_name: argus-es-sys
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    volumes:
      - ../private/argus/log/elasticsearch:/private/argus/log/elasticsearch
      - ../private/argus/etc:/private/argus/etc
    ports:
      - "${ES_HTTP_PORT:-9200}:9200"
    restart: unless-stopped
    networks:
      argus-sys-net:
        aliases:
          - es.log.argus.com

  kibana:
    image: ${KIBANA_IMAGE_TAG:-argus-kibana:${PKG_VERSION}}
    container_name: argus-kibana-sys
    environment:
      - ELASTICSEARCH_HOSTS=http://es.log.argus.com:9200
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    volumes:
      - ../private/argus/log/kibana:/private/argus/log/kibana
      - ../private/argus/etc:/private/argus/etc
    depends_on: [es]
    ports:
      - "${KIBANA_PORT:-5601}:5601"
    restart: unless-stopped
    networks:
      argus-sys-net:
        aliases:
          - kibana.log.argus.com

  prometheus:
    image: ${PROM_IMAGE_TAG:-argus-metric-prometheus:${PKG_VERSION}}
    container_name: argus-prometheus
    restart: unless-stopped
    environment:
      - TZ=Asia/Shanghai
      - PROMETHEUS_BASE_PATH=/private/argus/metric/prometheus
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    ports:
      - "${PROMETHEUS_PORT:-9090}:9090"
    volumes:
      - ../private/argus/metric/prometheus:/private/argus/metric/prometheus
      - ../private/argus/etc:/private/argus/etc
    networks:
      argus-sys-net:
        aliases:
          - prom.metric.argus.com

  grafana:
    image: ${GRAFANA_IMAGE_TAG:-argus-metric-grafana:${PKG_VERSION}}
    container_name: argus-grafana
    restart: unless-stopped
    environment:
      - TZ=Asia/Shanghai
      - GRAFANA_BASE_PATH=/private/argus/metric/grafana
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
      - GF_SERVER_HTTP_PORT=3000
      - GF_LOG_LEVEL=warn
      - GF_LOG_MODE=console
      - GF_PATHS_PROVISIONING=/private/argus/metric/grafana/provisioning
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
    ports:
      - "${GRAFANA_PORT:-3000}:3000"
    volumes:
      - ../private/argus/metric/grafana:/private/argus/metric/grafana
      - ../private/argus/etc:/private/argus/etc
    depends_on: [prometheus]
    networks:
      argus-sys-net:
        aliases:
          - grafana.metric.argus.com

  alertmanager:
    image: ${ALERT_IMAGE_TAG:-argus-alertmanager:${PKG_VERSION}}
    container_name: argus-alertmanager
    environment:
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    volumes:
      - ../private/argus/etc:/private/argus/etc
      - ../private/argus/alert/alertmanager:/private/argus/alert/alertmanager
    networks:
      argus-sys-net:
        aliases:
          - alertmanager.alert.argus.com
    ports:
      - "${ALERTMANAGER_PORT:-9093}:9093"
    restart: unless-stopped

  web-frontend:
    image: ${FRONT_IMAGE_TAG:-argus-web-frontend:${PKG_VERSION}}
    container_name: argus-web-frontend
    environment:
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
      - EXTERNAL_MASTER_PORT=${WEB_PROXY_PORT_8085:-8085}
      - EXTERNAL_ALERTMANAGER_PORT=${WEB_PROXY_PORT_8084:-8084}
      - EXTERNAL_GRAFANA_PORT=${WEB_PROXY_PORT_8081:-8081}
      - EXTERNAL_PROMETHEUS_PORT=${WEB_PROXY_PORT_8082:-8082}
      - EXTERNAL_KIBANA_PORT=${WEB_PROXY_PORT_8083:-8083}
    volumes:
      - ../private/argus/etc:/private/argus/etc
    networks:
      argus-sys-net:
        aliases:
          - web.argus.com
    restart: unless-stopped

  web-proxy:
    image: ${WEB_PROXY_IMAGE_TAG:-argus-web-proxy:${PKG_VERSION}}
    container_name: argus-web-proxy
    depends_on: [master, grafana, prometheus, kibana, alertmanager]
    environment:
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    volumes:
      - ../private/argus/etc:/private/argus/etc
    networks:
      argus-sys-net:
        aliases:
          - proxy.argus.com
    ports:
      - "${WEB_PROXY_PORT_8080:-8080}:8080"
      - "${WEB_PROXY_PORT_8081:-8081}:8081"
      - "${WEB_PROXY_PORT_8082:-8082}:8082"
      - "${WEB_PROXY_PORT_8083:-8083}:8083"
      - "${WEB_PROXY_PORT_8084:-8084}:8084"
      - "${WEB_PROXY_PORT_8085:-8085}:8085"
    restart: unless-stopped
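# Validation sketch (run from the compose/ directory once config.sh has produced .env):
#   docker compose --env-file .env -f docker-compose.yml config --quiet && echo "compose file OK"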
@ -1,102 +0,0 @@
# Argus Server Installation Guide (deployment_new)

Scope: deploying the Argus server-side components as one stack via the Server package, on Docker Swarm with an external overlay network.

This guide focuses on what to do, what to look for, and what must hold before you continue.

## 1. Prerequisites (confirm before starting)
- Docker and Docker Compose v2 are installed; `docker info` works; `docker compose version` runs.
- You have root/sudo access; free disk space >= 10GB (package root and `/var/lib/docker`).
- You know this machine's management address (SWARM_MANAGER_ADDR): an IP bound to a local interface and reachable from other nodes.
- Important: perform the installation and all operations as the unified account `argus` (UID=2133, GID=2015), added to the `docker` group; example commands below (substitute your own UID/GID standard if it differs):
```bash
# 1) Create the primary group (GID=2015, name argus; skip if it already exists)
sudo groupadd --gid 2015 argus || true

# 2) Create user argus (UID=2133, primary group GID=2015, home directory, bash as the default shell; adjust with usermod if it exists)
sudo useradd --uid 2133 --gid 2015 --create-home --shell /bin/bash argus || true
sudo passwd argus

# 3) Add argus to the docker group so it can talk to the Docker daemon (takes effect on next login)
sudo usermod -aG docker argus

# 4) Verify (log in again or run newgrp docker for the group to take effect)
su - argus -c 'id; docker ps >/dev/null && echo OK || echo NO_DOCKER_PERMISSION'
```
All subsequent unpacking and script runs (config/install/selfcheck, etc.) are done as this `argus` account.

## 2. Unpacking and layout
- Extract: `tar -xzf server_YYYYMMDD.tar.gz`.
- Enter: `cd server_YYYYMMDD/`
- You should see:
  - `images/` (per-service image tar.gz files, e.g. `argus-master-YYYYMMDD.tar.gz`)
  - `compose/` (`docker-compose.yml` and `.env.example`)
  - `scripts/` (installation/operations scripts)
  - `private/argus/` (data and config skeleton)
  - `docs/` (documentation)

## 3. config (generate .env and SWARM_MANAGER_ADDR)
Command:
```
export SWARM_MANAGER_ADDR=<management IP of this host>
./scripts/config.sh
```
What the script does:
- Checks dependencies and disk space;
- Allocates all service ports automatically starting from port 20000, ensuring each is free on the system and that they do not collide with each other;
- Writes `compose/.env` (ports, image tags, overlay name, UID/GID, etc.);
- Writes the current account's UID/GID into `ARGUS_BUILD_UID/GID` (if the primary group resolves to docker, the GID of the group named after the user is used instead, to avoid picking up docker's GID 999);
- Updates/appends `SWARM_MANAGER_ADDR` in `cluster-info.env` (other keys are left untouched).

What success looks like:
- Terminal output: `compose/.env generated and SWARM_MANAGER_ADDR updated in cluster-info.env.`
- Opening `compose/.env` should show:
  - all ports >= 20000 with no duplicates;
  - `ARGUS_BUILD_UID/GID` matching `id -u` / `id -g`;
  - `SWARM_MANAGER_ADDR=<your IP>`.
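
A quick sanity check of the generated file (a sketch; run from the package root, and note the key-name pattern is an assumption about the `.env.example` layout):

```bash
# All allocated ports should be >= 20000 and unique
grep -E '^[A-Z_]+PORT[A-Z_0-9]*=' compose/.env | sort
# UID/GID should match the installing account
grep -E '^(ARGUS_BUILD_UID|ARGUS_BUILD_GID|SWARM_MANAGER_ADDR)=' compose/.env; id -u; id -g
```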

If something goes wrong:
- Ports unexpectedly occupied: delete `.env` and rerun `config.sh`, or edit the ports by hand and then run `install.sh`.

## 4. install (everything in one pass)
Command:
```
./scripts/install.sh
```
What the script does:
- If Swarm is not active: runs `docker swarm init --advertise-addr $SWARM_MANAGER_ADDR`;
- Ensures the external overlay `argus-sys-net` exists;
- Imports `images/*.tar.gz` into local Docker;
- Starts the services with `docker compose up -d`;
- Waits for the "six ready" checks:
  - Master `/readyz`=200, ES `/_cluster/health`=200, Prometheus TCP reachable, Grafana `/api/health`=200, Alertmanager `/api/v2/status`=200, Kibana `/api/status` level=available;
- Verifies Docker DNS + overlay aliases: from inside `argus-web-proxy`, checks connectivity to `master.argus.com`, `grafana.metric.argus.com`, etc. via `getent hosts` and `curl`;
- Writes `cluster-info.env` (with `SWARM_JOIN_TOKEN_{WORKER,MANAGER}/SWARM_MANAGER_ADDR`; BINDIP/FTPIP are no longer needed under the compose architecture);
- Generates an installation report `安装报告_YYYYMMDD-HHMMSS.md` (ports, health-check summary, hints).

What success looks like:
- `docker compose ps` shows everything Up;
- every HTTP check in `安装报告_…md` is 200/available;
- `cluster-info.env` contains the three key entries:
  - `SWARM_MANAGER_ADDR=...`
  - `SWARM_JOIN_TOKEN_WORKER=SWMTKN-...`
  - `SWARM_JOIN_TOKEN_MANAGER=SWMTKN-...`
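
The six readiness checks can also be reproduced by hand to see each endpoint directly (a sketch; the ports come from `compose/.env`):

```bash
source compose/.env
curl -s -o /dev/null -w 'master=%{http_code}\n'  "http://127.0.0.1:${MASTER_PORT}/readyz"
curl -s -o /dev/null -w 'es=%{http_code}\n'      "http://127.0.0.1:${ES_HTTP_PORT}/_cluster/health"
(exec 3<>/dev/tcp/127.0.0.1/${PROMETHEUS_PORT}) && echo prometheus=up
curl -s -o /dev/null -w 'grafana=%{http_code}\n' "http://127.0.0.1:${GRAFANA_PORT}/api/health"
curl -s -o /dev/null -w 'alert=%{http_code}\n'   "http://127.0.0.1:${ALERTMANAGER_PORT}/api/v2/status"
curl -s "http://127.0.0.1:${KIBANA_PORT}/api/status" | grep -o '"level":"available"'
```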

## 5. Health self-check and common operations
- Self-check: `./scripts/selfcheck.sh`
  - Expected output: `selfcheck OK -> logs/selfcheck.json`
  - In `logs/selfcheck.json`, `overlay_net/es/kibana/master_readyz/prometheus/grafana/alertmanager/web_proxy_cors` are all true.
- Status: `./scripts/status.sh` (equivalent to `docker compose ps`).
- Diagnostics: `./scripts/diagnose.sh` (collects container/HTTP/CORS/ES details into `logs/diagnose_*.log`).
- Uninstall: `./scripts/uninstall.sh` (compose down).
- Temporarily relax/restore the ES disk watermark: `./scripts/es-watermark-relax.sh` / `./scripts/es-watermark-restore.sh`.
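
For scripted gating on the self-check result, the JSON can be evaluated directly (a sketch, assuming `jq` is installed):

```bash
# Exit 0 only when every boolean flag in selfcheck.json is true
jq -e '[.overlay_net, .es, .kibana, .master_readyz, .prometheus, .grafana, .alertmanager, .web_proxy_cors] | all' \
  logs/selfcheck.json && echo HEALTHY || echo DEGRADED
```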

## 6. Next step: hand cluster-info.env to the Client side
- Copy `cluster-info.env` to whoever installs the Client;
- They place the file at the Client package root (or set `CLUSTER_INFO=/absolute/path`).

## 7. Quick troubleshooting
- Proxy 502 or connection reset on 8080: usually the overlay alias is not yet effective, or web-proxy cannot resolve the other services yet; rerun `install.sh` (it restarts the stack and re-verifies DNS inside the container), or inspect `logs/diagnose_error.log`.
- Kibana not available: wait 1-2 minutes and check the `argus-kibana-sys` logs;
- `SWARM_MANAGER_ADDR` empty in cluster-info.env: rerun `export SWARM_MANAGER_ADDR=<IP>; ./scripts/config.sh` or `./scripts/install.sh` (which reads `.env` back and fills it in).
@ -1,7 +0,0 @@
# Docker Swarm Deployment Notes

- Initialize Swarm: `docker swarm init --advertise-addr <SWARM_MANAGER_ADDR>`
- Create the overlay: `docker network create --driver overlay --attachable argus-sys-net`
- The Server package's `install.sh` performs both steps automatically; if you do them by hand, make sure `argus-sys-net` exists and is attachable.
- Worker nodes join with: `docker swarm join --token <worker_token> <SWARM_MANAGER_ADDR>:2377`.
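
Note that on a worker an attachable overlay only shows up locally once a container attaches to it, which is why the client scripts start a warmup container (a sketch of the same idea):

```bash
# Attach a throwaway container so the overlay becomes visible on this worker
docker run -d --rm --name argus-net-warmup --network argus-sys-net busybox:latest sleep 600
docker network inspect argus-sys-net --format '{{.Name}}: {{.Driver}}'
```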
@ -1,11 +0,0 @@
# Troubleshooting (Server)

- Port conflicts: see the port table in `安装报告_*.md`; to change a port, edit `compose/.env` and rerun `docker compose ... up -d`.
- A component not ready:
  - Master: `curl http://127.0.0.1:${MASTER_PORT}/readyz -I`
  - ES: `curl http://127.0.0.1:${ES_HTTP_PORT}/_cluster/health`
  - Grafana: `curl http://127.0.0.1:${GRAFANA_PORT}/api/health`
  - Prometheus TCP: `exec 3<>/dev/tcp/127.0.0.1/${PROMETHEUS_PORT}`
- DNS resolution: inside the `argus-web-proxy` or `argus-master-sys` container: `getent hosts master.argus.com`.
- Swarm/Overlay: check `docker network ls | grep argus-sys-net`, or `docker node ls`.
@ -1,108 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_EX="$PKG_ROOT/compose/.env.example"
ENV_OUT="$PKG_ROOT/compose/.env"

info(){ echo -e "\033[34m[CONFIG]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "missing dependency: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
require_compose(){
  if docker compose version >/dev/null 2>&1; then return 0; fi
  if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
  err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
}

require docker curl jq awk sed tar gzip
require_compose

# Disk space check (MB)
check_disk(){ local p="$1"; local need=10240; local free
  free=$(df -Pm "$p" | awk 'NR==2{print $4+0}')
  if [[ -z "$free" || "$free" -lt "$need" ]]; then err "insufficient disk space: $p has ${free:-0}MB free (<${need}MB)"; return 1; fi
}

check_disk "$PKG_ROOT"; check_disk "/var/lib/docker" || true

# Read/prompt for SWARM_MANAGER_ADDR
SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}
if [[ -z "${SWARM_MANAGER_ADDR}" ]]; then
  read -rp "Enter this host's management address SWARM_MANAGER_ADDR: " SWARM_MANAGER_ADDR
fi
info "SWARM_MANAGER_ADDR=$SWARM_MANAGER_ADDR"

# Verify the IP belongs to a local interface
if ! ip -o addr | awk '{print $4}' | cut -d'/' -f1 | grep -qx "$SWARM_MANAGER_ADDR"; then
  err "SWARM_MANAGER_ADDR is not a local address: $SWARM_MANAGER_ADDR"; exit 1; fi

info "allocating service ports (start=20000, avoiding used ports and mutual collisions)"
is_port_used(){ local p="$1"; ss -tulnH 2>/dev/null | awk '{print $5}' | sed 's/.*://g' | grep -qx "$p"; }
declare -A PRESENT=() CHOSEN=() USED=()
START_PORT="${START_PORT:-20000}"; cur=$START_PORT
ORDER=(MASTER_PORT ES_HTTP_PORT KIBANA_PORT PROMETHEUS_PORT GRAFANA_PORT ALERTMANAGER_PORT \
  WEB_PROXY_PORT_8080 WEB_PROXY_PORT_8081 WEB_PROXY_PORT_8082 WEB_PROXY_PORT_8083 WEB_PROXY_PORT_8084 WEB_PROXY_PORT_8085 \
  FTP_PORT FTP_DATA_PORT)

# Mark the keys actually present in .env.example
for key in "${ORDER[@]}"; do
  if grep -q "^${key}=" "$ENV_EX"; then PRESENT[$key]=1; fi
done

next_free(){ local p="$1"; while :; do if [[ -n "${USED[$p]:-}" ]] || is_port_used "$p"; then p=$((p+1)); else echo "$p"; return; fi; done; }
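# Example walk-through (hypothetical state): with START_PORT=20000, port 20000 already
# listening and 20001 free, `next_free 20000` prints 20001; the USED[] map prevents two
# keys in ORDER from being assigned the same port within a single run.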

for key in "${ORDER[@]}"; do
  [[ -z "${PRESENT[$key]:-}" ]] && continue
  p=$(next_free "$cur"); CHOSEN[$key]="$p"; USED[$p]=1; cur=$((p+1))
done

info "port allocation: MASTER=${CHOSEN[MASTER_PORT]:-} ES=${CHOSEN[ES_HTTP_PORT]:-} KIBANA=${CHOSEN[KIBANA_PORT]:-} PROM=${CHOSEN[PROMETHEUS_PORT]:-} GRAFANA=${CHOSEN[GRAFANA_PORT]:-} ALERT=${CHOSEN[ALERTMANAGER_PORT]:-} WEB_PROXY(8080..8085)=${CHOSEN[WEB_PROXY_PORT_8080]:-}/${CHOSEN[WEB_PROXY_PORT_8081]:-}/${CHOSEN[WEB_PROXY_PORT_8082]:-}/${CHOSEN[WEB_PROXY_PORT_8083]:-}/${CHOSEN[WEB_PROXY_PORT_8084]:-}/${CHOSEN[WEB_PROXY_PORT_8085]:-}"

cp "$ENV_EX" "$ENV_OUT"
# Write the de-duplicated port choices back
for key in "${ORDER[@]}"; do
  val="${CHOSEN[$key]:-}"
  [[ -z "$val" ]] && continue
  sed -i -E "s#^$key=.*#$key=${val}#" "$ENV_OUT"
done
info "port configuration written to compose/.env"
# Ensure the overlay name is present
grep -q '^ARGUS_OVERLAY_NET=' "$ENV_OUT" || echo 'ARGUS_OVERLAY_NET=argus-sys-net' >> "$ENV_OUT"
# Use the current account's UID/GID (avoiding the docker group by mistake)
RUID=$(id -u)
PRIMARY_GID=$(id -g)
PRIMARY_GRP=$(id -gn)
USER_NAME=$(id -un)
# If the primary group resolves to docker, prefer the GID of the group named after the user; otherwise keep the primary GID
if [[ "$PRIMARY_GRP" == "docker" ]]; then
  RGID=$(getent group "$USER_NAME" | awk -F: '{print $3}' 2>/dev/null || true)
  [[ -z "$RGID" ]] && RGID="$PRIMARY_GID"
else
  RGID="$PRIMARY_GID"
fi
info "using build account UID:GID=${RUID}:${RGID} (user=$USER_NAME primary_group=$PRIMARY_GRP)"
if grep -q '^ARGUS_BUILD_UID=' "$ENV_OUT"; then
  sed -i -E "s#^ARGUS_BUILD_UID=.*#ARGUS_BUILD_UID=${RUID}#" "$ENV_OUT"
else
  echo "ARGUS_BUILD_UID=${RUID}" >> "$ENV_OUT"
fi
if grep -q '^ARGUS_BUILD_GID=' "$ENV_OUT"; then
  sed -i -E "s#^ARGUS_BUILD_GID=.*#ARGUS_BUILD_GID=${RGID}#" "$ENV_OUT"
else
  echo "ARGUS_BUILD_GID=${RGID}" >> "$ENV_OUT"
fi

CI="$PKG_ROOT/cluster-info.env"
if [[ -f "$CI" ]]; then
  if grep -q '^SWARM_MANAGER_ADDR=' "$CI"; then
    sed -i -E "s#^SWARM_MANAGER_ADDR=.*#SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}#" "$CI"
  else
    echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}" >> "$CI"
  fi
else
  echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR}" > "$CI"
fi
info "compose/.env generated and SWARM_MANAGER_ADDR updated in cluster-info.env."
info "next step: scripts/install.sh"
@ -1,109 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"

ENV_FILE="$ROOT/compose/.env"; [[ -f "$ENV_FILE" ]] && set -a && source "$ENV_FILE" && set +a

ts="$(date -u +%Y%m%d-%H%M%SZ)"
LOG_DIR="$ROOT/logs"; mkdir -p "$LOG_DIR" || true
if ! ( : > "$LOG_DIR/.w" 2>/dev/null ); then LOG_DIR="/tmp/argus-logs"; mkdir -p "$LOG_DIR" || true; fi

# load compose project for accurate ps output
ENV_FILE="$ROOT/compose/.env"
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
DETAILS="$LOG_DIR/diagnose_details_${ts}.log"; ERRORS="$LOG_DIR/diagnose_error_${ts}.log"; : > "$DETAILS"; : > "$ERRORS"

logd() { echo "$(date '+%F %T') $*" >> "$DETAILS"; }
append_err() { echo "$*" >> "$ERRORS"; }
http_code() { curl -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
http_body_head() { curl -s --max-time 3 "$1" 2>/dev/null | sed -n '1,5p' || true; }
header_val() { curl -s -D - -o /dev/null "$@" | awk -F': ' 'BEGIN{IGNORECASE=1}$1=="Access-Control-Allow-Origin"{gsub("\r","",$2);print $2}'; }

section() { local name="$1"; logd "===== [$name] ====="; }
svc() {
  local svc_name="$1"; local cname="$2"; shift 2
  section "$svc_name ($cname)"
  logd "docker ps:"; docker ps -a --format '{{.Names}}\t{{.Status}}\t{{.Image}}' | awk -v n="$cname" '$1==n' >> "$DETAILS" || true
  logd "docker inspect:"; docker inspect -f '{{.State.Status}} rc={{.RestartCount}} started={{.State.StartedAt}}' "$cname" >> "$DETAILS" 2>&1 || true
  logd "last 200 container logs:"; docker logs --tail 200 "$cname" >> "$DETAILS" 2>&1 || true
  docker logs --tail 200 "$cname" 2>&1 | grep -Ei '\b(error|failed|fail|exception|panic|fatal|critical|unhealthy|permission denied|forbidden|refused|traceback|错误|失败)\b' | sed "s/^/[${svc_name}][container] /" >> "$ERRORS" || true
  if docker exec "$cname" sh -lc 'command -v supervisorctl >/dev/null 2>&1' >/dev/null 2>&1; then
    logd "supervisorctl status:"; docker exec "$cname" sh -lc 'supervisorctl status' >> "$DETAILS" 2>&1 || true
    local files; files=$(docker exec "$cname" sh -lc 'ls /var/log/supervisor/*.log 2>/dev/null' || true)
    for f in $files; do
      logd "tail -n 80 $f:"; docker exec "$cname" sh -lc "tail -n 80 $f 2>/dev/null || true" >> "$DETAILS" 2>&1 || true
      docker exec "$cname" sh -lc "tail -n 200 $f 2>/dev/null" 2>/dev/null | grep -Ei '\b(error|failed|fail|exception|panic|fatal|critical|unhealthy|permission denied|forbidden|refused|traceback|错误|失败)\b' | sed "s/^/[${svc_name}][supervisor:$(basename $f)] /" >> "$ERRORS" || true
    done
  fi
}

svc master argus-master-sys
svc es argus-es-sys
svc kibana argus-kibana-sys
svc prometheus argus-prometheus
svc grafana argus-grafana
svc alertmanager argus-alertmanager
svc web-frontend argus-web-frontend
svc web-proxy argus-web-proxy

section HTTP
logd "ES: $(http_code "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health")"; http_body_head "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" >> "$DETAILS" 2>&1 || true
logd "Kibana: $(http_code "http://localhost:${KIBANA_PORT:-5601}/api/status")"; http_body_head "http://localhost:${KIBANA_PORT:-5601}/api/status" >> "$DETAILS" 2>&1 || true
logd "Master readyz: $(http_code "http://localhost:${MASTER_PORT:-32300}/readyz")"
logd "Prometheus: $(http_code "http://localhost:${PROMETHEUS_PORT:-9090}/-/ready")"
logd "Grafana: $(http_code "http://localhost:${GRAFANA_PORT:-3000}/api/health")"; http_body_head "http://localhost:${GRAFANA_PORT:-3000}/api/health" >> "$DETAILS" 2>&1 || true
logd "Alertmanager: $(http_code "http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status")"
cors8084=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8084:-8084}/api/v2/status" || true)
cors8085=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8085:-8085}/api/v1/master/nodes" || true)
logd "Web-Proxy 8080: $(http_code "http://localhost:${WEB_PROXY_PORT_8080:-8080}/")"
logd "Web-Proxy 8083: $(http_code "http://localhost:${WEB_PROXY_PORT_8083:-8083}/")"
logd "Web-Proxy 8084 CORS: ${cors8084}"
logd "Web-Proxy 8085 CORS: ${cors8085}"

section ES-CHECKS
ch=$(curl -s --max-time 3 "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" || true)
status=$(printf '%s' "$ch" | awk -F'"' '/"status"/{print $4; exit}')
if [[ -n "$status" ]]; then logd "cluster.status=$status"; fi
if [[ "$status" != "green" ]]; then append_err "[es][cluster] status=$status"; fi
if docker ps --format '{{.Names}}' | grep -q '^argus-es-sys$'; then
  duse=$(docker exec argus-es-sys sh -lc 'df -P /usr/share/elasticsearch/data | awk "NR==2{print \$5}"' 2>/dev/null || true)
  logd "es.data.df_use=$duse"; usep=${duse%%%}
  if [[ -n "$usep" ]] && (( usep >= 90 )); then append_err "[es][disk] data path usage=${usep}%"; fi
fi

section DNS-IN-PROXY
for d in master.argus.com es.log.argus.com kibana.log.argus.com grafana.metric.argus.com prom.metric.argus.com alertmanager.alert.argus.com; do
  docker exec argus-web-proxy sh -lc "getent hosts $d || nslookup $d 2>/dev/null | tail -n+1" >> "$DETAILS" 2>&1 || true
done
logd "HTTP (web-proxy): master.readyz=$(docker exec argus-web-proxy sh -lc "curl -s -o /dev/null -w '%{http_code}' http://master.argus.com:3000/readyz" 2>/dev/null || echo 000)"
logd "HTTP (web-proxy): es.health=$(docker exec argus-web-proxy sh -lc "curl -s -o /dev/null -w '%{http_code}' http://es.log.argus.com:9200/_cluster/health" 2>/dev/null || echo 000)"
logd "HTTP (web-proxy): kibana.status=$(docker exec argus-web-proxy sh -lc "curl -s -o /dev/null -w '%{http_code}' http://kibana.log.argus.com:5601/api/status" 2>/dev/null || echo 000)"

section SYSTEM
logd "uname -a:"; uname -a >> "$DETAILS"
logd "docker version:"; docker version --format '{{.Server.Version}}' >> "$DETAILS" 2>&1 || true
logd "compose ps (project=$PROJECT):"; (cd "$ROOT/compose" && docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f docker-compose.yml ps) >> "$DETAILS" 2>&1 || true

section SUMMARY
[[ $(http_code "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health") != 200 ]] && echo "[es][http] health not 200" >> "$ERRORS"
kbcode=$(http_code "http://localhost:${KIBANA_PORT:-5601}/api/status"); [[ "$kbcode" != 200 ]] && echo "[kibana][http] /api/status=$kbcode" >> "$ERRORS"
[[ $(http_code "http://localhost:${MASTER_PORT:-32300}/readyz") != 200 ]] && echo "[master][http] /readyz not 200" >> "$ERRORS"
[[ $(http_code "http://localhost:${PROMETHEUS_PORT:-9090}/-/ready") != 200 ]] && echo "[prometheus][http] /-/ready not 200" >> "$ERRORS"
gfcode=$(http_code "http://localhost:${GRAFANA_PORT:-3000}/api/health"); [[ "$gfcode" != 200 ]] && echo "[grafana][http] /api/health=$gfcode" >> "$ERRORS"
[[ $(http_code "http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status") != 200 ]] && echo "[alertmanager][http] /api/v2/status not 200" >> "$ERRORS"
[[ -z "$cors8084" ]] && echo "[web-proxy][cors] 8084 missing Access-Control-Allow-Origin" >> "$ERRORS"
[[ -z "$cors8085" ]] && echo "[web-proxy][cors] 8085 missing Access-Control-Allow-Origin" >> "$ERRORS"
sort -u -o "$ERRORS" "$ERRORS"

echo "Diagnostic details -> $DETAILS"
echo "Detected errors -> $ERRORS"

if [[ "$LOG_DIR" == "$ROOT/logs" ]]; then
  ln -sfn "$(basename "$DETAILS")" "$ROOT/logs/diagnose_details.log" 2>/dev/null || cp "$DETAILS" "$ROOT/logs/diagnose_details.log" 2>/dev/null || true
  ln -sfn "$(basename "$ERRORS")" "$ROOT/logs/diagnose_error.log" 2>/dev/null || cp "$ERRORS" "$ROOT/logs/diagnose_error.log" 2>/dev/null || true
fi

exit 0
@ -1,11 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
HOST="${1:-http://127.0.0.1:9200}"
echo "setting ES watermarks to 95%/96%/97%: $HOST"
curl -fsS -XPUT "$HOST/_cluster/settings" -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "95%",
    "cluster.routing.allocation.disk.watermark.high": "96%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}' && echo -e "\nOK"
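# Verification sketch: the applied transient settings can be read back with
#   curl -s "$HOST/_cluster/settings?flat_settings=true" | grep watermark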
@ -1,11 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
HOST="${1:-http://127.0.0.1:9200}"
echo "restoring ES watermarks to defaults: $HOST"
curl -fsS -XPUT "$HOST/_cluster/settings" -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null
  }
}' && echo -e "\nOK"
@ -1,137 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"

info(){ echo -e "\033[34m[INSTALL]\033[0m $*"; }
err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
require(){ local ok=1; for c in "$@"; do command -v "$c" >/dev/null 2>&1 || { err "missing dependency: $c"; ok=0; }; done; [[ $ok -eq 1 ]]; }
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
require_compose(){
  if docker compose version >/dev/null 2>&1; then return 0; fi
  if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
  err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
}
require docker curl jq awk sed tar gzip
require_compose

[[ -f "$ENV_FILE" ]] || { err "compose/.env missing; run scripts/config.sh first"; exit 1; }
info "using env file: $ENV_FILE"
set -a; source "$ENV_FILE"; set +a
# Compatibility: if .env lacks SWARM_MANAGER_ADDR, read it from an existing cluster-info.env to avoid writing it out empty
SMADDR="${SWARM_MANAGER_ADDR:-}"
CI_FILE="$PKG_ROOT/cluster-info.env"
if [[ -z "$SMADDR" && -f "$CI_FILE" ]]; then
  SMADDR=$(sed -n 's/^SWARM_MANAGER_ADDR=\(.*\)$/\1/p' "$CI_FILE" | head -n1)
fi
SWARM_MANAGER_ADDR="$SMADDR"

# Swarm init & overlay
if ! docker info 2>/dev/null | grep -q "Swarm: active"; then
  [[ -n "${SWARM_MANAGER_ADDR:-}" ]] || { err "SWARM_MANAGER_ADDR not set; configure it via scripts/config.sh"; exit 1; }
  info "initializing Swarm (--advertise-addr $SWARM_MANAGER_ADDR)"
  docker swarm init --advertise-addr "$SWARM_MANAGER_ADDR" >/dev/null 2>&1 || true
else
  info "Swarm already active"
fi
NET_NAME="${ARGUS_OVERLAY_NET:-argus-sys-net}"
if ! docker network inspect "$NET_NAME" >/dev/null 2>&1; then
  info "creating overlay network: $NET_NAME"
  docker network create -d overlay --attachable "$NET_NAME" >/dev/null
else
  info "overlay network already exists: $NET_NAME"
fi

# Load images
IMAGES_DIR="$PKG_ROOT/images"
shopt -s nullglob
tars=("$IMAGES_DIR"/*.tar.gz)
if [[ ${#tars[@]} -eq 0 ]]; then err "images directory is empty; image tar.gz files missing"; exit 1; fi
total=${#tars[@]}; idx=0
for tgz in "${tars[@]}"; do
  idx=$((idx+1))
  info "importing image ($idx/$total): $(basename "$tgz")"
  tmp=$(mktemp); gunzip -c "$tgz" > "$tmp"; docker load -i "$tmp" >/dev/null; rm -f "$tmp"
done
shopt -u nullglob

# Compose up
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
info "starting the service stack (docker compose -p $PROJECT up -d)"
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" up -d
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps

# Wait readiness (best-effort)
code(){ curl -4 -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
prom_ok(){ (exec 3<>/dev/tcp/127.0.0.1/${PROMETHEUS_PORT:-9090}) >/dev/null 2>&1 && return 0 || return 1; }
kb_ok(){ local body; body=$(curl -s "http://127.0.0.1:${KIBANA_PORT:-5601}/api/status" || true); echo "$body" | grep -q '"level"\s*:\s*"available"'; }
RETRIES=${RETRIES:-60}; SLEEP=${SLEEP:-5}; ok=0
info "waiting for base services to become ready (<= $((RETRIES*SLEEP))s)"
for i in $(seq 1 "$RETRIES"); do
  e1=$(code "http://127.0.0.1:${MASTER_PORT:-32300}/readyz")
  e2=$(code "http://127.0.0.1:${ES_HTTP_PORT:-9200}/_cluster/health")
  e3=000; prom_ok && e3=200
  e4=$(code "http://127.0.0.1:${GRAFANA_PORT:-3000}/api/health")
  e5=$(code "http://127.0.0.1:${ALERTMANAGER_PORT:-9093}/api/v2/status")
  e6=$(kb_ok && echo 200 || echo 000)
  info "[ready] t=$((i*SLEEP))s master=$e1 es=$e2 prom=$e3 graf=$e4 alert=$e5 kibana=$e6"
  [[ "$e1" == 200 ]] && ok=$((ok+1))
  [[ "$e2" == 200 ]] && ok=$((ok+1))
  [[ "$e3" == 200 ]] && ok=$((ok+1))
  [[ "$e4" == 200 ]] && ok=$((ok+1))
  [[ "$e5" == 200 ]] && ok=$((ok+1))
  [[ "$e6" == 200 ]] && ok=$((ok+1))
  if [[ $ok -ge 6 ]]; then break; fi; ok=0; sleep "$SLEEP"
done
[[ $ok -ge 6 ]] || err "some services are not ready yet (retry selfcheck later)"

# Swarm join tokens
TOKEN_WORKER=$(docker swarm join-token -q worker 2>/dev/null || echo "")
TOKEN_MANAGER=$(docker swarm join-token -q manager 2>/dev/null || echo "")

# cluster-info.env (BINDIP/FTPIP are no longer needed in the compose setup)
CI="$PKG_ROOT/cluster-info.env"
info "writing cluster-info.env (manager/token)"
{
  echo "SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}"
  echo "SWARM_JOIN_TOKEN_WORKER=${TOKEN_WORKER:-}"
  echo "SWARM_JOIN_TOKEN_MANAGER=${TOKEN_MANAGER:-}"
} > "$CI"
info "wrote $CI"

# Installation report
ts=$(date +%Y%m%d-%H%M%S)
RPT="$PKG_ROOT/安装报告_${ts}.md"
{
  echo "# Argus Server installation report (${ts})"
  echo
  echo "## Port mappings"
  echo "- MASTER_PORT=${MASTER_PORT}"
  echo "- ES_HTTP_PORT=${ES_HTTP_PORT}"
  echo "- KIBANA_PORT=${KIBANA_PORT}"
  echo "- PROMETHEUS_PORT=${PROMETHEUS_PORT}"
  echo "- GRAFANA_PORT=${GRAFANA_PORT}"
  echo "- ALERTMANAGER_PORT=${ALERTMANAGER_PORT}"
  echo "- WEB_PROXY_PORT_8080=${WEB_PROXY_PORT_8080} ... 8085=${WEB_PROXY_PORT_8085}"
  echo
  echo "## Swarm/Overlay"
  echo "- SWARM_MANAGER_ADDR=${SWARM_MANAGER_ADDR:-}"
  echo "- NET=${NET_NAME}"
  echo "- JOIN_TOKEN_WORKER=${TOKEN_WORKER:-}"
  echo "- JOIN_TOKEN_MANAGER=${TOKEN_MANAGER:-}"
  echo
  echo "## Health checks (summary)"
  echo "- master/readyz=$(code http://127.0.0.1:${MASTER_PORT:-32300}/readyz)"
  echo "- es/_cluster/health=$(code http://127.0.0.1:${ES_HTTP_PORT:-9200}/_cluster/health)"
  echo "- grafana/api/health=$(code http://127.0.0.1:${GRAFANA_PORT:-3000}/api/health)"
  echo "- prometheus/tcp=$([[ $(prom_ok; echo $?) == 0 ]] && echo 200 || echo 000)"
  echo "- alertmanager/api/v2/status=$(code http://127.0.0.1:${ALERTMANAGER_PORT:-9093}/api/v2/status)"
  echo "- kibana/api/status=$([[ $(kb_ok; echo $?) == 0 ]] && echo available || echo not-ready)"
} > "$RPT"
info "report generated: $RPT"

info "installation complete. Hand cluster-info.env to the Client-GPU installer."
docker exec argus-web-proxy nginx -t >/dev/null 2>&1 && docker exec argus-web-proxy nginx -s reload >/dev/null 2>&1 || true
@ -1,83 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"

log() { echo -e "\033[0;34m[CHECK]\033[0m $*"; }
err() { echo -e "\033[0;31m[ERROR]\033[0m $*" >&2; }

ENV_FILE="$ROOT/compose/.env"; [[ -f "$ENV_FILE" ]] && set -a && source "$ENV_FILE" && set +a

wait_http() { local url="$1"; local attempts=${2:-120}; local i=1; while ((i<=attempts)); do curl -fsS "$url" >/dev/null 2>&1 && return 0; echo "[..] waiting $url ($i/$attempts)"; sleep 5; ((i++)); done; return 1; }
code_for() { curl -s -o /dev/null -w "%{http_code}" "$1" || echo 000; }
header_val() { curl -s -D - -o /dev/null "$@" | awk -F': ' 'BEGIN{IGNORECASE=1}$1=="Access-Control-Allow-Origin"{gsub("\r","",$2);print $2}'; }

LOG_DIR="$ROOT/logs"; mkdir -p "$LOG_DIR" || true
OUT_JSON="$LOG_DIR/selfcheck.json"; tmp=$(mktemp)

ok=1

log "checking overlay network"
net_ok=false
if docker network inspect "${ARGUS_OVERLAY_NET:-argus-sys-net}" >/dev/null 2>&1; then
  if docker network inspect "${ARGUS_OVERLAY_NET:-argus-sys-net}" | grep -q '"Driver": "overlay"'; then net_ok=true; fi
fi
[[ "$net_ok" == true ]] || ok=0

log "checking Elasticsearch"
wait_http "http://localhost:${ES_HTTP_PORT:-9200}/_cluster/health" 60 || ok=0

log "checking Kibana"
kb_code=$(code_for "http://localhost:${KIBANA_PORT:-5601}/api/status")
kb_ok=false
if [[ "$kb_code" == 200 ]]; then
  body=$(curl -sS "http://localhost:${KIBANA_PORT:-5601}/api/status" || true)
  echo "$body" | grep -q '"level"\s*:\s*"available"' && kb_ok=true
fi
[[ "$kb_ok" == true ]] || ok=0

log "checking Master"
[[ $(code_for "http://localhost:${MASTER_PORT:-32300}/readyz") == 200 ]] || ok=0

log "checking Prometheus"
wait_http "http://localhost:${PROMETHEUS_PORT:-9090}/-/ready" 60 || ok=0

log "checking Grafana"
gf_code=$(code_for "http://localhost:${GRAFANA_PORT:-3000}/api/health")
gf_ok=false; if [[ "$gf_code" == 200 ]]; then body=$(curl -sS "http://localhost:${GRAFANA_PORT:-3000}/api/health" || true); echo "$body" | grep -q '"database"\s*:\s*"ok"' && gf_ok=true; fi
[[ "$gf_ok" == true ]] || ok=0

log "checking Alertmanager"
wait_http "http://localhost:${ALERTMANAGER_PORT:-9093}/api/v2/status" 60 || ok=0

log "checking Web-Proxy (CORS)"
cors8084=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8084:-8084}/api/v2/status" || true)
cors8085=$(header_val -H "Origin: http://localhost:${WEB_PROXY_PORT_8080:-8080}" "http://localhost:${WEB_PROXY_PORT_8085:-8085}/api/v1/master/nodes" || true)
wp_ok=true
[[ -n "$cors8084" && -n "$cors8085" ]] || wp_ok=false
[[ "$wp_ok" == true ]] || ok=0

cat > "$tmp" <<JSON
{
  "overlay_net": $net_ok,
  "es": true,
  "kibana": $kb_ok,
  "master_readyz": true,
  "prometheus": true,
  "grafana": $gf_ok,
  "alertmanager": true,
  "web_proxy_cors": $wp_ok,
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
JSON

mv "$tmp" "$OUT_JSON" 2>/dev/null || cp "$tmp" "$OUT_JSON"

if [[ "$ok" == 1 ]]; then
  log "selfcheck OK -> $OUT_JSON"
  exit 0
else
  err "selfcheck FAILED -> $OUT_JSON"
  exit 1
fi
@ -1,9 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" ps
@ -1,23 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PKG_ROOT="$ROOT_DIR"
ENV_FILE="$PKG_ROOT/compose/.env"
COMPOSE_FILE="$PKG_ROOT/compose/docker-compose.yml"

# load COMPOSE_PROJECT_NAME from env file if present
if [[ -f "$ENV_FILE" ]]; then set -a; source "$ENV_FILE"; set +a; fi
PROJECT="${COMPOSE_PROJECT_NAME:-argus-server}"

err(){ echo -e "\033[31m[ERROR]\033[0m $*" >&2; }
# Compose detection: prefer docker compose (v2), fall back to docker-compose (v1)
require_compose(){
  if docker compose version >/dev/null 2>&1; then return 0; fi
  if command -v docker-compose >/dev/null 2>&1 && docker-compose version >/dev/null 2>&1; then return 0; fi
  err "Docker Compose not found; install docker compose v2 or docker-compose v1"; exit 1
}
require_compose

echo "[UNINSTALL] stopping compose (project=$PROJECT)"
docker compose -p "$PROJECT" --env-file "$ENV_FILE" -f "$COMPOSE_FILE" down --remove-orphans || true
echo "[UNINSTALL] done"
Binary file not shown.
@ -37,11 +37,22 @@ _argus_is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

_argus_read_user_from_files() {
  local uid_out_var="$1" gid_out_var="$2"; shift 2
  local uid_val="$ARGUS_BUILD_UID_DEFAULT" gid_val="$ARGUS_BUILD_GID_DEFAULT"
  local config
  for config in "$@"; do
load_build_user() {
  if [[ "$_ARGUS_BUILD_USER_LOADED" == "1" ]]; then
    return 0
  fi

  local project_root config_files config uid gid
  project_root="$(argus_project_root)"
  config_files=(
    "$project_root/configs/build_user.local.conf"
    "$project_root/configs/build_user.conf"
  )

  uid="$ARGUS_BUILD_UID_DEFAULT"
  gid="$ARGUS_BUILD_GID_DEFAULT"

  for config in "${config_files[@]}"; do
    if [[ -f "$config" ]]; then
      while IFS= read -r raw_line || [[ -n "$raw_line" ]]; do
        local line key value
@ -57,58 +68,42 @@ _argus_read_user_from_files() {
        key="$(_argus_trim "$key")"
        value="$(_argus_trim "$value")"
        case "$key" in
          UID) uid_val="$value" ;;
          GID) gid_val="$value" ;;
          *) echo "[ARGUS build_user] Unknown key '$key' in $config" >&2 ;;
          UID)
            uid="$value"
            ;;
          GID)
            gid="$value"
            ;;
          *)
            echo "[ARGUS build_user] Unknown key '$key' in $config" >&2
            ;;
        esac
      done < "$config"
      break
    fi
  done
  printf -v "$uid_out_var" '%s' "$uid_val"
  printf -v "$gid_out_var" '%s' "$gid_val"
}

load_build_user_profile() {
  local profile="${1:-default}"
  if [[ "$_ARGUS_BUILD_USER_LOADED" == "1" ]]; then
    return 0
  if [[ -n "${ARGUS_BUILD_UID:-}" ]]; then
    uid="$ARGUS_BUILD_UID"
  fi
  if [[ -n "${ARGUS_BUILD_GID:-}" ]]; then
    gid="$ARGUS_BUILD_GID"
  fi
  local project_root uid gid
  project_root="$(argus_project_root)"
  case "$profile" in
    pkg)
      _argus_read_user_from_files uid gid \
        "$project_root/configs/build_user.pkg.conf" \
        "$project_root/configs/build_user.local.conf" \
        "$project_root/configs/build_user.conf"
      ;;
    default|*)
      _argus_read_user_from_files uid gid \
        "$project_root/configs/build_user.local.conf" \
        "$project_root/configs/build_user.conf"
      ;;
  esac

  if [[ -n "${ARGUS_BUILD_UID:-}" ]]; then uid="$ARGUS_BUILD_UID"; fi
  if [[ -n "${ARGUS_BUILD_GID:-}" ]]; then gid="$ARGUS_BUILD_GID"; fi

  if ! _argus_is_number "$uid"; then
    echo "[ARGUS build_user] Invalid UID '$uid'" >&2; return 1
    echo "[ARGUS build_user] Invalid UID '$uid'" >&2
    return 1
  fi
  if ! _argus_is_number "$gid"; then
    echo "[ARGUS build_user] Invalid GID '$gid'" >&2; return 1
    echo "[ARGUS build_user] Invalid GID '$gid'" >&2
    return 1
  fi

  export ARGUS_BUILD_UID="$uid"
  export ARGUS_BUILD_GID="$gid"
  _ARGUS_BUILD_USER_LOADED=1
}

load_build_user() {
  local profile="${ARGUS_BUILD_PROFILE:-default}"
  load_build_user_profile "$profile"
}

argus_build_user_args() {
  load_build_user
  printf '%s' "--build-arg ARGUS_BUILD_UID=${ARGUS_BUILD_UID} --build-arg ARGUS_BUILD_GID=${ARGUS_BUILD_GID}"
1
src/agent/.gitignore
vendored
1
src/agent/.gitignore
vendored
@ -3,4 +3,3 @@ build/
__pycache__/

.env
dist/
@ -34,18 +34,6 @@ The Agent no longer relies on a config file; all parameters are derived from environment variables and the hostname
| `MASTER_ENDPOINT` | yes | N/A | Master base address; `http://host:3000` or `host:3000` (the `http://` prefix is added automatically). |
| `REPORT_INTERVAL_SECONDS` | no | `60` | Status report interval in seconds. Must be a positive integer. |
| `AGENT_HOSTNAME` | no | `$(hostname)` | Overrides the in-container hostname, for testing or special naming needs. |
| `AGENT_ENV` | no | derived from hostname | Environment label (e.g. `dev`, `prod`). Must be set together with `AGENT_USER` and `AGENT_INSTANCE`. |
| `AGENT_USER` | no | derived from hostname | Owning user or team label. Must be set together with `AGENT_ENV` and `AGENT_INSTANCE`. |
| `AGENT_INSTANCE` | no | derived from hostname | Instance number or alias. Must be set together with `AGENT_ENV` and `AGENT_USER`. |

Resolution priority for hostname and metadata:

1. If `AGENT_ENV` / `AGENT_USER` / `AGENT_INSTANCE` are all set, use them directly.
2. Otherwise check a previous `node.json` (information returned by the Master after a successful registration); if it contains `env` / `user` / `instance`, reuse them.
3. Failing that, parse the `env-user-instance` prefix from the hostname, per the historical convention.
4. If a complete result still cannot be obtained, Agent startup fails with a message asking for the environment variables above.

> Tip: on first deployment, make sure the environment variables or the hostname provide complete information. After registration the Agent writes the metadata returned by the Master into `node.json`, so later restarts stay consistent without the environment variables.
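
A minimal first-boot sketch (hypothetical values; also assumes the agent entrypoint is invoked as `argus-agent`, which is not specified here):

```bash
export MASTER_ENDPOINT=master.argus.com:3000        # http:// is prepended automatically
export AGENT_ENV=dev AGENT_USER=alice AGENT_INSTANCE=node001
./argus-agent                                       # after registration, node.json carries the metadata
```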

Derived paths:

@ -4,7 +4,6 @@ import os
import re
import socket
import subprocess
import ipaddress
from pathlib import Path
from typing import Any, Dict
@ -17,50 +16,15 @@ _HOSTNAME_PATTERN = re.compile(r"^([^-]+)-([^-]+)-([^-]+)-.*$")


def collect_metadata(config: AgentConfig) -> Dict[str, Any]:
    """Collect the static information needed for node registration, with smarter IP selection.

    Priority (high to low):
    1) AGENT_PUBLISH_IP, if set;
    2) hostname A records (when they hit a preferred network range);
    3) interface scan: skip AGENT_EXCLUDE_IFACES, prefer AGENT_PREFER_NET_CIDRS;
    4) default-route fallback (the UDP socket trick).

    Also published: overlay_ip / gwbridge_ip / interfaces, for the Master and for diagnostics.
    """
    """Collect the static information needed for node registration."""
    hostname = config.hostname

    prefer_cidrs = _read_cidrs_env(
        os.environ.get("AGENT_PREFER_NET_CIDRS", "10.0.0.0/8,172.31.0.0/16")
    )
    exclude_ifaces = _read_csv_env(
        os.environ.get("AGENT_EXCLUDE_IFACES", "docker_gwbridge,lo")
    )

    # interface inventory
    interfaces = _list_global_ipv4_addrs()
    if exclude_ifaces:
        interfaces = [it for it in interfaces if it[0] not in set(exclude_ifaces)]

    # resolve hostname candidates
    host_ips = _resolve_hostname_ips(hostname)

    selected_ip, overlay_ip, gwbridge_ip = _select_publish_ips(
        interfaces=interfaces,
        host_ips=host_ips,
        prefer_cidrs=prefer_cidrs,
    )

    meta: Dict[str, Any] = {
    env, user, instance = _parse_hostname(hostname)
    meta = {
        "hostname": hostname,
        "ip": os.environ.get("AGENT_PUBLISH_IP", selected_ip),  # keep required field
        "overlay_ip": overlay_ip or selected_ip,
        "gwbridge_ip": gwbridge_ip,
        "interfaces": [
            {"iface": name, "ip": ip} for name, ip in interfaces
        ],
        "env": config.environment,
        "user": config.user,
        "instance": config.instance,
        "ip": _detect_ip_address(),
        "env": env,
        "user": user,
        "instance": instance,
        "cpu_number": _detect_cpu_count(),
        "memory_in_bytes": _detect_memory_bytes(),
        "gpu_number": _detect_gpu_count(),
@ -133,7 +97,7 @@ def _detect_gpu_count() -> int:


def _detect_ip_address() -> str:
    """Kept as the old interface, now the final fallback: default-route source address -> hostname resolution -> 127.0.0.1."""
    """Try to obtain the container's egress IP via a UDP socket; fall back to resolving the hostname."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.connect(("8.8.8.8", 80))
@ -145,118 +109,3 @@ def _detect_ip_address() -> str:
    except OSError:
        LOGGER.warning("Unable to resolve hostname to IP; defaulting to 127.0.0.1")
    return "127.0.0.1"


def _read_csv_env(raw: str | None) -> list[str]:
    if not raw:
        return []
    return [x.strip() for x in raw.split(",") if x.strip()]


def _read_cidrs_env(raw: str | None) -> list[ipaddress.IPv4Network]:
    cidrs: list[ipaddress.IPv4Network] = []
    for item in _read_csv_env(raw):
        try:
            net = ipaddress.ip_network(item, strict=False)
            if isinstance(net, (ipaddress.IPv4Network,)):
                cidrs.append(net)
        except ValueError:
            LOGGER.warning("Ignoring invalid CIDR in AGENT_PREFER_NET_CIDRS", extra={"cidr": item})
    return cidrs


def _list_global_ipv4_addrs() -> list[tuple[str, str]]:
    """List global IPv4 addresses as (iface, ip) pairs.

    Relies on iproute2: ip -4 -o addr show scope global
    """
    results: list[tuple[str, str]] = []
    try:
        proc = subprocess.run(
            ["sh", "-lc", "ip -4 -o addr show scope global | awk '{print $2, $4}'"],
            check=False,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            timeout=3,
        )
        if proc.returncode == 0:
            for line in proc.stdout.splitlines():
                line = line.strip()
                if not line:
                    continue
                parts = line.split()
                if len(parts) != 2:
                    continue
                iface, cidr = parts
                ip = cidr.split("/")[0]
                try:
                    ipaddress.IPv4Address(ip)
                except ValueError:
                    continue
                results.append((iface, ip))
    except Exception as exc:  # pragma: no cover - defensive
        LOGGER.debug("Failed to list interfaces", extra={"error": str(exc)})
    return results


def _resolve_hostname_ips(name: str) -> list[str]:
    ips: list[str] = []
    try:
        infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
        for info in infos:
            ip = info[4][0]
            if ip not in ips:
                ips.append(ip)
    except OSError:
        pass
    return ips


def _pick_by_cidrs(candidates: list[str], prefer_cidrs: list[ipaddress.IPv4Network]) -> str | None:
    for net in prefer_cidrs:
        for ip in candidates:
            try:
                if ipaddress.ip_address(ip) in net:
                    return ip
            except ValueError:
                continue
    return None
|
||||
|
||||
def _select_publish_ips(
|
||||
*,
|
||||
interfaces: list[tuple[str, str]],
|
||||
host_ips: list[str],
|
||||
prefer_cidrs: list[ipaddress.IPv4Network],
|
||||
) -> tuple[str, str | None, str | None]:
|
||||
"""返回 (selected_ip, overlay_ip, gwbridge_ip)。
|
||||
|
||||
- overlay_ip:优先命中 prefer_cidrs(10.0/8 先于 172.31/16)。
|
||||
- gwbridge_ip:若存在 172.22/16 则记录。
|
||||
- selected_ip:优先 AGENT_PUBLISH_IP;否则 overlay_ip;否则 hostname A 记录中的 prefer;否则默认路由回退。
|
||||
"""
|
||||
# detect gwbridge (172.22/16)
|
||||
gwbridge_net = ipaddress.ip_network("172.22.0.0/16")
|
||||
gwbridge_ip = None
|
||||
for _, ip in interfaces:
|
||||
try:
|
||||
if ipaddress.ip_address(ip) in gwbridge_net:
|
||||
gwbridge_ip = ip
|
||||
break
|
||||
except ValueError:
|
||||
continue
|
||||
|
||||
# overlay candidate from interfaces by prefer cidrs
|
||||
iface_ips = [ip for _, ip in interfaces]
|
||||
overlay_ip = _pick_by_cidrs(iface_ips, prefer_cidrs)
|
||||
|
||||
# hostname A records filtered by prefer cidrs
|
||||
host_pref = _pick_by_cidrs(host_ips, prefer_cidrs)
|
||||
|
||||
env_ip = os.environ.get("AGENT_PUBLISH_IP")
|
||||
if env_ip:
|
||||
selected = env_ip
|
||||
else:
|
||||
selected = overlay_ip or host_pref or _detect_ip_address()
|
||||
|
||||
return selected, overlay_ip, gwbridge_ip
|
||||
|
||||
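
For reference, a minimal standalone sketch of the selection order implemented by `_select_publish_ips` above (the `AGENT_PUBLISH_IP` override is omitted for brevity); the CIDRs and addresses below are invented sample data, not values taken from the repository:

```python
# Illustrative only: reruns the prefer-CIDR selection shown in the diff.
import ipaddress

prefer_cidrs = [ipaddress.ip_network("10.0.0.0/8"), ipaddress.ip_network("172.31.0.0/16")]
interfaces = [("eth0", "172.22.0.7"), ("eth1", "10.0.1.23")]   # hypothetical gwbridge + overlay
host_ips = ["10.0.1.23"]

def pick(candidates, cidrs):
    # first candidate falling inside the first matching preferred network wins
    for net in cidrs:
        for ip in candidates:
            if ipaddress.ip_address(ip) in net:
                return ip
    return None

overlay_ip = pick([ip for _, ip in interfaces], prefer_cidrs)                              # "10.0.1.23"
gwbridge_ip = pick([ip for _, ip in interfaces], [ipaddress.ip_network("172.22.0.0/16")])  # "172.22.0.7"
selected = overlay_ip or pick(host_ips, prefer_cidrs) or "127.0.0.1"                       # fallback chain
print(selected, overlay_ip, gwbridge_ip)
```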
@@ -6,21 +6,14 @@ from dataclasses import dataclass
from pathlib import Path
from typing import Final

from .state import load_node_state
from .version import VERSION
from .log import get_logger

DEFAULT_REPORT_INTERVAL_SECONDS: Final[int] = 60

LOGGER = get_logger("argus.agent.config")


@dataclass(frozen=True)
class AgentConfig:
    hostname: str
    environment: str
    user: str
    instance: str
    node_file: str
    version: str
    master_endpoint: str
@@ -54,68 +47,11 @@ def _resolve_hostname() -> str:
    return os.environ.get("AGENT_HOSTNAME") or socket.gethostname()


def _load_metadata_from_state(node_file: str) -> tuple[str, str, str] | None:
    state = load_node_state(node_file)
    if not state:
        return None

    meta = state.get("meta_data") or {}
    env = meta.get("env") or state.get("env")
    user = meta.get("user") or state.get("user")
    instance = meta.get("instance") or state.get("instance")

    if env and user and instance:
        LOGGER.debug("Metadata resolved from node state", extra={"node_file": node_file})
        return env, user, instance

    LOGGER.warning(
        "node.json missing metadata fields; ignoring",
        extra={"node_file": node_file, "meta_data": meta},
    )
    return None


def _resolve_metadata_fields(hostname: str, node_file: str) -> tuple[str, str, str]:
    env = os.environ.get("AGENT_ENV")
    user = os.environ.get("AGENT_USER")
    instance = os.environ.get("AGENT_INSTANCE")

    if env and user and instance:
        return env, user, instance

    if any([env, user, instance]):
        LOGGER.warning(
            "Incomplete metadata environment variables; falling back to persisted metadata",
            extra={
                "has_env": bool(env),
                "has_user": bool(user),
                "has_instance": bool(instance),
            },
        )

    state_metadata = _load_metadata_from_state(node_file)
    if state_metadata is not None:
        return state_metadata

    from .collector import _parse_hostname  # Local import to avoid circular dependency

    env, user, instance = _parse_hostname(hostname)

    if not all([env, user, instance]):
        raise ValueError(
            "Failed to determine metadata fields; set AGENT_ENV/USER/INSTANCE or use supported hostname pattern"
        )

    return env, user, instance


def load_config() -> AgentConfig:
    """Derive the configuration from environment variables; the external config-file dependency has been removed."""

    hostname = _resolve_hostname()
    node_file = f"/private/argus/agent/{hostname}/node.json"
    environment, user, instance = _resolve_metadata_fields(hostname, node_file)

    health_dir = f"/private/argus/agent/{hostname}/health/"

    master_endpoint_env = os.environ.get("MASTER_ENDPOINT")
@@ -130,9 +66,6 @@ def load_config() -> AgentConfig:

    return AgentConfig(
        hostname=hostname,
        environment=environment,
        user=user,
        instance=instance,
        node_file=node_file,
        version=VERSION,
        master_endpoint=master_endpoint,
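
The resolution order in `_resolve_metadata_fields` (environment variables → persisted node.json → hostname pattern) can be summarized in a self-contained sketch; `read_state` below is a hypothetical stand-in for `load_node_state`, and the hostname split mirrors what the tests later in this diff expect:

```python
import os

def resolve_metadata(hostname: str, read_state) -> tuple[str, str, str]:
    # 1) a complete AGENT_ENV/AGENT_USER/AGENT_INSTANCE triple wins outright
    env_vars = tuple(os.environ.get(k) for k in ("AGENT_ENV", "AGENT_USER", "AGENT_INSTANCE"))
    if all(env_vars):
        return env_vars
    # 2) otherwise fall back to metadata persisted in node.json
    state = read_state() or {}
    meta = state.get("meta_data") or {}
    persisted = tuple(meta.get(k) for k in ("env", "user", "instance"))
    if all(persisted):
        return persisted
    # 3) last resort: parse the env-user-instance prefix of the hostname
    parts = hostname.split("-")
    if len(parts) >= 3:
        return parts[0], parts[1], parts[2]
    raise ValueError("set AGENT_ENV/USER/INSTANCE or use a supported hostname pattern")
```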
BIN src/agent/dist/argus-agent vendored Executable file
Binary file not shown.
@@ -12,8 +12,6 @@ VENV_DIR="$BUILD_ROOT/venv"

AGENT_BUILD_IMAGE="${AGENT_BUILD_IMAGE:-python:3.11-slim-bullseye}"
AGENT_BUILD_USE_DOCKER="${AGENT_BUILD_USE_DOCKER:-1}"
# By default ignore proxies inside the container, so an intranet proxy that is unreachable from the Docker network cannot break pip (set to 0 to disable)
AGENT_BUILD_IGNORE_PROXY="${AGENT_BUILD_IGNORE_PROXY:-1}"
USED_DOCKER=0

run_host_build() {
@@ -73,7 +71,6 @@ run_docker_build() {
  pass_env_if_set http_proxy
  pass_env_if_set https_proxy
  pass_env_if_set no_proxy
  pass_env_if_set AGENT_BUILD_IGNORE_PROXY

  build_script=$(cat <<'INNER'
set -euo pipefail
@@ -85,10 +82,6 @@ rm -rf build dist
mkdir -p build/pyinstaller dist
python3 -m venv --copies build/venv
source build/venv/bin/activate
# If proxy-ignoring is requested, clear the common proxy and pip mirror variables to avoid an unreachable proxy inside the container
if [ "${AGENT_BUILD_IGNORE_PROXY:-1}" = "1" ]; then
  unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY PIP_INDEX_URL PIP_EXTRA_INDEX_URL PIP_TRUSTED_HOST
fi
pip install --upgrade pip
pip install .
pip install pyinstaller==6.6.0
@@ -60,36 +60,6 @@ services:
        ipv4_address: 172.28.0.20
    restart: always

  agent_env:
    image: ubuntu:22.04
    container_name: argus-agent-env-e2e
    hostname: host_abc
    depends_on:
      - master
      - bind
    environment:
      - MASTER_ENDPOINT=http://master.argus.com:3000
      - REPORT_INTERVAL_SECONDS=2
      - AGENT_ENV=prod
      - AGENT_USER=ml
      - AGENT_INSTANCE=node-3
      - AGENT_HOSTNAME=host_abc
      - "ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}"
      - "ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}"
    volumes:
      - ./private/argus/agent/host_abc:/private/argus/agent/host_abc
      - ./private/argus/agent/host_abc/health:/private/argus/agent/host_abc/health
      - ./private/argus/etc:/private/argus/etc
      - ../dist/argus-agent:/usr/local/bin/argus-agent:ro
      - ./scripts/agent_entrypoint.sh:/usr/local/bin/agent-entrypoint.sh:ro
      - ../scripts/agent_deployment_verify.sh:/usr/local/bin/agent_deployment_verify.sh:ro
    entrypoint:
      - /usr/local/bin/agent-entrypoint.sh
    networks:
      default:
        ipv4_address: 172.28.0.21
    restart: always

networks:
  default:
    driver: bridge
@@ -7,10 +7,10 @@ SCRIPTS=(
  "02_up.sh"
  "03_wait_and_assert_registration.sh"
  "04_write_health_files.sh"
  "05_verify_agent.sh"
  "06_assert_status_on_master.sh"
  "07_restart_agent_and_reregister.sh"
  "08_down.sh"
  "08_verify_agent.sh"
  "05_assert_status_on_master.sh"
  "06_restart_agent_and_reregister.sh"
  "07_down.sh"
)

for script in "${SCRIPTS[@]}"; do
@@ -41,7 +41,7 @@ compose() {
  fi
}

docker container rm -f argus-agent-e2e argus-agent-env-e2e argus-master-agent-e2e argus-bind-agent-e2e >/dev/null 2>&1 || true
docker container rm -f argus-agent-e2e argus-master-agent-e2e argus-bind-agent-e2e >/dev/null 2>&1 || true

docker network rm tests_default >/dev/null 2>&1 || true
@@ -6,14 +6,11 @@ TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
TMP_ROOT="$TEST_ROOT/tmp"
API_BASE="http://localhost:32300/api/v1/master"
AGENT_HOSTNAME="dev-e2euser-e2einst-pod-0"
ENV_AGENT_HOSTNAME="host_abc"
NODE_FILE="$TEST_ROOT/private/argus/agent/$AGENT_HOSTNAME/node.json"
ENV_NODE_FILE="$TEST_ROOT/private/argus/agent/$ENV_AGENT_HOSTNAME/node.json"

mkdir -p "$TMP_ROOT"

primary_node_id=""
env_node_id=""
node_id=""
for _ in {1..30}; do
  sleep 2
  response=$(curl -sS "$API_BASE/nodes" || true)
@@ -22,49 +19,24 @@ for _ in {1..30}; do
  fi
  list_file="$TMP_ROOT/nodes_list.json"
  echo "$response" > "$list_file"
  readarray -t node_ids < <(python3 - "$list_file" "$AGENT_HOSTNAME" "$ENV_AGENT_HOSTNAME" <<'PY'
  node_id=$(python3 - "$list_file" <<'PY'
import json, sys

with open(sys.argv[1]) as handle:
    nodes = json.load(handle)

target_primary = sys.argv[2]
target_env = sys.argv[3]

primary_id = ""
env_id = ""

for node in nodes:
    if node.get("name") == target_primary:
        primary_id = node.get("id", "")
    if node.get("name") == target_env:
        env_id = node.get("id", "")

print(primary_id)
print(env_id)
print(nodes[0]["id"] if nodes else "")
PY
  )

  primary_node_id="${node_ids[0]}"
  env_node_id="${node_ids[1]}"

  if [[ -n "$primary_node_id" && -n "$env_node_id" ]]; then
  )
  if [[ -n "$node_id" ]]; then
    break
  fi
done

if [[ -z "$primary_node_id" ]]; then
  echo "[ERROR] Primary agent did not register within timeout" >&2
if [[ -z "$node_id" ]]; then
  echo "[ERROR] Agent did not register within timeout" >&2
  exit 1
fi

if [[ -z "$env_node_id" ]]; then
  echo "[ERROR] Env-variable agent did not register within timeout" >&2
  exit 1
fi

echo "$primary_node_id" > "$TMP_ROOT/node_id"
echo "$env_node_id" > "$TMP_ROOT/node_id_host_abc"
echo "$node_id" > "$TMP_ROOT/node_id"

if [[ ! -f "$NODE_FILE" ]]; then
  echo "[ERROR] node.json not created at $NODE_FILE" >&2
@@ -78,20 +50,8 @@ with open(sys.argv[1]) as handle:
    assert "id" in node and node["id"], "node.json missing id"
PY

if [[ ! -f "$ENV_NODE_FILE" ]]; then
  echo "[ERROR] node.json not created at $ENV_NODE_FILE" >&2
  exit 1
fi

python3 - "$ENV_NODE_FILE" <<'PY'
import json, sys
with open(sys.argv[1]) as handle:
    node = json.load(handle)
assert "id" in node and node["id"], "env agent node.json missing id"
PY

detail_file="$TMP_ROOT/initial_detail.json"
curl -sS "$API_BASE/nodes/$primary_node_id" -o "$detail_file"
curl -sS "$API_BASE/nodes/$node_id" -o "$detail_file"
python3 - "$detail_file" "$TMP_ROOT/initial_ip" <<'PY'
import json, sys, pathlib
with open(sys.argv[1]) as handle:
@@ -102,5 +62,4 @@ if not ip:
pathlib.Path(sys.argv[2]).write_text(ip)
PY

echo "[INFO] Agent registered with node id $primary_node_id"
echo "[INFO] Env-variable agent registered with node id $env_node_id"
echo "[INFO] Agent registered with node id $node_id"
@@ -6,8 +6,6 @@ TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
TMP_ROOT="$TEST_ROOT/tmp"
API_BASE="http://localhost:32300/api/v1/master"
NODE_ID="$(cat "$TMP_ROOT/node_id")"
ENV_NODE_ID="$(cat "$TMP_ROOT/node_id_host_abc")"
ENV_HOSTNAME="host_abc"
NODES_JSON="$TEST_ROOT/private/argus/metric/prometheus/nodes.json"

success=false
@@ -43,36 +41,13 @@ if [[ ! -f "$NODES_JSON" ]]; then
  exit 1
fi

python3 - "$NODES_JSON" "$NODE_ID" "$ENV_NODE_ID" <<'PY'
python3 - "$NODES_JSON" <<'PY'
import json, sys
with open(sys.argv[1]) as handle:
    nodes = json.load(handle)

expected_primary = sys.argv[2]
expected_env = sys.argv[3]

ids = {entry.get("node_id") for entry in nodes}
assert expected_primary in ids, nodes
assert expected_env in ids, nodes
assert len(nodes) >= 2, nodes
assert len(nodes) == 1, nodes
entry = nodes[0]
assert entry["node_id"], entry
PY

echo "[INFO] Master reflects agent health and nodes.json entries"

env_detail_file="$TMP_ROOT/env_agent_detail.json"
curl -sS "$API_BASE/nodes/$ENV_NODE_ID" -o "$env_detail_file"
python3 - "$env_detail_file" "$ENV_HOSTNAME" <<'PY'
import json, sys
with open(sys.argv[1]) as handle:
    node = json.load(handle)

expected_name = sys.argv[2]

assert node.get("name") == expected_name, node
meta = node.get("meta_data", {})
assert meta.get("env") == "prod", meta
assert meta.get("user") == "ml", meta
assert meta.get("instance") == "node-3", meta
PY

echo "[INFO] Env-variable agent reports expected metadata"
@@ -1,60 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
REPO_ROOT="$(cd "$TEST_ROOT/.." && pwd)"
VERIFY_SCRIPT="$REPO_ROOT/scripts/agent_deployment_verify.sh"
ENV_NODE_ID_FILE="$TEST_ROOT/tmp/node_id_host_abc"
PRIMARY_CONTAINER="argus-agent-e2e"
ENV_CONTAINER="argus-agent-env-e2e"
PRIMARY_HOSTNAME="dev-e2euser-e2einst-pod-0"
ENV_HOSTNAME="host_abc"

if ! docker ps --format '{{.Names}}' | grep -q "^${PRIMARY_CONTAINER}$"; then
  echo "[WARN] agent container not running; skip verification"
  exit 0
fi

if docker exec -i "$PRIMARY_CONTAINER" bash -lc 'command -v curl >/dev/null && command -v jq >/dev/null'; then
  echo "[INFO] curl/jq already installed in agent container"
else
  echo "[INFO] Installing curl/jq in agent container"
  docker exec -i "$PRIMARY_CONTAINER" bash -lc 'apt-get update >/dev/null 2>&1 && apt-get install -y curl jq >/dev/null 2>&1' || true
fi

if [[ ! -f "$VERIFY_SCRIPT" ]]; then
  echo "[ERROR] Verification script missing at $VERIFY_SCRIPT" >&2
  exit 1
fi

run_verifier() {
  local container="$1" hostname="$2"

  if ! docker ps --format '{{.Names}}' | grep -q "^${container}$"; then
    echo "[WARN] container $container not running; skip"
    return
  fi

  if ! docker exec -i "$container" bash -lc 'command -v /usr/local/bin/agent_deployment_verify.sh >/dev/null'; then
    echo "[ERROR] /usr/local/bin/agent_deployment_verify.sh missing in $container" >&2
    exit 1
  fi

  echo "[INFO] Running verification for $hostname in $container"
  docker exec -i "$container" env VERIFY_HOSTNAME="$hostname" /usr/local/bin/agent_deployment_verify.sh
}

run_verifier "$PRIMARY_CONTAINER" "$PRIMARY_HOSTNAME"

if docker ps --format '{{.Names}}' | grep -q "^${ENV_CONTAINER}$"; then
  if docker exec -i "$ENV_CONTAINER" bash -lc 'command -v curl >/dev/null && command -v jq >/dev/null'; then
    echo "[INFO] curl/jq already installed in env agent container"
  else
    echo "[INFO] Installing curl/jq in env agent container"
    docker exec -i "$ENV_CONTAINER" bash -lc 'apt-get update >/dev/null 2>&1 && apt-get install -y curl jq >/dev/null 2>&1' || true
  fi
  run_verifier "$ENV_CONTAINER" "$ENV_HOSTNAME"
else
  echo "[WARN] env-driven agent container not running; skip secondary verification"
fi
@@ -6,20 +6,10 @@ TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
TMP_ROOT="$TEST_ROOT/tmp"
API_BASE="http://localhost:32300/api/v1/master"
NODE_ID="$(cat "$TMP_ROOT/node_id")"
ENV_NODE_ID_FILE="$TMP_ROOT/node_id_host_abc"
if [[ ! -f "$ENV_NODE_ID_FILE" ]]; then
  echo "[ERROR] Env agent node id file missing at $ENV_NODE_ID_FILE" >&2
  exit 1
fi

ENV_NODE_ID="$(cat "$ENV_NODE_ID_FILE")"
AGENT_HOSTNAME="dev-e2euser-e2einst-pod-0"
ENV_AGENT_HOSTNAME="host_abc"
NETWORK_NAME="tests_default"
NEW_AGENT_IP="172.28.0.200"
NEW_ENV_AGENT_IP="172.28.0.210"
ENTRYPOINT_SCRIPT="$SCRIPT_DIR/agent_entrypoint.sh"
VERIFY_SCRIPT="$TEST_ROOT/../scripts/agent_deployment_verify.sh"
ENV_FILE="$TEST_ROOT/.env"

# Note: the restart scenario needs the same entrypoint script so the DNS registration logic stays consistent
@@ -28,11 +18,6 @@ if [[ ! -f "$ENTRYPOINT_SCRIPT" ]]; then
  exit 1
fi

if [[ ! -f "$VERIFY_SCRIPT" ]]; then
  echo "[ERROR] agent verification script missing at $VERIFY_SCRIPT" >&2
  exit 1
fi

if [[ ! -f "$TMP_ROOT/agent_binary_path" ]]; then
  echo "[ERROR] Agent binary path missing; rerun bootstrap" >&2
  exit 1
@@ -89,37 +74,15 @@ if [[ "$prev_ip" != "$initial_ip" ]]; then
  exit 1
fi

env_before_file="$TMP_ROOT/env_before_restart.json"
curl -sS "$API_BASE/nodes/$ENV_NODE_ID" -o "$env_before_file"
env_prev_last_updated=$(python3 - "$env_before_file" <<'PY'
import json, sys
with open(sys.argv[1]) as handle:
    node = json.load(handle)
print(node.get("last_updated", ""))
PY
)
env_prev_ip=$(python3 - "$env_before_file" <<'PY'
import json, sys
with open(sys.argv[1]) as handle:
    node = json.load(handle)
print(node["meta_data"].get("ip", ""))
PY
)

pushd "$TEST_ROOT" >/dev/null
compose rm -sf agent
compose rm -sf agent_env
popd >/dev/null

docker container rm -f argus-agent-e2e >/dev/null 2>&1 || true
docker container rm -f argus-agent-env-e2e >/dev/null 2>&1 || true

AGENT_DIR="$TEST_ROOT/private/argus/agent/$AGENT_HOSTNAME"
HEALTH_DIR="$TEST_ROOT/private/argus/agent/$AGENT_HOSTNAME/health"

ENV_AGENT_DIR="$TEST_ROOT/private/argus/agent/$ENV_AGENT_HOSTNAME"
ENV_HEALTH_DIR="$TEST_ROOT/private/argus/agent/$ENV_AGENT_HOSTNAME/health"

# Start the container with sleep first so we control the network state at registration time
if ! docker run -d \
  --name argus-agent-e2e \
@@ -131,7 +94,6 @@ if ! docker run -d \
  -v "$TEST_ROOT/private/argus/etc:/private/argus/etc" \
  -v "$AGENT_BINARY:/usr/local/bin/argus-agent:ro" \
  -v "$ENTRYPOINT_SCRIPT:/usr/local/bin/agent-entrypoint.sh:ro" \
  -v "$VERIFY_SCRIPT:/usr/local/bin/agent_deployment_verify.sh:ro" \
  -e MASTER_ENDPOINT=http://master.argus.com:3000 \
  -e REPORT_INTERVAL_SECONDS=2 \
  -e ARGUS_BUILD_UID="$AGENT_UID" \
@@ -179,76 +141,3 @@ if [[ "$success" != true ]]; then
fi

echo "[INFO] Agent restart produced successful re-registration with IP change"

# ---- Restart env-driven agent without metadata environment variables ----

if [[ ! -d "$ENV_AGENT_DIR" ]]; then
  echo "[ERROR] Env agent data dir missing at $ENV_AGENT_DIR" >&2
  exit 1
fi

if [[ ! -d "$ENV_HEALTH_DIR" ]]; then
  mkdir -p "$ENV_HEALTH_DIR"
fi

if ! docker run -d \
  --name argus-agent-env-e2e \
  --hostname "$ENV_AGENT_HOSTNAME" \
  --network "$NETWORK_NAME" \
  --ip "$NEW_ENV_AGENT_IP" \
  -v "$ENV_AGENT_DIR:/private/argus/agent/$ENV_AGENT_HOSTNAME" \
  -v "$ENV_HEALTH_DIR:/private/argus/agent/$ENV_AGENT_HOSTNAME/health" \
  -v "$TEST_ROOT/private/argus/etc:/private/argus/etc" \
  -v "$AGENT_BINARY:/usr/local/bin/argus-agent:ro" \
  -v "$ENTRYPOINT_SCRIPT:/usr/local/bin/agent-entrypoint.sh:ro" \
  -v "$VERIFY_SCRIPT:/usr/local/bin/agent_deployment_verify.sh:ro" \
  -e MASTER_ENDPOINT=http://master.argus.com:3000 \
  -e REPORT_INTERVAL_SECONDS=2 \
  -e ARGUS_BUILD_UID="$AGENT_UID" \
  -e ARGUS_BUILD_GID="$AGENT_GID" \
  --entrypoint /usr/local/bin/agent-entrypoint.sh \
  ubuntu:22.04 >/dev/null; then
  echo "[ERROR] Failed to start env-driven agent container without metadata env" >&2
  exit 1
fi

env_success=false
env_detail_file="$TMP_ROOT/env_post_restart.json"
for _ in {1..20}; do
  sleep 3
  if ! curl -sS "$API_BASE/nodes/$ENV_NODE_ID" -o "$env_detail_file"; then
    continue
  fi
  if python3 - "$env_detail_file" "$env_prev_last_updated" "$ENV_NODE_ID" "$env_prev_ip" "$NEW_ENV_AGENT_IP" <<'PY'
import json, sys
with open(sys.argv[1]) as handle:
    node = json.load(handle)
prev_last_updated = sys.argv[2]
expected_id = sys.argv[3]
old_ip = sys.argv[4]
expected_ip = sys.argv[5]
last_updated = node.get("last_updated")
current_ip = node["meta_data"].get("ip")
meta = node.get("meta_data", {})
assert node["id"] == expected_id
if current_ip != expected_ip:
    raise SystemExit(1)
if current_ip == old_ip:
    raise SystemExit(1)
if not last_updated or last_updated == prev_last_updated:
    raise SystemExit(1)
if meta.get("env") != "prod" or meta.get("user") != "ml" or meta.get("instance") != "node-3":
    raise SystemExit(1)
PY
  then
    env_success=true
    break
  fi
done

if [[ "$env_success" != true ]]; then
  echo "[ERROR] Env-driven agent did not reuse persisted metadata after restart" >&2
  exit 1
fi

echo "[INFO] Env-driven agent restart succeeded with persisted metadata"
@@ -13,7 +13,7 @@ compose() {
  fi
}

docker container rm -f argus-agent-e2e argus-agent-env-e2e >/dev/null 2>&1 || true
docker container rm -f argus-agent-e2e >/dev/null 2>&1 || true

pushd "$TEST_ROOT" >/dev/null
compose down --remove-orphans
26 src/agent/tests/scripts/08_verify_agent.sh Executable file
@@ -0,0 +1,26 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
TEST_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
VERIFY_SCRIPT="$(cd "$TEST_ROOT/.." && pwd)/scripts/agent_deployment_verify.sh"

if ! docker ps --format '{{.Names}}' | grep -q '^argus-agent-e2e$'; then
  echo "[WARN] agent container not running; skip verification"
  exit 0
fi

if docker exec -i argus-agent-e2e bash -lc 'command -v curl >/dev/null && command -v jq >/dev/null'; then
  echo "[INFO] curl/jq already installed in agent container"
else
  echo "[INFO] Installing curl/jq in agent container"
  docker exec -i argus-agent-e2e bash -lc 'apt-get update >/dev/null 2>&1 && apt-get install -y curl jq >/dev/null 2>&1' || true
fi

if docker exec -i argus-agent-e2e bash -lc 'command -v /usr/local/bin/agent_deployment_verify.sh >/dev/null'; then
  docker exec -i argus-agent-e2e /usr/local/bin/agent_deployment_verify.sh
elif [[ -x "$VERIFY_SCRIPT" ]]; then
  docker exec -i argus-agent-e2e "$VERIFY_SCRIPT"
else
  echo "[WARN] agent_deployment_verify.sh not found"
fi
@@ -1,151 +0,0 @@
from __future__ import annotations

import os
import unittest
from contextlib import contextmanager
from unittest.mock import patch

from app.config import AgentConfig, load_config


@contextmanager
def temp_env(**overrides: str | None):
    originals: dict[str, str | None] = {}
    try:
        for key, value in overrides.items():
            originals[key] = os.environ.get(key)
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value
        yield
    finally:
        for key, original in originals.items():
            if original is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = original


class LoadConfigMetadataTests(unittest.TestCase):
    @patch("app.config.Path.mkdir")
    def test_metadata_from_environment_variables(self, mock_mkdir):
        with temp_env(
            MASTER_ENDPOINT="http://master.local",
            AGENT_HOSTNAME="dev-user-one-pod",
            AGENT_ENV="prod",
            AGENT_USER="ops",
            AGENT_INSTANCE="node-1",
        ):
            config = load_config()

        self.assertEqual(config.environment, "prod")
        self.assertEqual(config.user, "ops")
        self.assertEqual(config.instance, "node-1")
        mock_mkdir.assert_called()

    @patch("app.config.Path.mkdir")
    def test_metadata_falls_back_to_hostname(self, mock_mkdir):
        with temp_env(
            MASTER_ENDPOINT="http://master.local",
            AGENT_HOSTNAME="qa-team-abc-pod-2",
            AGENT_ENV=None,
            AGENT_USER=None,
            AGENT_INSTANCE=None,
        ):
            config = load_config()

        self.assertEqual(config.environment, "qa")
        self.assertEqual(config.user, "team")
        self.assertEqual(config.instance, "abc")
        mock_mkdir.assert_called()

    @patch("app.config._load_metadata_from_state", return_value=("prod", "ops", "node-1"))
    @patch("app.config.Path.mkdir")
    def test_metadata_from_node_state(self, mock_mkdir, mock_state):
        with temp_env(
            MASTER_ENDPOINT="http://master.local",
            AGENT_HOSTNAME="host_abc",
            AGENT_ENV=None,
            AGENT_USER=None,
            AGENT_INSTANCE=None,
        ):
            config = load_config()

        self.assertEqual(config.environment, "prod")
        self.assertEqual(config.user, "ops")
        self.assertEqual(config.instance, "node-1")
        mock_state.assert_called_once()
        mock_mkdir.assert_called()

    @patch("app.config.Path.mkdir")
    def test_partial_environment_variables_fallback(self, mock_mkdir):
        with temp_env(
            MASTER_ENDPOINT="http://master.local",
            AGENT_HOSTNAME="stage-ml-001-node",
            AGENT_ENV="prod",
            AGENT_USER=None,
            AGENT_INSTANCE=None,
        ):
            config = load_config()

        self.assertEqual(config.environment, "stage")
        self.assertEqual(config.user, "ml")
        self.assertEqual(config.instance, "001")
        mock_mkdir.assert_called()

    @patch("app.config.Path.mkdir")
    def test_invalid_hostname_raises_error(self, mock_mkdir):
        with temp_env(
            MASTER_ENDPOINT="http://master.local",
            AGENT_HOSTNAME="invalidhostname",
            AGENT_ENV=None,
            AGENT_USER=None,
            AGENT_INSTANCE=None,
        ):
            with self.assertRaises(ValueError):
                load_config()

        mock_mkdir.assert_not_called()


class CollectMetadataTests(unittest.TestCase):
    @patch("app.collector._detect_ip_address", return_value="127.0.0.1")
    @patch("app.collector._detect_gpu_count", return_value=0)
    @patch("app.collector._detect_memory_bytes", return_value=1024)
    @patch("app.collector._detect_cpu_count", return_value=8)
    def test_collect_metadata_uses_config_fields(
        self,
        mock_cpu,
        mock_memory,
        mock_gpu,
        mock_ip,
    ):
        config = AgentConfig(
            hostname="dev-user-001-pod",
            environment="prod",
            user="ops",
            instance="node-1",
            node_file="/tmp/node.json",
            version="1.0.0",
            master_endpoint="http://master.local",
            report_interval_seconds=60,
            health_dir="/tmp/health",
        )

        from app.collector import collect_metadata

        metadata = collect_metadata(config)

        self.assertEqual(metadata["env"], "prod")
        self.assertEqual(metadata["user"], "ops")
        self.assertEqual(metadata["instance"], "node-1")
        self.assertEqual(metadata["hostname"], "dev-user-001-pod")
        self.assertEqual(metadata["ip"], "127.0.0.1")
        self.assertEqual(metadata["cpu_number"], 8)
        self.assertEqual(metadata["memory_in_bytes"], 1024)
        self.assertEqual(metadata["gpu_number"], 0)


if __name__ == "__main__":
    unittest.main()
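
Judging from the expectations in the tests above ("qa-team-abc-pod-2" → qa/team/abc, "stage-ml-001-node" → stage/ml/001, "invalidhostname" → ValueError), `_parse_hostname` splits an `<env>-<user>-<instance>-…` prefix. A hedged reimplementation for illustration; the real function lives in `app.collector` and may differ:

```python
import re

_HOSTNAME_RE = re.compile(r"^(?P<env>[^-]+)-(?P<user>[^-]+)-(?P<instance>[^-]+)(?:-.*)?$")

def parse_hostname(hostname: str) -> tuple[str, str, str]:
    """Best-effort reconstruction of the env-user-instance hostname split."""
    m = _HOSTNAME_RE.match(hostname)
    if not m:
        return "", "", ""   # callers treat empty fields as failure
    return m.group("env"), m.group("user"), m.group("instance")

assert parse_hostname("qa-team-abc-pod-2") == ("qa", "team", "abc")
assert parse_hostname("stage-ml-001-node") == ("stage", "ml", "001")
assert parse_hostname("invalidhostname") == ("", "", "")
```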
@@ -1,31 +0,0 @@
# Alertmanager

## Build
1. First set the build/deployment environment variables; from the repository root run:
```bash
cp src/alert/tests/.env.example src/alert/tests/.env
```

Then edit the copied .env file and adjust the environment variables.

2. Build with the script; from the repository root run:

```bash
bash src/alert/alertmanager/build/build.sh
```

On success, argus-alertmanager-latest.tar is generated in the repository root.

## Deployment

Deployment is done via docker-compose. In the src/alert/tests directory:
```bash
docker-compose up -d
```

## Dynamic configuration
The configuration file lives at `/private/argus/alert/alertmanager/alertmanager.yml`; after editing alertmanager.yml, POST to the `http://alertmanager.alert.argus.com:9093/-/reload` endpoint to reload the configuration.

```bash
curl -X POST http://localhost:9093/-/reload
```
@@ -1,101 +0,0 @@
# Based on Ubuntu 24.04
FROM ubuntu:24.04

# Switch to the root user
USER root

# Install required dependencies
RUN apt-get update && \
    apt-get install -y wget supervisor net-tools inetutils-ping vim ca-certificates passwd && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Set the Alertmanager version (must match the bundled offline package)
ARG ALERTMANAGER_VERSION=0.28.1

# Build from the offline package shipped in the repository (no network access needed)
COPY src/alert/alertmanager/build/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz /tmp/
RUN tar xvf /tmp/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz -C /tmp && \
    mv /tmp/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64 /usr/local/alertmanager && \
    rm -f /tmp/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz

ENV ALERTMANAGER_BASE_PATH=/private/argus/alert/alertmanager

ARG ARGUS_BUILD_UID=2133
ARG ARGUS_BUILD_GID=2015
ENV ARGUS_BUILD_UID=${ARGUS_BUILD_UID}
ENV ARGUS_BUILD_GID=${ARGUS_BUILD_GID}

RUN mkdir -p /usr/share/alertmanager && \
    mkdir -p ${ALERTMANAGER_BASE_PATH} && \
    mkdir -p /private/argus/etc && \
    rm -rf /alertmanager && \
    ln -s ${ALERTMANAGER_BASE_PATH} /alertmanager

# Make sure the ubuntu account exists and uses ARGUS_BUILD_UID/GID
RUN set -eux; \
    # Ensure a group with the target GID exists; otherwise try to move the ubuntu group to that GID, else create a new group
    if getent group "${ARGUS_BUILD_GID}" >/dev/null; then \
        :; \
    else \
        if getent group ubuntu >/dev/null; then \
            groupmod -g "${ARGUS_BUILD_GID}" ubuntu || true; \
        else \
            groupadd -g "${ARGUS_BUILD_GID}" ubuntu || groupadd -g "${ARGUS_BUILD_GID}" argus || true; \
        fi; \
    fi; \
    # Create or adjust the ubuntu user
    if id ubuntu >/dev/null 2>&1; then \
        # Set the primary group to the target GID (a numeric GID is accepted)
        usermod -g "${ARGUS_BUILD_GID}" ubuntu || true; \
        # If the target UID is unoccupied, update ubuntu's UID
        if [ "$(id -u ubuntu)" != "${ARGUS_BUILD_UID}" ] && ! getent passwd "${ARGUS_BUILD_UID}" >/dev/null; then \
            usermod -u "${ARGUS_BUILD_UID}" ubuntu || true; \
        fi; \
    else \
        useradd -m -s /bin/bash -u "${ARGUS_BUILD_UID}" -g "${ARGUS_BUILD_GID}" ubuntu || true; \
    fi; \
    # Change ownership of the key directories to the ubuntu UID/GID
    chown -R "${ARGUS_BUILD_UID}:${ARGUS_BUILD_GID}" /usr/share/alertmanager /alertmanager ${ALERTMANAGER_BASE_PATH} /private/argus/etc /usr/local/bin || true

# Configure intranet apt sources (when the intranet option is set)
RUN if [ "$USE_INTRANET" = "true" ]; then \
        echo "Configuring intranet apt sources..." && \
        cp /etc/apt/sources.list /etc/apt/sources.list.bak && \
        echo "deb [trusted=yes] http://10.68.64.1/ubuntu2204/ jammy main" > /etc/apt/sources.list && \
        echo 'Acquire::https::Verify-Peer "false";' > /etc/apt/apt.conf.d/99disable-ssl-check && \
        echo 'Acquire::https::Verify-Host "false";' >> /etc/apt/apt.conf.d/99disable-ssl-check; \
    fi


# Configure the apt sources used at deployment time
RUN if [ "$USE_INTRANET" = "true" ]; then \
        echo "deb [trusted=yes] https://10.92.132.52/mirrors/ubuntu2204/ jammy main" > /etc/apt/sources.list; \
    fi

# Create the supervisor log directory
RUN mkdir -p /var/log/supervisor

# Copy the supervisor configuration
COPY src/alert/alertmanager/build/supervisord.conf /etc/supervisor/conf.d/supervisord.conf

# Copy the startup script
COPY src/alert/alertmanager/build/start-am-supervised.sh /usr/local/bin/start-am-supervised.sh
RUN chmod +x /usr/local/bin/start-am-supervised.sh

# Copy the Alertmanager configuration
COPY src/alert/alertmanager/build/alertmanager.yml /etc/alertmanager/alertmanager.yml
RUN chmod +x /etc/alertmanager/alertmanager.yml
# COPY src/alert/alertmanager/build/alertmanager.yml ${ALERTMANAGER_BASE_PATH}/alertmanager.yml

# Copy the DNS monitor script
COPY src/alert/alertmanager/build/dns-monitor.sh /usr/local/bin/dns-monitor.sh
RUN chmod +x /usr/local/bin/dns-monitor.sh

# Keep root as the user; supervisor handles the user switch
USER root

# Expose the port (Alertmanager default 9093)
EXPOSE 9093

# Use supervisor as the entrypoint
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]
Binary file not shown.
@@ -1,19 +0,0 @@
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']  # grouping: merge alerts that share the same alertname + instance
  group_wait: 30s       # after the first alert, wait 30s for others in the same group before sending
  group_interval: 5m    # after a group changes, wait at least 5 minutes before sending again
  repeat_interval: 3h   # repeat an identical alert every 3 hours
  receiver: 'null'

receivers:
  - name: 'null'

inhibit_rules:
  - source_match:
      severity: 'critical'   # while a critical alert exists
    target_match:
      severity: 'warning'    # suppress warning alerts on the same instance
    equal: ['instance']
@@ -1,13 +0,0 @@
#!/bin/bash
set -euo pipefail
docker pull ubuntu:24.04

source src/alert/tests/.env

docker build \
  --build-arg ARGUS_BUILD_UID=${ARGUS_BUILD_UID} \
  --build-arg ARGUS_BUILD_GID=${ARGUS_BUILD_GID} \
  -f src/alert/alertmanager/build/Dockerfile \
  -t argus-alertmanager:latest .

docker save -o argus-alertmanager-latest.tar argus-alertmanager:latest
@@ -1,68 +0,0 @@
#!/bin/bash

# DNS monitor script - checks dns.conf for changes every 10 seconds
# and runs the update-dns.sh script when a change is detected

DNS_CONF="/private/argus/etc/dns.conf"
DNS_BACKUP="/tmp/dns.conf.backup"
UPDATE_SCRIPT="/private/argus/etc/update-dns.sh"
LOG_FILE="/var/log/supervisor/dns-monitor.log"

# Make sure the log file exists
touch "$LOG_FILE"

log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') [DNS-Monitor] $1" >> "$LOG_FILE"
}

log_message "DNS monitor script started"

while true; do
    if [ -f "$DNS_CONF" ]; then
        if [ -f "$DNS_BACKUP" ]; then
            # Compare file contents
            if ! cmp -s "$DNS_CONF" "$DNS_BACKUP"; then
                log_message "Detected DNS configuration change"

                # Refresh the backup file
                cp "$DNS_CONF" "$DNS_BACKUP"

                # Run the update script
                if [ -x "$UPDATE_SCRIPT" ]; then
                    log_message "Running DNS update script: $UPDATE_SCRIPT"
                    "$UPDATE_SCRIPT" >> "$LOG_FILE" 2>&1
                    if [ $? -eq 0 ]; then
                        log_message "DNS update script succeeded"
                    else
                        log_message "DNS update script failed"
                    fi
                else
                    log_message "Warning: update script missing or not executable: $UPDATE_SCRIPT"
                fi
            fi
        else

            # Configuration file seen for the first time; run the update script
            if [ -x "$UPDATE_SCRIPT" ]; then
                log_message "Running DNS update script: $UPDATE_SCRIPT"
                "$UPDATE_SCRIPT" >> "$LOG_FILE" 2>&1
                if [ $? -eq 0 ]; then
                    log_message "DNS update script succeeded"

                    # First run: create the backup after the update
                    cp "$DNS_CONF" "$DNS_BACKUP"
                    log_message "Created DNS configuration backup"

                else
                    log_message "DNS update script failed"
                fi
            else
                log_message "Warning: update script missing or not executable: $UPDATE_SCRIPT"
            fi
        fi
    else
        log_message "Warning: DNS configuration file not found: $DNS_CONF"
    fi

    sleep 10
done
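
The same poll-compare-backup loop could equally be written in Python; a minimal sketch of the idea (paths as in the script, interval hard-coded to 10s, the reaction to a change left as a print):

```python
import filecmp
import shutil
import time
from pathlib import Path

DNS_CONF = Path("/private/argus/etc/dns.conf")
DNS_BACKUP = Path("/tmp/dns.conf.backup")

def changed() -> bool:
    """True when dns.conf exists and differs from (or has no) backup."""
    if not DNS_CONF.exists():
        return False
    return not (DNS_BACKUP.exists() and filecmp.cmp(DNS_CONF, DNS_BACKUP, shallow=False))

while True:
    if changed():
        shutil.copy2(DNS_CONF, DNS_BACKUP)   # refresh the backup, then react
        print("dns.conf changed; would run update-dns.sh here")
    time.sleep(10)
```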
@@ -1,22 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

# Download the Alertmanager offline package into this directory for COPY during the Docker build
# Usage:
#   ./fetch-dist.sh [version]
# Example:
#   ./fetch-dist.sh 0.28.1

VER="${1:-0.28.1}"
OUT="alertmanager-${VER}.linux-amd64.tar.gz"
URL="https://github.com/prometheus/alertmanager/releases/download/v${VER}/${OUT}"

if [[ -f "$OUT" ]]; then
  echo "[INFO] $OUT already exists, skip download"
  exit 0
fi

echo "[INFO] Downloading $URL"
curl -fL --retry 3 --connect-timeout 10 -o "$OUT" "$URL"
echo "[OK] Saved to $(pwd)/$OUT"
@@ -1,25 +0,0 @@
#!/bin/bash
set -euo pipefail

echo "[INFO] Starting Alertmanager under supervisor..."

ALERTMANAGER_BASE_PATH=${ALERTMANAGER_BASE_PATH:-/private/argus/alert/alertmanager}

echo "[INFO] Alertmanager base path: ${ALERTMANAGER_BASE_PATH}"

# Use /etc/alertmanager/alertmanager.yml inside the container as the config file, avoiding permission problems from writing to the mounted volume
echo "[INFO] Using /etc/alertmanager/alertmanager.yml as configuration"


# Record the container IP address
DOMAIN=alertmanager.alert.argus.com
IP=$(ifconfig | grep -A 1 eth0 | grep inet | awk '{print $2}')
echo "current IP: ${IP}"
echo "${IP}" > /private/argus/etc/${DOMAIN}
chmod 755 /private/argus/etc/${DOMAIN}


echo "[INFO] Starting Alertmanager process..."

# Start the main Alertmanager process
exec /usr/local/alertmanager/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/alertmanager --cluster.listen-address=""
@@ -1,39 +0,0 @@
[supervisord]
nodaemon=true
logfile=/var/log/supervisor/supervisord.log
pidfile=/var/run/supervisord.pid
user=root

[program:alertmanager]
command=/usr/local/bin/start-am-supervised.sh
user=ubuntu
stdout_logfile=/var/log/supervisor/alertmanager.log
stderr_logfile=/var/log/supervisor/alertmanager_error.log
autorestart=true
startretries=3
startsecs=10
stopwaitsecs=20
killasgroup=true
stopasgroup=true

[program:dns-monitor]
command=/usr/local/bin/dns-monitor.sh
user=root
stdout_logfile=/var/log/supervisor/dns-monitor.log
stderr_logfile=/var/log/supervisor/dns-monitor_error.log
autorestart=true
startretries=3
startsecs=5
stopwaitsecs=10
killasgroup=true
stopasgroup=true

[unix_http_server]
file=/var/run/supervisor.sock
chmod=0700

[supervisorctl]
serverurl=unix:///var/run/supervisor.sock

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
@@ -1,60 +0,0 @@
# Alert configuration

> Reference: [Custom Prometheus alert rules](https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/alert/prometheus-alert-rule)

Configuring alerting in Prometheus takes two steps:

1. Write the alert rule file (the rules file)
2. Load the rules in prometheus.yml and configure Alertmanager

## 1. Writing the alert rule file
Alert rules look like this:
```yml
groups:
  - name: example-rules
    interval: 30s  # evaluate every 30 seconds
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} in job {{ $labels.job }} has been unresponsive for more than 1 minute."

      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Instance {{ $labels.instance }} CPU usage has exceeded 80% for 5 minutes."
```

Where:

- `alert`: the name of the alert rule.
- `expr`: the trigger condition as a PromQL expression, evaluated to decide whether any time series satisfies it.
- `for`: optional evaluation wait time; the alert only fires after the condition has held for this duration, and alerts raised during the wait are in the pending state.
- `labels`: custom labels attached to the alert, usable for routing and grouping in Alertmanager.
- `annotations`: additional information, such as descriptive text for the alert; annotation contents are sent to Alertmanager together with the alert when it fires and can provide a summary and details.

## 2. Referencing in prometheus.yml
Add `rule_files` and `alerting` to prometheus.yml:

```yml
global:
  [ evaluation_interval: <duration> | default = 1m ]

rule_files:
  [ - <filepath_glob> ... ]

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager.alert.argus.com:9093"  # Alertmanager address
```
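
Before pointing Prometheus at a new rules file, it can be worth sanity-checking the YAML shape described above. A small hedged sketch, assuming PyYAML is available (it is not a dependency documented in this repo):

```python
import sys
import yaml  # assumption: PyYAML installed

REQUIRED_RULE_KEYS = {"alert", "expr"}

def check_rules(path: str) -> list[str]:
    """Return a list of structural problems found in a Prometheus rules file."""
    problems = []
    with open(path) as fh:
        doc = yaml.safe_load(fh) or {}
    for group in doc.get("groups", []):
        for rule in group.get("rules", []):
            missing = REQUIRED_RULE_KEYS - rule.keys()
            if missing:
                problems.append(f"{group.get('name')}: rule missing {sorted(missing)}")
    return problems

if __name__ == "__main__":
    issues = check_rules(sys.argv[1])
    print("\n".join(issues) or "rules file looks structurally sound")
```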
@@ -1,37 +0,0 @@
groups:
  - name: example-rules
    interval: 30s  # evaluate every 30 seconds
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} in job {{ $labels.job }} has been unresponsive for more than 1 minute."

      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Instance {{ $labels.instance }} CPU usage has exceeded 80% for 5 minutes."
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Instance {{ $labels.instance }} memory usage has exceeded 80% for 5 minutes."
      - alert: DiskSpaceLow
        expr: (node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} - node_filesystem_free_bytes{fstype!~"tmpfs|overlay"}) / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Instance {{ $labels.instance }} disk usage has exceeded 90% for 10 minutes."
@@ -1,5 +0,0 @@
DATA_ROOT=/home/argus/tmp/private/argus
ARGUS_BUILD_UID=1048
ARGUS_BUILD_GID=1048

USE_INTRANET=false
@@ -1,5 +0,0 @@
DATA_ROOT=/home/argus/tmp/private/argus
ARGUS_BUILD_UID=1048
ARGUS_BUILD_GID=1048

USE_INTRANET=false
@@ -1,37 +0,0 @@
services:
  alertmanager:
    build:
      context: ../../../
      dockerfile: src/alert/alertmanager/build/Dockerfile
      args:
        ARGUS_BUILD_UID: ${ARGUS_BUILD_UID:-2133}
        ARGUS_BUILD_GID: ${ARGUS_BUILD_GID:-2015}
        USE_INTRANET: ${USE_INTRANET:-false}
    image: argus-alertmanager:latest
    container_name: argus-alertmanager
    environment:
      - ALERTMANAGER_BASE_PATH=/private/argus/alert/alertmanager
      - ARGUS_BUILD_UID=${ARGUS_BUILD_UID:-2133}
      - ARGUS_BUILD_GID=${ARGUS_BUILD_GID:-2015}
    ports:
      - "${ARGUS_PORT:-9093}:9093"
    volumes:
      - ${DATA_ROOT:-./data}/alert/alertmanager:/private/argus/alert/alertmanager
      - ${DATA_ROOT:-./data}/etc:/private/argus/etc
    networks:
      - argus-debug-net
    restart: unless-stopped
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

networks:
  argus-debug-net:
    driver: bridge
    name: argus-debug-net

volumes:
  alertmanager_data:
    driver: local
@@ -1,113 +0,0 @@
#!/bin/bash
# verify_alertmanager.sh
# Verifies after deployment that the Prometheus ↔ Alertmanager communication path works

set -euo pipefail

#=============================
# Base configuration
#=============================
PROM_URL="${PROM_URL:-http://prom.metric.argus.com:9090}"
ALERT_URL="${ALERT_URL:-http://alertmanager.alert.argus.com:9093}"
# TODO: adjust the rules directory for the actual deployment environment
DATA_ROOT="${DATA_ROOT:-/private/argus}"
RULE_DIR="$DATA_ROOT/metric/prometheus/rules"
TMP_RULE="/tmp/test_rule.yml"

#=============================
# Helper functions
#=============================
GREEN="\033[32m"; RED="\033[31m"; YELLOW="\033[33m"; RESET="\033[0m"

log_info() { echo -e "${YELLOW}[INFO]${RESET} $1"; }
log_success() { echo -e "${GREEN}[OK]${RESET} $1"; }
log_error() { echo -e "${RED}[ERROR]${RESET} $1"; }

fail_exit() { log_error "$1"; exit 1; }

#=============================
# Step 1: Check that Alertmanager is reachable
#=============================
log_info "Checking Alertmanager status..."
if curl -sSf "${ALERT_URL}/api/v2/status" >/dev/null 2>&1; then
  log_success "Alertmanager service healthy (${ALERT_URL})"
else
  fail_exit "Cannot reach Alertmanager; check port mappings and container status."
fi

#=============================
# Step 2: Send a manual test alert
#=============================
log_info "Sending a manual test alert..."
curl -s -XPOST "${ALERT_URL}/api/v2/alerts" -H "Content-Type: application/json" -d '[
  {
    "labels": {
      "alertname": "ManualTestAlert",
      "severity": "info"
    },
    "annotations": {
      "summary": "This is a test alert from deploy verification"
    },
    "startsAt": "'$(date -Iseconds)'"
  }
]' >/dev/null && log_success "Test alert delivered to Alertmanager"

#=============================
# Step 3: Check that the Prometheus config includes Alertmanager
#=============================
log_info "Checking whether Prometheus is configured with Alertmanager..."
if curl -s "${PROM_URL}/api/v1/status/config" | grep -q "alertmanagers"; then
  log_success "Prometheus has an Alertmanager target configured"
else
  fail_exit "Prometheus has no Alertmanager configured; check prometheus.yml"
fi

#=============================
# Step 4: Create and load a test alert rule
#=============================
log_info "Creating temporary test rule ${TMP_RULE} ..."
cat <<EOF > "${TMP_RULE}"
groups:
  - name: deploy-verify-group
    rules:
      - alert: DeployVerifyAlert
        expr: vector(1)
        labels:
          severity: warning
        annotations:
          summary: "Deployment verification alert"
EOF

mkdir -p "${RULE_DIR}"
cp "${TMP_RULE}" "${RULE_DIR}/test_rule.yml"

log_info "Reloading Prometheus to pick up the new rule..."
if curl -s -X POST "${PROM_URL}/-/reload" >/dev/null; then
  log_success "Prometheus rules reloaded"
else
  fail_exit "Prometheus reload failed; check API accessibility."
fi

#=============================
# Step 5: Wait and verify that Alertmanager received the alert
#=============================
log_info "Waiting for the alert to fire (about 5 seconds)..."
sleep 5

if curl -s "${ALERT_URL}/api/v2/alerts" | grep -q "DeployVerifyAlert"; then
  log_success "Prometheus → Alertmanager alert path verified"
else
  fail_exit "DeployVerifyAlert not found in Alertmanager; check network or configuration."
fi

#=============================
# Step 6: Clean up the test rule
#=============================
log_info "Cleaning up the temporary test rule..."
rm -f "${RULE_DIR}/test_rule.yml" "${TMP_RULE}"

curl -s -X POST "${PROM_URL}/-/reload" >/dev/null \
  && log_success "Prometheus verification rule cleaned up" \
  || log_error "Prometheus reload for cleanup failed; verify manually."

log_success "Deployment verification passed! Prometheus ↔ Alertmanager communication OK."
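
Step 5's `grep` over raw output could also be done against the JSON array that the Alertmanager v2 API returns. A standard-library sketch; the base URL is an assumption to adjust to the deployed `ALERT_URL`:

```python
import json
import time
import urllib.request

ALERT_URL = "http://localhost:9093"   # assumption: adjust to the deployed ALERT_URL

def alert_present(name: str) -> bool:
    # GET /api/v2/alerts returns a JSON array of alerts, each with a "labels" map
    with urllib.request.urlopen(f"{ALERT_URL}/api/v2/alerts", timeout=5) as resp:
        alerts = json.load(resp)
    return any(a.get("labels", {}).get("alertname") == name for a in alerts)

for _ in range(5):            # poll up to ~25s instead of a single 5s sleep
    if alert_present("DeployVerifyAlert"):
        print("alert path verified")
        break
    time.sleep(5)
else:
    raise SystemExit("DeployVerifyAlert not seen in Alertmanager")
```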
@@ -26,7 +26,6 @@ RUN apt-get update && \
    apt-get install -y \
        bind9 \
        bind9utils \
        dnsutils \
        bind9-doc \
        supervisor \
        net-tools \
@@ -17,9 +17,6 @@ log_message() {

log_message "DNS monitor script started"

log_message "Removing DNS backup file (if present)"
rm -f $DNS_BACKUP

while true; do
    if [ -f "$DNS_CONF" ]; then
        if [ -f "$DNS_BACKUP" ]; then
1 src/bundle/cpu-node-bundle/.gitignore vendored
@@ -1 +0,0 @@
.build*/
@@ -1,33 +0,0 @@
FROM ubuntu:22.04

ARG ARGUS_BUILD_UID=2133
ARG ARGUS_BUILD_GID=2015

ENV DEBIAN_FRONTEND=noninteractive \
    TZ=Asia/Shanghai \
    ARGUS_LOGS_WORLD_WRITABLE=1

RUN set -eux; \
    apt-get update; \
    apt-get install -y --no-install-recommends \
      ca-certificates curl wget iproute2 iputils-ping net-tools jq tzdata \
      cron procps supervisor vim less tar gzip python3; \
    rm -rf /var/lib/apt/lists/*; \
    ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

WORKDIR /

# Offline fluent-bit assets and bundle tarball are staged by the build script
COPY node-bootstrap.sh /usr/local/bin/node-bootstrap.sh
COPY health-watcher.sh /usr/local/bin/health-watcher.sh
COPY private/start-fluent-bit.sh /private/start-fluent-bit.sh
COPY private/etc /private/etc
COPY private/packages /private/packages
COPY bundle/ /bundle/

RUN chmod +x /usr/local/bin/node-bootstrap.sh /usr/local/bin/health-watcher.sh /private/start-fluent-bit.sh || true; \
    mkdir -p /logs/train /logs/infer /buffers /opt/argus-metric; \
    if [ "${ARGUS_LOGS_WORLD_WRITABLE}" = "1" ]; then chmod 1777 /logs/train /logs/infer || true; else chmod 755 /logs/train /logs/infer || true; fi; \
    chmod 770 /buffers || true

ENTRYPOINT ["/usr/local/bin/node-bootstrap.sh"]
@@ -1,59 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

# health-watcher.sh (CPU node bundle)
# Periodically runs check_health.sh and restart_unhealthy.sh for self-healing inside the node container.

INSTALL_ROOT="/opt/argus-metric"
INTERVAL="${HEALTH_WATCH_INTERVAL:-60}"
VER_DIR="${1:-}"

log(){ echo "[HEALTH-WATCHER] $*"; }

resolve_ver_dir() {
  local dir=""
  if [[ -n "${VER_DIR:-}" && -d "$VER_DIR" ]]; then
    dir="$VER_DIR"
  elif [[ -L "$INSTALL_ROOT/current" ]]; then
    dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
  fi
  if [[ -z "$dir" ]]; then
    dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
  fi
  echo "$dir"
}

main() {
  log "starting with interval=${INTERVAL}s"
  local dir
  dir="$(resolve_ver_dir)"
  if [[ -z "$dir" || ! -d "$dir" ]]; then
    log "no valid install dir found under $INSTALL_ROOT; exiting"
    exit 0
  fi

  local chk="$dir/check_health.sh"
  local rst="$dir/restart_unhealthy.sh"

  if [[ ! -x "$chk" && ! -x "$rst" ]]; then
    log "neither check_health.sh nor restart_unhealthy.sh is executable under $dir; exiting"
    exit 0
  fi

  log "watching install dir: $dir"

  while :; do
    if [[ -x "$chk" ]]; then
      log "running check_health.sh"
      "$chk" >> "$dir/.health_check.watch.log" 2>&1 || log "check_health.sh reported issues (see .health_check.watch.log)"
    fi
    if [[ -x "$rst" ]]; then
      log "running restart_unhealthy.sh"
      "$rst" >> "$dir/.restart.watch.log" 2>&1 || log "restart_unhealthy.sh reported issues (see .restart.watch.log)"
    fi
    sleep "$INTERVAL"
  done
}

main "$@"
@ -1,131 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
echo "[BOOT] CPU node bundle starting"
|
||||
|
||||
INSTALL_ROOT="/opt/argus-metric"
|
||||
BUNDLE_DIR="/bundle"
|
||||
STATE_DIR_BASE="/private/argus/agent"
|
||||
|
||||
mkdir -p "$INSTALL_ROOT" "$STATE_DIR_BASE" /logs/train /logs/infer /buffers || true
|
||||
|
||||
# Ensure world-writable logs dir with sticky bit (align with deployment_new policy)
|
||||
if [[ "${ARGUS_LOGS_WORLD_WRITABLE:-1}" == "1" ]]; then
|
||||
chmod 1777 /logs/train /logs/infer || true
|
||||
else
|
||||
chmod 755 /logs/train /logs/infer || true
|
||||
fi
|
||||
chmod 770 /buffers || true
|
||||
|
||||
installed_ok=0
|
||||
|
||||
# 1) already installed?
|
||||
if [[ -L "$INSTALL_ROOT/current" && -d "$INSTALL_ROOT/current" ]]; then
|
||||
echo "[BOOT] client already installed at $INSTALL_ROOT/current"
|
||||
else
|
||||
# 2) try local bundle first (argus-metric_*.tar.gz)
|
||||
tarball=$(ls -1 "$BUNDLE_DIR"/argus-metric_*.tar.gz 2>/dev/null | head -1 || true)
|
||||
if [[ -n "${tarball:-}" ]]; then
|
||||
echo "[BOOT] installing from local bundle: $(basename "$tarball")"
|
||||
tmp=$(mktemp -d)
|
||||
tar -xzf "$tarball" -C "$tmp"
|
||||
    # locate root containing version.json
    root="$tmp"
    if [[ ! -f "$root/version.json" ]]; then
      sub=$(find "$tmp" -mindepth 1 -maxdepth 1 -type d | head -n1 || true)
      [[ -n "$sub" && -f "$sub/version.json" ]] && root="$sub"
    fi
    if [[ ! -f "$root/version.json" ]]; then
      echo "[BOOT][WARN] version.json not found in bundle; fallback to FTP"
    else
      ver=$(sed -n 's/.*"version"\s*:\s*"\([^"]\+\)".*/\1/p' "$root/version.json" | head -n1)
      if [[ -z "$ver" ]]; then
        echo "[BOOT][WARN] failed to parse version from version.json; fallback to FTP"
      else
        target_root="$INSTALL_ROOT"
        version_dir="$target_root/versions/$ver"
        mkdir -p "$version_dir"
        shopt -s dotglob
        mv "$root"/* "$version_dir/" 2>/dev/null || true
        shopt -u dotglob
        if [[ -f "$version_dir/install.sh" ]]; then
          chmod +x "$version_dir/install.sh" 2>/dev/null || true
          (
            export AUTO_START_DCGM="0"  # N/A on CPU
            cd "$version_dir" && ./install.sh "$version_dir"
          )
          echo "$ver" > "$target_root/LATEST_VERSION" 2>/dev/null || true
          ln -sfn "$version_dir" "$target_root/current" 2>/dev/null || true
          if [[ -L "$target_root/current" && -d "$target_root/current" ]]; then
            installed_ok=1
            echo "[BOOT] local bundle install OK: version=$ver"
          else
            echo "[BOOT][WARN] current symlink not present after install; will rely on healthcheck to confirm"
          fi
        else
          echo "[BOOT][WARN] install.sh missing under $version_dir; fallback to FTP"
        fi
      fi
    fi
  fi

  # 3) fallback: use FTP setup if not installed
  if [[ ! -L "$INSTALL_ROOT/current" && "$installed_ok" -eq 0 ]]; then
    echo "[BOOT] fallback to FTP setup"
    if [[ -z "${FTPIP:-}" || -z "${FTP_USER:-}" || -z "${FTP_PASSWORD:-}" ]]; then
      echo "[BOOT][ERROR] FTP variables not set (FTPIP/FTP_USER/FTP_PASSWORD)" >&2
      exit 1
    fi
    curl -u "$FTP_USER:$FTP_PASSWORD" -fsSL "ftp://$FTPIP:21/setup.sh" -o /tmp/setup.sh
    chmod +x /tmp/setup.sh
    /tmp/setup.sh --server "$FTPIP" --user "$FTP_USER" --password "$FTP_PASSWORD" --port 21
  fi
fi

# 4) ensure argus-agent is running (best-effort)
if ! pgrep -x argus-agent >/dev/null 2>&1; then
  echo "[BOOT] starting argus-agent (not detected)"
  setsid /usr/local/bin/argus-agent >/var/log/argus-agent.log 2>&1 < /dev/null &
fi

# 5) post-install selfcheck and state
ver_dir=""
if [[ -L "$INSTALL_ROOT/current" ]]; then
  ver_dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
fi
if [[ -z "$ver_dir" ]]; then
  ver_dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
fi

if [[ -n "$ver_dir" && -x "$ver_dir/check_health.sh" ]]; then
  echo "[BOOT] running initial health check: $ver_dir/check_health.sh"
  if "$ver_dir/check_health.sh" >> "$ver_dir/.health_check.init.log" 2>&1; then
    echo "[BOOT] initial health check completed (see $ver_dir/.health_check.init.log)"
  else
    echo "[BOOT][WARN] initial health check reported issues (see $ver_dir/.health_check.init.log)"
  fi
else
  echo "[BOOT][WARN] initial health check skipped (script missing: $ver_dir/check_health.sh)"
fi

host="$(hostname)"
state_dir="$STATE_DIR_BASE/${host}"
mkdir -p "$state_dir" 2>/dev/null || true
for i in {1..60}; do
  if [[ -s "$state_dir/node.json" ]]; then
    echo "[BOOT] node state present: $state_dir/node.json"
    break
  fi
  sleep 2
done

# 6) spawn health watcher (best-effort, non-blocking)
if command -v /usr/local/bin/health-watcher.sh >/dev/null 2>&1; then
  echo "[BOOT] starting health watcher for $ver_dir"
  setsid /usr/local/bin/health-watcher.sh "${ver_dir:-}" >/var/log/health-watcher.log 2>&1 < /dev/null || true &
else
  echo "[BOOT][WARN] health-watcher.sh not found; skip health watcher"
fi

echo "[BOOT] ready; entering sleep"
exec sleep infinity
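Side note: both bootstrap variants scrape `version` out of version.json with sed. Since jq is installed in the bundle images anyway, a jq-based probe is a sturdier sketch for local debugging; the tarball path below is a hypothetical example, not a real artifact name:

```bash
#!/usr/bin/env bash
# Sketch: inspect a bundle tarball's version without installing it.
# Assumes jq is present; the tarball path is an illustrative placeholder.
set -euo pipefail
tarball="/bundle/argus-metric_20250101.tar.gz"   # hypothetical example path
tmp=$(mktemp -d); trap 'rm -rf "$tmp"' EXIT
tar -xzf "$tarball" -C "$tmp"
# version.json may sit at the top level or one directory down
vf=$(find "$tmp" -maxdepth 2 -name version.json | head -n1)
[[ -n "$vf" ]] || { echo "no version.json in bundle" >&2; exit 1; }
jq -r '.version // empty' "$vf"
```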
1 src/bundle/gpu-node-bundle/.gitignore vendored
@ -1 +0,0 @@
.build*/
@ -1,44 +0,0 @@
ARG CUDA_VER=12.2.2
FROM nvidia/cuda:${CUDA_VER}-runtime-ubuntu22.04

ARG CLIENT_VER=0.0.0
ARG BUNDLE_DATE=00000000

LABEL org.opencontainers.image.title="argus-sys-metric-test-node-bundle-gpu" \
      org.opencontainers.image.description="GPU node bundle with embedded Argus client artifact" \
      org.opencontainers.image.version="${CLIENT_VER}" \
      org.opencontainers.image.revision_date="${BUNDLE_DATE}" \
      maintainer="Argus"

ENV DEBIAN_FRONTEND=noninteractive \
    TZ=Asia/Shanghai \
    ARGUS_LOGS_WORLD_WRITABLE=1 \
    ES_HOST=es.log.argus.com \
    ES_PORT=9200 \
    CLUSTER=local \
    RACK=dev

RUN set -eux; \
    apt-get update; \
    apt-get install -y --no-install-recommends \
      ca-certificates curl wget iproute2 iputils-ping net-tools jq tzdata cron procps vim less \
      tar gzip; \
    rm -rf /var/lib/apt/lists/*; \
    ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

WORKDIR /

# Expect staged build context to provide these directories/files
COPY bundle/ /bundle/
COPY node-bootstrap.sh /usr/local/bin/node-bootstrap.sh
COPY health-watcher.sh /usr/local/bin/health-watcher.sh
COPY private/start-fluent-bit.sh /private/start-fluent-bit.sh
COPY private/etc /private/etc
COPY private/packages /private/packages

RUN chmod +x /usr/local/bin/node-bootstrap.sh /usr/local/bin/health-watcher.sh /private/start-fluent-bit.sh || true; \
    mkdir -p /logs/train /logs/infer /buffers /opt/argus-metric; \
    chmod 1777 /logs/train /logs/infer || true; \
    chmod 770 /buffers || true

ENTRYPOINT ["/usr/local/bin/node-bootstrap.sh"]
@ -1,59 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

# health-watcher.sh (GPU bundle)
# Periodically runs check_health.sh and restart_unhealthy.sh for in-container self-healing on GPU nodes.

INSTALL_ROOT="/opt/argus-metric"
INTERVAL="${HEALTH_WATCH_INTERVAL:-60}"
VER_DIR="${1:-}"

log(){ echo "[HEALTH-WATCHER] $*"; }

resolve_ver_dir() {
  local dir=""
  if [[ -n "${VER_DIR:-}" && -d "$VER_DIR" ]]; then
    dir="$VER_DIR"
  elif [[ -L "$INSTALL_ROOT/current" ]]; then
    dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
  fi
  if [[ -z "$dir" ]]; then
    dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
  fi
  echo "$dir"
}

main() {
  log "starting with interval=${INTERVAL}s"
  local dir
  dir="$(resolve_ver_dir)"
  if [[ -z "$dir" || ! -d "$dir" ]]; then
    log "no valid install dir found under $INSTALL_ROOT; exiting"
    exit 0
  fi

  local chk="$dir/check_health.sh"
  local rst="$dir/restart_unhealthy.sh"

  if [[ ! -x "$chk" && ! -x "$rst" ]]; then
    log "neither check_health.sh nor restart_unhealthy.sh is executable under $dir; exiting"
    exit 0
  fi

  log "watching install dir: $dir"

  while :; do
    if [[ -x "$chk" ]]; then
      log "running check_health.sh"
      "$chk" >> "$dir/.health_check.watch.log" 2>&1 || log "check_health.sh reported issues (see .health_check.watch.log)"
    fi
    if [[ -x "$rst" ]]; then
      log "running restart_unhealthy.sh"
      "$rst" >> "$dir/.restart.watch.log" 2>&1 || log "restart_unhealthy.sh reported issues (see .restart.watch.log)"
    fi
    sleep "$INTERVAL"
  done
}

main "$@"
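The watcher's cadence comes from HEALTH_WATCH_INTERVAL with a 60 s default, so tightening the loop for a test run is a matter of overriding one env var at container start. A hedged sketch; the image tag and container name are assumed examples, not confirmed names from the compose files:

```bash
# Sketch: run the GPU bundle with a faster self-heal cadence for testing.
# Image tag and container name are hypothetical placeholders.
docker run -d --name argus-gpu-node-test \
  -e HEALTH_WATCH_INTERVAL=15 \
  argus-sys-metric-test-node-bundle-gpu:latest
# Follow the watcher's log inside the container:
docker exec argus-gpu-node-test tail -f /var/log/health-watcher.log
```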
@ -1,135 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

echo "[BOOT] GPU node bundle starting"

INSTALL_ROOT="/opt/argus-metric"
BUNDLE_DIR="/bundle"
STATE_DIR_BASE="/private/argus/agent"

mkdir -p "$INSTALL_ROOT" "$STATE_DIR_BASE" /logs/train /logs/infer /buffers || true

# Ensure world-writable logs dir with sticky bit (align with deployment_new policy)
if [[ "${ARGUS_LOGS_WORLD_WRITABLE:-1}" == "1" ]]; then
  chmod 1777 /logs/train /logs/infer || true
else
  chmod 755 /logs/train /logs/infer || true
fi
chmod 770 /buffers || true

installed_ok=0

# 1) already installed?
if [[ -L "$INSTALL_ROOT/current" && -d "$INSTALL_ROOT/current" ]]; then
  echo "[BOOT] client already installed at $INSTALL_ROOT/current"
else
  # 2) try local bundle first (argus-metric_*.tar.gz)
  tarball=$(ls -1 "$BUNDLE_DIR"/argus-metric_*.tar.gz 2>/dev/null | head -1 || true)
  if [[ -n "${tarball:-}" ]]; then
    echo "[BOOT] installing from local bundle: $(basename "$tarball")"
    tmp=$(mktemp -d)
    tar -xzf "$tarball" -C "$tmp"
    # locate root containing version.json
    root="$tmp"
    if [[ ! -f "$root/version.json" ]]; then
      sub=$(find "$tmp" -mindepth 1 -maxdepth 1 -type d | head -n1 || true)
      [[ -n "$sub" && -f "$sub/version.json" ]] && root="$sub"
    fi
    if [[ ! -f "$root/version.json" ]]; then
      echo "[BOOT][WARN] version.json not found in bundle; fallback to FTP"
    else
      ver=$(sed -n 's/.*"version"\s*:\s*"\([^"]\+\)".*/\1/p' "$root/version.json" | head -n1)
      if [[ -z "$ver" ]]; then
        echo "[BOOT][WARN] failed to parse version from version.json; fallback to FTP"
      else
        target_root="$INSTALL_ROOT"
        version_dir="$target_root/versions/$ver"
        mkdir -p "$version_dir"
        shopt -s dotglob
        mv "$root"/* "$version_dir/" 2>/dev/null || true
        shopt -u dotglob
        if [[ -f "$version_dir/install.sh" ]]; then
          chmod +x "$version_dir/install.sh" 2>/dev/null || true
          (
            export AUTO_START_DCGM="${AUTO_START_DCGM:-1}"
            export DCGM_EXPORTER_DISABLE_PROFILING="${DCGM_EXPORTER_DISABLE_PROFILING:-1}"
            export DCGM_EXPORTER_LISTEN="${DCGM_EXPORTER_LISTEN:-:9400}"
            cd "$version_dir" && ./install.sh "$version_dir"
          )
          echo "$ver" > "$target_root/LATEST_VERSION" 2>/dev/null || true
          ln -sfn "$version_dir" "$target_root/current" 2>/dev/null || true
          if [[ -L "$target_root/current" && -d "$target_root/current" ]]; then
            installed_ok=1
            echo "[BOOT] local bundle install OK: version=$ver"
          else
            echo "[BOOT][WARN] current symlink not present after install; will rely on healthcheck to confirm"
          fi
        else
          echo "[BOOT][WARN] install.sh missing under $version_dir; fallback to FTP"
        fi
      fi
    fi
  fi

  # 3) fallback: use FTP setup if not installed
  if [[ ! -L "$INSTALL_ROOT/current" && "$installed_ok" -eq 0 ]]; then
    echo "[BOOT] fallback to FTP setup"
    if [[ -z "${FTPIP:-}" || -z "${FTP_USER:-}" || -z "${FTP_PASSWORD:-}" ]]; then
      echo "[BOOT][ERROR] FTP variables not set (FTPIP/FTP_USER/FTP_PASSWORD)" >&2
      exit 1
    fi
    curl -u "$FTP_USER:$FTP_PASSWORD" -fsSL "ftp://$FTPIP:21/setup.sh" -o /tmp/setup.sh
    chmod +x /tmp/setup.sh
    /tmp/setup.sh --server "$FTPIP" --user "$FTP_USER" --password "$FTP_PASSWORD" --port 21
  fi
fi

# 4) ensure argus-agent is running (best-effort)
if ! pgrep -x argus-agent >/dev/null 2>&1; then
  echo "[BOOT] starting argus-agent (not detected)"
  setsid /usr/local/bin/argus-agent >/var/log/argus-agent.log 2>&1 < /dev/null &
fi

# 5) post-install selfcheck (run once) and state
# prefer current version dir; fallback to first version under /opt/argus-metric/versions
ver_dir=""
if [[ -L "$INSTALL_ROOT/current" ]]; then
  ver_dir="$(readlink -f "$INSTALL_ROOT/current" 2>/dev/null || true)"
fi
if [[ -z "$ver_dir" ]]; then
  # pick the latest by name (semver-like); best-effort
  ver_dir="$(ls -d "$INSTALL_ROOT"/versions/* 2>/dev/null | sort -V | tail -n1 || true)"
fi

if [[ -n "$ver_dir" && -x "$ver_dir/check_health.sh" ]]; then
  echo "[BOOT] running initial health check: $ver_dir/check_health.sh"
  if "$ver_dir/check_health.sh" >> "$ver_dir/.health_check.init.log" 2>&1; then
    echo "[BOOT] initial health check completed (see $ver_dir/.health_check.init.log)"
  else
    echo "[BOOT][WARN] initial health check reported issues (see $ver_dir/.health_check.init.log)"
  fi
else
  echo "[BOOT][WARN] initial health check skipped (script missing: $ver_dir/check_health.sh)"
fi

host="$(hostname)"
state_dir="$STATE_DIR_BASE/${host}"
mkdir -p "$state_dir" 2>/dev/null || true
for i in {1..60}; do
  if [[ -s "$state_dir/node.json" ]]; then
    echo "[BOOT] node state present: $state_dir/node.json"
    break
  fi
  sleep 2
done

# 6) spawn health watcher (best-effort, non-blocking)
if command -v /usr/local/bin/health-watcher.sh >/dev/null 2>&1; then
  echo "[BOOT] starting health watcher for $ver_dir"
  setsid /usr/local/bin/health-watcher.sh "${ver_dir:-}" >/var/log/health-watcher.log 2>&1 < /dev/null || true &
else
  echo "[BOOT][WARN] health-watcher.sh not found; skip health watcher"
fi

echo "[BOOT] ready; entering sleep"
exec sleep infinity
@ -22,7 +22,8 @@
[PARSER]
    Name timestamp_parser
    Format regex
    Regex ^(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:Z|[+-]\d{2}:?\d{2}))\s+(?<level>\w+)\s+(?<message>.*)$
    Regex ^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?<level>\w+)\s+(?<message>.*)$
    Time_Key timestamp
    Time_Format %Y-%m-%dT%H:%M:%S%z
    Time_Format %Y-%m-%d %H:%M:%S
    Time_Offset +0800
    Time_Keep On
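This parser change swaps the ISO-8601/zoned pattern for a zone-less local-time one, with `Time_Offset +0800` telling Fluent Bit how to interpret the stamp. A quick hedged check that a line matches the new shape, in the same `date '+%F %T'` format the demo log writers later in this diff emit (grep -P only approximates the parser's regex engine):

```bash
# Sketch: emit a line in the new zone-less format and eyeball it against the regex.
line="$(date '+%F %T') INFO [host01] training step=1 loss=1.23 model=bert"
echo "$line" >> /logs/train/train-demo.log
echo "$line" | grep -Pq '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s+\w+\s+.*$' \
  && echo "matches timestamp_parser" || echo "no match"
```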
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@ -1,109 +1,47 @@
#!/bin/bash
set -euo pipefail

echo "[INFO] Starting Fluent Bit setup in Ubuntu container (offline-first)..."
echo "[INFO] Starting Fluent Bit setup in Ubuntu container..."

# Install required tools
echo "[INFO] Installing required packages..."
export DEBIAN_FRONTEND=noninteractive
apt-get update -qq
apt-get install -y -qq curl

# Stage bundle to /tmp (read-only mount under /private)
echo "[INFO] Staging fluent-bit bundle..."
rm -rf /tmp/flb && mkdir -p /tmp/flb
cp -r /private/etc /tmp/flb/
mkdir -p /tmp/flb/packages
cp -r /private/packages/* /tmp/flb/packages/ 2>/dev/null || true
# Extract the bundle into /tmp
echo "[INFO] Extracting fluent-bit bundle..."
cp -r /private/etc /tmp
cp -r /private/packages /tmp
cd /tmp

# Helper: check and install a local deb if not already satisfied
ensure_lib() {
  local soname="$1"; shift
  local pattern="$1"; shift
  if ldconfig -p 2>/dev/null | grep -q "$soname"; then
    echo "[OK] $soname already present"
    return 0
  fi
  local deb="$(ls /tmp/flb/packages/$pattern 2>/dev/null | head -n1 || true)"
  if [[ -n "$deb" ]]; then
    echo "[INFO] Installing local dependency: $(basename "$deb")"
    dpkg -i "$deb" >/dev/null 2>&1 || true
  else
    echo "[WARN] Local deb for $soname not found (pattern=$pattern)"
  fi
  if ! ldconfig -p 2>/dev/null | grep -q "$soname"; then
    echo "[WARN] $soname still missing after local install; attempting apt fallback"
    apt-get update -qq || true
    case "$soname" in
      libpq.so.5) apt-get install -y -qq libpq5 || true ;;
      libyaml-0.so.2) apt-get install -y -qq libyaml-0-2 || true ;;
    esac
  fi
  ldconfig 2>/dev/null || true
}

# Offline-first: satisfy runtime deps from local debs, fallback to apt only if necessary
ensure_lib "libpq.so.5" "libpq5_*_amd64.deb"
ensure_lib "libyaml-0.so.2" "libyaml-0-2_*_amd64.deb"
ensure_lib "libsasl2.so.2" "libsasl2-2_*_amd64.deb"
ensure_lib "libldap-2.5.so.0" "libldap-2.5-0_*_amd64.deb"

# Install fluent-bit main package from local bundle
FLB_DEB="$(ls /tmp/flb/packages/fluent-bit_*_amd64.deb 2>/dev/null | head -n1 || true)"
if [[ -z "$FLB_DEB" ]]; then
  echo "[ERROR] fluent-bit deb not found under /private/packages" >&2
  exit 1
fi
echo "[INFO] Installing Fluent Bit: $(basename "$FLB_DEB")"
dpkg -i "$FLB_DEB" >/dev/null 2>&1 || true

# If dpkg reported unresolved dependencies, try apt -f only as last resort
if ! command -v /opt/fluent-bit/bin/fluent-bit >/dev/null 2>&1; then
  echo "[WARN] fluent-bit binary missing after dpkg; attempting apt --fix-broken"
  apt-get install -f -y -qq || true
fi

# Ensure runtime library dependencies are satisfied (libsasl2, libldap are required via libpq/curl)
MISSING=$(ldd /opt/fluent-bit/bin/fluent-bit 2>/dev/null | awk '/not found/{print $1}' | xargs -r echo || true)
if [[ -n "$MISSING" ]]; then
  echo "[WARN] missing shared libs: $MISSING"
  apt-get update -qq || true
  apt-get install -y -qq libsasl2-2 libldap-2.5-0 || true
  apt-get install -f -y -qq || true
fi
# Install Fluent Bit from the deb package
echo "[INFO] Installing Fluent Bit from deb package..."
dpkg -i /tmp/packages/fluent-bit_3.1.9_amd64.deb || true
apt-get install -f -y -qq  # resolve dependency issues

# Verify Fluent Bit can run
echo "[INFO] Fluent Bit version:"
/opt/fluent-bit/bin/fluent-bit --version || { echo "[ERROR] fluent-bit not installed or libraries missing" >&2; exit 1; }
/opt/fluent-bit/bin/fluent-bit --version

# Place configuration
# Create the configuration directory
mkdir -p /etc/fluent-bit
cp -r /tmp/flb/etc/* /etc/fluent-bit/
cp -r /tmp/etc/* /etc/fluent-bit/

# Create logs/buffers dirs
# Create log and buffer directories
mkdir -p /logs/train /logs/infer /buffers
chmod 755 /logs/train /logs/infer /buffers

# Log dir permissions: default to 1777 for host bind-mounted dirs (can be disabled via env var)
: "${ARGUS_LOGS_WORLD_WRITABLE:=1}"
if [[ "${ARGUS_LOGS_WORLD_WRITABLE}" == "1" ]]; then
  chmod 1777 /logs/train /logs/infer || true
else
  chmod 755 /logs/train /logs/infer || true
fi

# Buffer dir is for the process only; not writable from outside
chmod 770 /buffers || true

# Set ownership to fluent-bit (does not affect the 1777 sticky bit)
chown -R fluent-bit:fluent-bit /logs /buffers 2>/dev/null || true

# Wait for Elasticsearch via bash /dev/tcp to avoid curl dependency
echo "[INFO] Waiting for Elasticsearch to be ready (tcp ${ES_HOST}:${ES_PORT})..."
for i in $(seq 1 120); do
  if exec 3<>/dev/tcp/${ES_HOST}/${ES_PORT}; then
    exec 3<&- 3>&-
    echo "[INFO] Elasticsearch is ready"
    break
  fi
  [[ $i -eq 120 ]] && { echo "[ERROR] ES not reachable" >&2; exit 1; }
  sleep 1
# Wait for Elasticsearch to become ready
echo "[INFO] Waiting for Elasticsearch to be ready..."
while ! curl -fs http://${ES_HOST}:${ES_PORT}/_cluster/health >/dev/null 2>&1; do
  echo "  Waiting for ES at ${ES_HOST}:${ES_PORT}..."
  sleep 5
done
echo "[INFO] Elasticsearch is ready"

# Start Fluent Bit
echo "[INFO] Starting Fluent Bit with configuration from /etc/fluent-bit/"
echo "[INFO] Command: /opt/fluent-bit/bin/fluent-bit --config=/etc/fluent-bit/fluent-bit.conf"
exec /opt/fluent-bit/bin/fluent-bit --config=/etc/fluent-bit/fluent-bit.conf
exec /opt/fluent-bit/bin/fluent-bit \
  --config=/etc/fluent-bit/fluent-bit.conf
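Before baking a parser or config edit like the one above into an image, Fluent Bit can validate the file without starting the pipeline. A hedged sketch, assuming the installed build supports the `--dry-run` flag (recent 3.x releases do) and that the deb placed the binary at the path the script uses:

```bash
# Sketch: validate the staged Fluent Bit configuration without running it.
/opt/fluent-bit/bin/fluent-bit \
  --config=/etc/fluent-bit/fluent-bit.conf \
  --dry-run && echo "config OK"
```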
@ -2,7 +2,7 @@
set -euo pipefail

ES_HOST="${ELASTICSEARCH_HOSTS:-http://es:9200}"
KB_HOST="${KB_HOST:-http://127.0.0.1:5601}"
KB_HOST="http://localhost:5601"

echo "[INFO] Starting Kibana post-start configuration..."
@ -83,37 +83,50 @@ fix_replicas_idempotent() {
}

# Idempotently create a data view
create_or_ensure_data_view() {
    local name="$1"
    local title="$2"

    local list_response
    list_response=$(curl -fsS "$KB_HOST/api/data_views" -H 'kbn-xsrf: true' 2>/dev/null || echo "")

    if [ -z "$list_response" ]; then
        echo "[WARN] Failed to list data views, skipping creation check for $title"
        return
    fi

    if echo "$list_response" | grep -Fq "\"title\":\"$title\""; then
        echo "[INFO] Data view $title already exists, skipping"
        return
    fi

    echo "[INFO] Creating data view for $title indices (allowNoIndex)"

    curl -fsS -X POST "$KB_HOST/api/data_views/data_view?allowNoIndex=true" \
        -H 'kbn-xsrf: true' \
        -H 'Content-Type: application/json' \
        -d "{\"data_view\":{\"name\":\"$name\",\"title\":\"$title\",\"timeFieldName\":\"@timestamp\",\"allowNoIndex\":true}}" \
        >/dev/null && echo "[OK] Created $name data view" || echo "[WARN] Failed to create $name data view"
}

create_data_views_idempotent() {
    echo "[INFO] Checking and creating data views..."

    create_or_ensure_data_view "train" "train-*"
    create_or_ensure_data_view "infer" "infer-*"
    # Check whether matching indices exist
    local train_indices=$(curl -s "$ES_HOST/_cat/indices/train-*?h=index" 2>/dev/null | wc -l || echo "0")
    local infer_indices=$(curl -s "$ES_HOST/_cat/indices/infer-*?h=index" 2>/dev/null | wc -l || echo "0")

    # Create the train data view
    if [ "$train_indices" -gt 0 ]; then
        # Check whether the data view already exists
        local train_exists=$(curl -s "$KB_HOST/api/data_views" -H 'kbn-xsrf: true' 2>/dev/null | grep '"title":"train-\*"' | wc -l )

        if [ "$train_exists" -eq 0 ]; then
            echo "[INFO] Creating data view for train-* indices"
            curl -fsS -X POST "$KB_HOST/api/data_views/data_view" \
                -H 'kbn-xsrf: true' \
                -H 'Content-Type: application/json' \
                -d '{"data_view":{"name":"train","title":"train-*","timeFieldName":"@timestamp"}}' \
                >/dev/null && echo "[OK] Created train data view" || echo "[WARN] Failed to create train data view"
        else
            echo "[INFO] Train data view already exists, skipping"
        fi
    else
        echo "[INFO] No train-* indices found, skipping train data view creation"
    fi

    # Create the infer data view
    if [ "$infer_indices" -gt 0 ]; then
        # Check whether the data view already exists
        local infer_exists=$(curl -s "$KB_HOST/api/data_views" -H 'kbn-xsrf: true' 2>/dev/null | grep '"title":"infer-\*"' | wc -l )

        if [ "$infer_exists" -eq 0 ]; then
            echo "[INFO] Creating data view for infer-* indices"
            curl -fsS -X POST "$KB_HOST/api/data_views/data_view" \
                -H 'kbn-xsrf: true' \
                -H 'Content-Type: application/json' \
                -d '{"data_view":{"name":"infer","title":"infer-*","timeFieldName":"@timestamp"}}' \
                >/dev/null && echo "[OK] Created infer data view" || echo "[WARN] Failed to create infer data view"
        else
            echo "[INFO] Infer data view already exists, skipping"
        fi
    else
        echo "[INFO] No infer-* indices found, skipping infer data view creation"
    fi
}

# Main logic
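The rewritten helper above treats Kibana's own data-view list as the source of truth instead of counting indices; the same endpoint doubles as a quick post-deploy check:

```bash
# Sketch: list data-view titles via the same Kibana endpoint the script uses.
curl -fsS "http://localhost:5601/api/data_views" -H 'kbn-xsrf: true' \
  | grep -o '"title":"[^"]*"' | sort -u
# Expect "title":"train-*" and "title":"infer-*" in the output.
```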
@ -32,42 +32,3 @@ fi

echo "[OK] 初始化完成: private/argus/log/{elasticsearch,kibana}"
echo "[INFO] Fluent-bit files should be in fluent-bit/ directory"

# Prepare Fluent Bit offline dependencies (copy debs from metric all-in-one-full into ../fluent-bit/build/packages)
FLB_BUILD_PACKAGES_DIR="$root/../fluent-bit/build/packages"
mkdir -p "$FLB_BUILD_PACKAGES_DIR"
for deb in \
  "$project_root/src/metric/client-plugins/all-in-one-full/plugins/fluent-bit/bin/libyaml-0-2_"*_amd64.deb \
  "$project_root/src/metric/client-plugins/all-in-one-full/plugins/fluent-bit/bin/libpq5_"*_amd64.deb ; do
  if ls $deb >/dev/null 2>&1; then
    for f in $deb; do
      base="$(basename "$f")"
      if [[ ! -f "$FLB_BUILD_PACKAGES_DIR/$base" ]]; then
        cp "$f" "$FLB_BUILD_PACKAGES_DIR/"
        echo "  [+] copied $base"
      fi
    done
  fi
done

# Extra: unpack required deps (libsasl2/ldap) from all-in-one-full's ubuntu22/curl.tar.gz for offline installs
CURLOPT_TAR="$project_root/src/metric/client-plugins/all-in-one-full/deps/ubuntu22/curl.tar.gz"
if [[ -f "$CURLOPT_TAR" ]]; then
  tmpdir=$(mktemp -d)
  if tar -xzf "$CURLOPT_TAR" -C "$tmpdir" 2>/dev/null; then
    for p in \
      libsasl2-2_*_amd64.deb \
      libsasl2-modules-db_*_amd64.deb \
      libldap-2.5-0_*_amd64.deb \
      libidn2-0_*_amd64.deb \
      libbrotli1_*_amd64.deb \
      libssl3_*_amd64.deb ; do
      src=$(ls "$tmpdir"/curl/$p 2>/dev/null | head -n1 || true)
      if [[ -n "$src" ]]; then
        base="$(basename "$src")"
        [[ -f "$FLB_BUILD_PACKAGES_DIR/$base" ]] || cp "$src" "$FLB_BUILD_PACKAGES_DIR/" && echo "  [+] staged $base"
      fi
    done
  fi
  rm -rf "$tmpdir"
fi
@ -28,11 +28,11 @@ fi
docker exec "$container_name" mkdir -p /logs/train /logs/infer

# Write training logs (host01)
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=1 loss=1.23 model=bert\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=2 loss=1.15 model=bert\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=1 loss=1.23 model=bert\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host01] training step=2 loss=1.15 model=bert\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"

# Write inference logs (host01)
docker exec "$container_name" sh -c "printf '%s ERROR [host01] inference failed on batch=1\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "printf '%s ERROR [host01] inference failed on batch=1\n' \"\$(date '+%F %T')\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "cat <<'STACK' >> /logs/infer/infer-demo.log
Traceback (most recent call last):
  File \"inference.py\", line 15, in <module>
@ -28,13 +28,13 @@ fi
docker exec "$container_name" mkdir -p /logs/train /logs/infer

# Write training logs (host02)
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=1 loss=1.45 model=gpt\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=2 loss=1.38 model=gpt\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=3 loss=1.32 model=gpt\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=1 loss=1.45 model=gpt\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=2 loss=1.38 model=gpt\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] training step=3 loss=1.32 model=gpt\n' \"\$(date '+%F %T')\" >> /logs/train/train-demo.log"

# Write inference logs (host02)
docker exec "$container_name" sh -c "printf '%s WARN [host02] inference slow on batch=5 latency=2.3s\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] inference completed batch=6 latency=0.8s\n' \"\$(date -u +%Y-%m-%dT%H:%M:%SZ)\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "printf '%s WARN [host02] inference slow on batch=5 latency=2.3s\n' \"\$(date '+%F %T')\" >> /logs/infer/infer-demo.log"
docker exec "$container_name" sh -c "printf '%s INFO [host02] inference completed batch=6 latency=0.8s\n' \"\$(date '+%F %T')\" >> /logs/infer/infer-demo.log"

echo "[OK] 已通过docker exec写入测试日志到 host02 容器内:"
echo "  - /logs/train/train-demo.log"
@ -115,32 +115,20 @@ show_step "Health" "Check service health"
echo "[INFO] Checking service health..."

# Check Elasticsearch health
health_check_ok=1
es_health=$(curl -s "http://localhost:9200/_cluster/health" | grep -o '"status":"[^"]*"' | cut -d'"' -f4)
if [ "$es_health" = "green" ] || [ "$es_health" = "yellow" ]; then
    echo "✅ Elasticsearch health: $es_health"
else
    echo "❌ Elasticsearch health: $es_health"
    health_check_ok=0
fi

# Check Kibana status
if curl -fs "http://localhost:5601/api/status" >/dev/null 2>&1; then
    kb_status="available"
    echo "✅ Kibana status: $kb_status"

    data_views_json=$(curl -fs "http://localhost:5601/api/data_views" -H 'kbn-xsrf: true' 2>/dev/null || true)
    if echo "$data_views_json" | grep -F '"title":"train-*"' >/dev/null 2>&1 && \
       echo "$data_views_json" | grep -F '"title":"infer-*"' >/dev/null 2>&1; then
        echo "✅ Kibana data views: train-* and infer-* present"
    else
        echo "❌ Kibana data views missing: train-* or infer-*"
        health_check_ok=0
    fi
else
    kb_status="unavailable"
    echo "⚠️ Kibana status: $kb_status"
    health_check_ok=0
fi

# Check Fluent-Bit metrics
@ -151,13 +139,6 @@ if [ "$fb_host01_uptime" -gt 0 ] && [ "$fb_host02_uptime" -gt 0 ]; then
    echo "✅ Fluent-Bit services: host01 uptime=${fb_host01_uptime}s, host02 uptime=${fb_host02_uptime}s"
else
    echo "⚠️ Fluent-Bit services: host01 uptime=${fb_host01_uptime}s, host02 uptime=${fb_host02_uptime}s"
    health_check_ok=0
fi

if [ "$health_check_ok" -eq 1 ]; then
    true
else
    false
fi

verify_step "Service health check"
@ -13,8 +13,6 @@ class AppConfig:
    scheduler_interval_seconds: int
    node_id_prefix: str
    auth_mode: str
    target_prefer_net_cidrs: str
    target_reachability_check: bool


def _get_int_env(name: str, default: int) -> int:
@ -29,12 +27,6 @@ def _get_int_env(name: str, default: int) -> int:

def load_config() -> AppConfig:
    """Read environment variables into one config object for easier runtime management."""
    def _bool_env(name: str, default: bool) -> bool:
        raw = os.environ.get(name)
        if raw is None or raw.strip() == "":
            return default
        return raw.strip().lower() in ("1", "true", "yes", "on")

    return AppConfig(
        db_path=os.environ.get("DB_PATH", "/private/argus/master/db.sqlite3"),
        metric_nodes_json_path=os.environ.get(
@ -45,6 +37,4 @@ def load_config() -> AppConfig:
        scheduler_interval_seconds=_get_int_env("SCHEDULER_INTERVAL_SECONDS", 30),
        node_id_prefix=os.environ.get("NODE_ID_PREFIX", "A"),
        auth_mode=os.environ.get("AUTH_MODE", "disabled"),
        target_prefer_net_cidrs=os.environ.get("TARGET_PREFER_NET_CIDRS", "10.0.0.0/8,172.31.0.0/16"),
        target_reachability_check=_bool_env("TARGET_REACHABILITY_CHECK", False),
    )
@ -1,10 +1,8 @@
from __future__ import annotations

import ipaddress
import logging
import socket
import threading
from typing import Optional, Iterable, Dict, Any, List
from typing import Optional

from .config import AppConfig
from .storage import Storage
@ -36,117 +34,10 @@ class StatusScheduler:
        self._pending_nodes_json.set()

    def generate_nodes_json(self) -> None:
        """Generate Prometheus scrape targets from online nodes, preferring the overlay IP.

        Candidate order: meta.overlay_ip > hostname A record (matching preferred CIDRs) > meta.ip.
        Optional reachability check: when TARGET_REACHABILITY_CHECK=true, run a 1 s TCP connect
        test against 9100/9400 and pick the first reachable candidate; if all fail, take the
        first in order and log it.
        """
        with self._nodes_json_lock:
            rows = self._storage.get_online_nodes_meta()
            prefer_cidrs = self._parse_cidrs(self._config.target_prefer_net_cidrs)
            reachability = self._config.target_reachability_check

            result: List[Dict[str, Any]] = []
            for row in rows:
                meta = row.get("meta", {})
                hostname = meta.get("hostname") or row.get("name")
                labels = row.get("labels") or []

                overlay_ip = meta.get("overlay_ip")
                legacy_ip = meta.get("ip")
                host_candidates = self._resolve_host_ips(hostname)
                host_pref = self._pick_by_cidrs(host_candidates, prefer_cidrs)

                candidates: List[str] = []
                for ip in [overlay_ip, host_pref, legacy_ip]:
                    if ip and ip not in candidates:
                        candidates.append(ip)

                chosen = None
                if reachability:
                    ports = [9100]
                    try:
                        if int(meta.get("gpu_number", 0)) > 0:
                            ports.append(9400)
                    except Exception:
                        pass
                    for ip in candidates:
                        if any(self._reachable(ip, p, 1.0) for p in ports):
                            chosen = ip
                            break
                if not chosen:
                    chosen = candidates[0] if candidates else legacy_ip
                if not chosen:
                    # ultimate fallback: 127.0.0.1 (should not happen)
                    chosen = "127.0.0.1"
                    self._logger.warning("No candidate IPs for node; falling back", extra={"node": row.get("node_id")})

                if chosen and ipaddress.ip_address(chosen) in ipaddress.ip_network("172.22.0.0/16"):
                    self._logger.warning(
                        "Prometheus target uses docker_gwbridge address; prefer overlay",
                        extra={"node": row.get("node_id"), "ip": chosen},
                    )

                result.append(
                    {
                        "node_id": row.get("node_id"),
                        "user_id": meta.get("user"),
                        "ip": chosen,
                        "hostname": hostname,
                        "labels": labels if isinstance(labels, list) else [],
                    }
                )

            atomic_write_json(self._config.metric_nodes_json_path, result)
            self._logger.info("nodes.json updated", extra={"count": len(result)})

    # ---------------------------- helpers ----------------------------
    @staticmethod
    def _parse_cidrs(raw: str) -> List[ipaddress.IPv4Network]:
        nets: List[ipaddress.IPv4Network] = []
        for item in (x.strip() for x in (raw or "").split(",")):
            if not item:
                continue
            try:
                net = ipaddress.ip_network(item, strict=False)
                if isinstance(net, ipaddress.IPv4Network):
                    nets.append(net)
            except ValueError:
                continue
        return nets

    @staticmethod
    def _resolve_host_ips(hostname: str) -> List[str]:
        ips: List[str] = []
        try:
            infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
            for info in infos:
                ip = info[4][0]
                if ip not in ips:
                    ips.append(ip)
        except OSError:
            pass
        return ips

    @staticmethod
    def _pick_by_cidrs(candidates: Iterable[str], prefer: List[ipaddress.IPv4Network]) -> str | None:
        for net in prefer:
            for ip in candidates:
                try:
                    if ipaddress.ip_address(ip) in net:
                        return ip
                except ValueError:
                    continue
        return None

    @staticmethod
    def _reachable(ip: str, port: int, timeout: float) -> bool:
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return True
        except OSError:
            return False
        online_nodes = self._storage.get_online_nodes()
        atomic_write_json(self._config.metric_nodes_json_path, online_nodes)
        self._logger.info("nodes.json updated", extra={"count": len(online_nodes)})

    # ------------------------------------------------------------------
    # internal loop

@ -324,35 +324,9 @@ class Storage:
            {
                "node_id": row["id"],
                "user_id": meta.get("user"),
                "ip": meta.get("ip"),  # kept for backward-compat; preferred IP selection handled in scheduler
                "ip": meta.get("ip"),
                "hostname": meta.get("hostname", row["name"]),
                "labels": labels if isinstance(labels, list) else [],
            }
        )
        return result

    def get_online_nodes_meta(self) -> List[Dict[str, Any]]:
        """Return online nodes' raw meta plus name and labels; the caller picks the target IP.

        Each item contains: { node_id, name, meta, labels }
        """
        with self._lock:
            cur = self._conn.execute(
                "SELECT id, name, meta_json, labels_json FROM nodes WHERE status = ? ORDER BY id ASC",
                ("online",),
            )
            rows = cur.fetchall()

            result: List[Dict[str, Any]] = []
            for row in rows:
                meta = json.loads(row["meta_json"]) if row["meta_json"] else {}
                labels = json.loads(row["labels_json"]) if row["labels_json"] else []
                result.append(
                    {
                        "node_id": row["id"],
                        "name": row["name"],
                        "meta": meta if isinstance(meta, dict) else {},
                        "labels": labels if isinstance(labels, list) else [],
                    }
                )
            return result
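The scheduler's optional reachability check is a plain 1 s TCP connect against 9100 (node exporter) and, when gpu_number > 0, 9400 (dcgm exporter). When debugging a stale nodes.json entry, the same probe can be reproduced from a shell; the node IP below is a placeholder:

```bash
# Sketch: reproduce the scheduler's TCP reachability probe by hand.
# 10.0.1.5 is a hypothetical node IP; 9100/9400 mirror the scheduler's port list.
for port in 9100 9400; do
  if timeout 1 bash -c ">/dev/tcp/10.0.1.5/$port" 2>/dev/null; then
    echo "port $port reachable"
  else
    echo "port $port unreachable"
  fi
done
```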
@ -3,13 +3,12 @@ set -euo pipefail

usage() {
  cat >&2 <<'USAGE'
Usage: $0 [--intranet] [--offline] [--tag <image_tag>] [--no-cache]
Usage: $0 [--intranet] [--offline] [--tag <image_tag>]

Options:
  --intranet         使用指定的 PyPI 镜像源(默认清华镜像)。
  --offline          完全离线构建,依赖 offline_wheels/ 目录中的离线依赖包。
  --tag <image_tag>  自定义镜像标签,默认 argus-master:latest。
  --no-cache         不使用 Docker 构建缓存。
USAGE
}

@ -20,7 +19,6 @@ IMAGE_TAG="${IMAGE_TAG:-argus-master:latest}"
DOCKERFILE="src/master/Dockerfile"
BUILD_ARGS=()
OFFLINE_MODE=0
NO_CACHE=0

source "$PROJECT_ROOT/scripts/common/build_user.sh"
load_build_user
@ -47,11 +45,6 @@ while [[ "$#" -gt 0 ]]; do
      IMAGE_TAG="$2"
      shift 2
      ;;
    --no-cache)
      NO_CACHE=1
      BUILD_ARGS+=("--no-cache")
      shift
      ;;
    -h|--help)
      usage
      exit 0
2 src/metric/.gitignore vendored
@ -4,4 +4,4 @@
/client-plugins/demo-all-in-one/publish/
/client-plugins/demo-all-in-one/checklist
/client-plugins/demo-all-in-one/VERSION
/client-plugins/all-in-one-full/artifact/
/client-plugins/all-in-one-full/
@ -4,15 +4,6 @@

set -e

# PID-file guard to prevent concurrent runs
PIDFILE="/var/run/check_health.pid"
if [ -f "$PIDFILE" ] && kill -0 $(cat "$PIDFILE") 2>/dev/null; then
    echo "健康检查脚本已在运行中,跳过本次执行" >&2
    exit 0
fi
echo $$ > "$PIDFILE"
trap "rm -f $PIDFILE" EXIT

# Resolve the script's directory
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
HEALTH_LOG_FILE="$SCRIPT_DIR/.health_log"
@ -200,22 +200,22 @@ parse_version_info() {
    VERSION=$(grep '"version"' "$VERSION_FILE_PATH" | sed 's/.*"version": *"\([^"]*\)".*/\1/')
    BUILD_TIME=$(grep '"build_time"' "$VERSION_FILE_PATH" | sed 's/.*"build_time": *"\([^"]*\)".*/\1/')

    # Parse artifact_list (skip the field name itself)
    grep -A 100 '"artifact_list"' "$VERSION_FILE_PATH" | grep -v '"artifact_list"' | grep -E '^\s*"[^"]+":\s*"[^"]+"' | while read line; do
    # Parse artifact_list
    grep -A 100 '"artifact_list"' "$VERSION_FILE_PATH" | grep -E '^\s*"[^"]+":\s*"[^"]+"' | while read line; do
        component=$(echo "$line" | sed 's/.*"\([^"]*\)":\s*"[^"]*".*/\1/')
        version=$(echo "$line" | sed 's/.*"[^"]*":\s*"\([^"]*\)".*/\1/')
        echo "$component:$version" >> "$TEMP_DIR/components.txt"
    done

    # Parse checksums (skip the field name itself)
    grep -A 100 '"checksums"' "$VERSION_FILE_PATH" | grep -v '"checksums"' | grep -E '^\s*"[^"]+":\s*"[^"]+"' | while read line; do
    # Parse checksums
    grep -A 100 '"checksums"' "$VERSION_FILE_PATH" | grep -E '^\s*"[^"]+":\s*"[^"]+"' | while read line; do
        component=$(echo "$line" | sed 's/.*"\([^"]*\)":\s*"[^"]*".*/\1/')
        checksum=$(echo "$line" | sed 's/.*"[^"]*":\s*"\([^"]*\)".*/\1/')
        echo "$component:$checksum" >> "$TEMP_DIR/checksums.txt"
    done

    # Parse install_order (skip the field name itself; take only array elements)
    grep -A 100 '"install_order"' "$VERSION_FILE_PATH" | grep -v '"install_order"' | grep -E '^\s*"[^"]+"' | while read line; do
    # Parse install_order
    grep -A 100 '"install_order"' "$VERSION_FILE_PATH" | grep -E '^\s*"[^"]+"' | while read line; do
        component=$(echo "$line" | sed 's/.*"\([^"]*\)".*/\1/')
        echo "$component" >> "$TEMP_DIR/install_order.txt"
    done
@ -317,152 +317,85 @@ create_install_dirs() {
    log_success "安装目录创建完成: $INSTALL_DIR"
}

# Get the system version
get_system_version() {
    if [[ ! -f /etc/os-release ]]; then
        log_error "无法检测操作系统版本"
        return 1
    fi

    source /etc/os-release

    # Extract the major version
    case "$VERSION_ID" in
        "20.04")
            echo "ubuntu20"
            ;;
        "22.04")
            echo "ubuntu22"
            ;;
        *)
            log_warning "未识别的Ubuntu版本: $VERSION_ID,尝试使用ubuntu22"
            echo "ubuntu22"
            ;;
    esac
}

# Install system dependency packages
install_system_deps() {
    log_info "检查系统依赖包..."

    local script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
    local deps_dir="$script_dir/deps"

    # Check whether the deps directory exists
    if [[ ! -d "$deps_dir" ]]; then
        log_info "deps 目录不存在,跳过系统依赖包安装"
        return 0
    fi

    # Resolve the dependency directory for this system version
    local system_version=$(get_system_version)
    local version_deps_dir="$deps_dir/$system_version"

    log_info "检测到系统版本: $system_version"

    # Check whether the version-specific dependency directory exists
    if [[ ! -d "$version_deps_dir" ]]; then
        log_warning "未找到 $system_version 版本的依赖目录: $version_deps_dir"
        # Fall back to the old logic and check the root deps directory
        local deps_count=$(find "$deps_dir" -name "*.tar.gz" | wc -l)
        if [[ $deps_count -eq 0 ]]; then
            log_info "deps 目录中没有 tar.gz 文件,跳过系统依赖包安装"
            return 0
        fi
        version_deps_dir="$deps_dir"
    else
        # Check whether the version directory contains tar.gz files
        local deps_count=$(find "$version_deps_dir" -name "*.tar.gz" | wc -l)
        if [[ $deps_count -eq 0 ]]; then
            log_info "$system_version 版本目录中没有 tar.gz 文件,跳过系统依赖包安装"
            return 0
        fi
    # Check whether any tar.gz files exist
    local deps_count=$(find "$deps_dir" -name "*.tar.gz" | wc -l)
    if [[ $deps_count -eq 0 ]]; then
        log_info "deps 目录中没有 tar.gz 文件,跳过系统依赖包安装"
        return 0
    fi

    log_info "找到 $system_version 版本的依赖包,开始安装..."

    log_info "找到 $deps_count 个系统依赖包,开始安装..."

    # Create a temp directory for extracting dependency packages
    local deps_temp_dir="${TEMP_DIR:-/tmp}/deps"
    local deps_temp_dir="$TEMP_DIR/deps"
    mkdir -p "$deps_temp_dir"

    # Define the core dependencies to check
    local CORE_DEPS=(jq cron curl)
    local FAILED_DEPS=()

    # Process each tar.gz file
    find "$version_deps_dir" -name "*.tar.gz" | while read tar_file; do
    find "$deps_dir" -name "*.tar.gz" | while read tar_file; do
        local tar_basename=$(basename "$tar_file")
        local extract_name="${tar_basename%.tar.gz}"

        log_info "处理依赖包: $tar_basename"

        # Extract into the temp directory
        local extract_dir="$deps_temp_dir/$extract_name"
        mkdir -p "$extract_dir"

        if tar -xzf "$tar_file" -C "$extract_dir" 2>/dev/null; then
            log_success "  $tar_basename 解压完成"
        else
            log_error "  $tar_basename 解压失败"
            continue
        fi

        # Enter the extraction dir and look for deb packages
        cd "$extract_dir" || continue
        local deb_files=(*.deb)
        if [[ ${#deb_files[@]} -gt 0 ]]; then
            log_info "  找到 ${#deb_files[@]} 个 deb 包,开始安装..."

            for deb in "${deb_files[@]}"; do
                local pkg_name
                pkg_name=$(dpkg-deb -f "$deb" Package 2>/dev/null)

                # Skip if already installed
                if dpkg -s "$pkg_name" &>/dev/null; then
                    log_success "  $pkg_name 已安装,跳过"
                    continue
                fi

                # Try to install
                log_info "  安装 $pkg_name..."
                if DEBIAN_FRONTEND=noninteractive dpkg -i "$deb" &>/dev/null; then
                    log_success "  $pkg_name 安装成功"
        cd "$extract_dir"
        local deb_count=$(find . -name "*.deb" | wc -l)

        if [[ $deb_count -gt 0 ]]; then
            log_info "  找到 $deb_count 个 deb 包,开始安装..."

            # 1. First try installing all deb packages in one batch
            log_info "  第1步:批量安装deb包..."
            if dpkg -i *.deb 2>/dev/null; then
                log_success "  所有deb包安装成功"
            else
                log_warning "  部分deb包安装失败,可能存在依赖问题"

                # 2. Repair dependencies with apt-get
                log_info "  第2步:修复依赖关系..."
                if apt-get install -f -y; then
                    log_success "  依赖关系修复完成"
                else
                    log_warning "  $pkg_name 安装失败,尝试修复依赖..."
                    if DEBIAN_FRONTEND=noninteractive apt-get install -f -y &>/dev/null; then
                        if dpkg -s "$pkg_name" &>/dev/null; then
                            log_success "  $pkg_name 修复安装成功"
                        else
                            log_error "  $pkg_name 仍未安装成功"
                            FAILED_DEPS+=("$pkg_name")
                        fi
                    else
                        log_error "  $pkg_name 自动修复失败"
                        FAILED_DEPS+=("$pkg_name")
                    fi
                    log_error "  依赖关系修复失败"
                    # Continue with other packages; do not exit
                fi
            done
            fi
        else
            log_info "  $tar_basename 中没有找到deb包,跳过"
        fi

        # Return to the dependency temp directory
        cd "$deps_temp_dir" || continue
        cd "$deps_temp_dir"
    done

    # Check and start the cron service
    start_cron_service

    # Summarize installation results
    if [[ ${#FAILED_DEPS[@]} -gt 0 ]]; then
        log_error "以下系统依赖未能成功安装,安装终止,请手动安装后重试:"
        for f in "${FAILED_DEPS[@]}"; do
            echo "  - $f"
        done
        exit 1
    else
        log_success "系统依赖包安装完成,全部就绪"
    fi

    log_success "系统依赖包安装完成"
}

# Start the cron service
@ -704,18 +637,6 @@ EOF
    log_success "安装记录已创建: $install_record_file"
}

# Check whether a cron task already exists
check_cron_task_exists() {
    local task_pattern="$1"
    local temp_cron="$2"

    if grep -q "$task_pattern" "$temp_cron"; then
        return 0  # task already exists
    else
        return 1  # task does not exist
    fi
}

# Set up the health-check cron task
setup_health_check_cron() {
    log_info "设置健康检查定时任务..."
@ -740,7 +661,7 @@ setup_health_check_cron() {
    crontab -l 2>/dev/null > "$temp_cron" || touch "$temp_cron"

    # Check for and remove the old health-check task
    if check_cron_task_exists "check_health.sh" "$temp_cron"; then
    if grep -q "check_health.sh" "$temp_cron"; then
        log_info "发现旧的健康检查定时任务,正在更新..."
        # Remove all lines containing check_health.sh
        grep -v "check_health.sh" "$temp_cron" > "$temp_cron.new"
@ -795,7 +716,7 @@ setup_dns_sync_cron() {
    crontab -l 2>/dev/null > "$temp_cron" || touch "$temp_cron"

    # Check for and remove the old DNS sync task
    if check_cron_task_exists "sync_dns.sh" "$temp_cron"; then
    if grep -q "sync_dns.sh" "$temp_cron"; then
        log_info "发现旧的 DNS 同步定时任务,正在更新..."
        # Remove all lines containing sync_dns.sh
        grep -v "sync_dns.sh" "$temp_cron" > "$temp_cron.new"
@ -803,15 +724,16 @@
        log_info "旧的 DNS 同步定时任务已删除"
    fi

    # Add the new cron task (runs every minute)
    # Add the new cron task (runs every 30 seconds)
    # Use the DNS sync script from the version directory directly
    echo "# Argus-Metrics DNS 同步定时任务" >> "$temp_cron"
    echo "* * * * * $sync_dns_script >> $INSTALL_DIR/.dns_sync.log 2>&1" >> "$temp_cron"
    echo "* * * * * sleep 30; $sync_dns_script >> $INSTALL_DIR/.dns_sync.log 2>&1" >> "$temp_cron"

    # Install the new crontab
    if crontab "$temp_cron"; then
        log_success "DNS 同步定时任务设置成功"
        log_info "  执行频率: 每1分钟"
        log_info "  执行频率: 每30秒"
        log_info "  日志文件: $INSTALL_DIR/.dns_sync.log"
        log_info "  查看定时任务: crontab -l"
        log_info "  删除定时任务: crontab -e"
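Worth noting: a single `* * * * * sleep 30; ...` entry still fires only once per minute, just offset to second 30, so on its own it is not a true 30-second cadence. The usual cron pattern for every 30 seconds is two entries, one immediate and one delayed. A hedged sketch; the script path is illustrative, and piping into `crontab -` replaces the whole crontab, so it is for demonstration only:

```bash
# Sketch: run sync_dns.sh twice per minute (at :00 and :30).
# The path stands in for the versioned $sync_dns_script used above.
cat <<'CRON' | crontab -
* * * * * /opt/argus-metric/current/sync_dns.sh >> /opt/argus-metric/.dns_sync.log 2>&1
* * * * * sleep 30; /opt/argus-metric/current/sync_dns.sh >> /opt/argus-metric/.dns_sync.log 2>&1
CRON
```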
@ -849,7 +771,7 @@ setup_version_check_cron() {
    crontab -l > "$temp_cron" 2>/dev/null || touch "$temp_cron"

    # Check whether a version-check cron task already exists
    if check_cron_task_exists "check_version.sh" "$temp_cron"; then
    if grep -q "check_version.sh" "$temp_cron"; then
        log_info "发现旧的版本校验定时任务,正在更新..."
        # Remove all lines containing check_version.sh
        grep -v "check_version.sh" "$temp_cron" > "$temp_cron.new"
@ -902,7 +824,7 @@ setup_restart_cron() {
    crontab -l > "$temp_cron" 2>/dev/null || touch "$temp_cron"

    # Check whether an auto-restart cron task already exists
    if check_cron_task_exists "restart_unhealthy.sh" "$temp_cron"; then
    if grep -q "restart_unhealthy.sh" "$temp_cron"; then
        log_info "发现旧的自动重启定时任务,正在更新..."
        # Remove all lines containing restart_unhealthy.sh
        grep -v "restart_unhealthy.sh" "$temp_cron" > "$temp_cron.new"
@ -963,9 +885,9 @@ main() {
    check_system
    find_version_file
    create_install_dirs
    install_system_deps
    parse_version_info
    parse_version_info
    verify_checksums
    install_system_deps
    install_components
    copy_config_files
    create_install_record
@ -973,20 +895,6 @@ main() {
    setup_dns_sync_cron
    setup_version_check_cron
    setup_restart_cron

    # Immediate health check is commented out to avoid duplicating the cron task
    # log_info "立即执行一次健康检查..."
    # local check_health_script="$INSTALL_DIR/check_health.sh"
    # if [[ -f "$check_health_script" ]]; then
    #     if "$check_health_script" >> "$INSTALL_DIR/.health_check.log" 2>&1; then
    #         log_success "健康检查执行完成"
    #     else
    #         log_warning "健康检查执行失败,请检查日志: $INSTALL_DIR/.health_check.log"
    #     fi
    # else
    #     log_warning "健康检查脚本不存在: $check_health_script"
    # fi

    show_install_info
}
@ -29,68 +29,26 @@ log_error() {
show_help() {
    echo "Argus-Metric Artifact 发布脚本"
    echo
    echo "用法: $0 <版本号> [选项]"
    echo "用法: $0 <版本号>"
    echo
    echo "参数:"
    echo "  <版本号>    要发布的版本号,对应 artifact 目录中的版本"
    echo
    echo "选项:"
    echo "  --output-dir <路径>  指定输出目录 (默认: /private/argus/ftp/share/)"
    echo "  --owner <uid:gid>    指定文件所有者 (默认: 2133:2015)"
    echo "  -h, --help           显示此帮助信息"
    echo "  <版本号>    要发布的版本号,对应 artifact 目录中的版本"
    echo
    echo "示例:"
    echo "  $0 1.20.0                                   # 使用默认配置发布"
    echo "  $0 1.20.0 --output-dir /tmp/publish         # 指定输出目录"
    echo "  $0 1.20.0 --owner 1000:1000                 # 指定文件所有者"
    echo "  $0 1.20.0 --output-dir /srv/ftp --owner root:root  # 同时指定两者"
    echo "  $0 1.20.0    # 发布 1.20.0 版本"
    echo
}

# Default configuration
DEFAULT_PUBLISH_DIR="/private/argus/ftp/share/"
DEFAULT_OWNER="2133:2015"

# Parse arguments
VERSION=""
PUBLISH_DIR="$DEFAULT_PUBLISH_DIR"
OWNER="$DEFAULT_OWNER"

while [[ $# -gt 0 ]]; do
    case $1 in
        -h|--help)
            show_help
            exit 0
            ;;
        --output-dir)
            PUBLISH_DIR="$2"
            shift 2
            ;;
        --owner)
            OWNER="$2"
            shift 2
            ;;
        *)
            if [[ -z "$VERSION" ]]; then
                VERSION="$1"
                shift
            else
                log_error "未知参数: $1"
                show_help
                exit 1
            fi
            ;;
    esac
done

# Check that a version number was provided
if [[ -z "$VERSION" ]]; then
# Check arguments
if [[ $# -ne 1 ]]; then
    log_error "请提供版本号参数"
    show_help
    exit 1
fi

VERSION="$1"
ARTIFACT_DIR="artifact/$VERSION"
PUBLISH_DIR="/Users/sundapeng/Project/nlp/aiops/client-plugins/all-in-one/publish/"

# Check that the version directory exists
if [[ ! -d "$ARTIFACT_DIR" ]]; then
@ -99,32 +57,11 @@ if [[ ! -d "$ARTIFACT_DIR" ]]; then
fi

log_info "开始发布版本: $VERSION"
log_info "输出目录: $PUBLISH_DIR"
log_info "文件所有者: $OWNER"

# Ensure the publish directory exists
log_info "确保发布目录存在: $PUBLISH_DIR"
mkdir -p "$PUBLISH_DIR"

IFS=':' read -r OWNER_UID OWNER_GID <<< "$OWNER"
if [[ -z "$OWNER_UID" || -z "$OWNER_GID" ]]; then
    log_error "--owner 格式不正确,应为 uid:gid"
    exit 1
fi

CURRENT_UID=$(id -u)
CURRENT_GID=$(id -g)
if [[ "$OWNER_UID" != "$CURRENT_UID" || "$OWNER_GID" != "$CURRENT_GID" ]]; then
    if [[ "$CURRENT_UID" -ne 0 ]]; then
        log_error "当前用户 (${CURRENT_UID}:${CURRENT_GID}) 无法设置所有者为 ${OWNER_UID}:${OWNER_GID}"
        log_error "请以目标用户运行脚本或预先调整目录权限"
        exit 1
    fi
    NEED_CHOWN=true
else
    NEED_CHOWN=false
fi

# Create a temp directory for packaging
TEMP_PACKAGE_DIR="/tmp/argus-metric-package-$$"
mkdir -p "$TEMP_PACKAGE_DIR"
@ -230,28 +167,17 @@ cd "$TEMP_PACKAGE_DIR"
tar -czf "$PUBLISH_DIR/$TAR_NAME" *
cd - > /dev/null

if [[ "$NEED_CHOWN" == true ]]; then
    log_info "设置文件所有者为: $OWNER"
    chown "$OWNER" "$PUBLISH_DIR/$TAR_NAME"
fi

# Clean up the temp directory
rm -rf "$TEMP_PACKAGE_DIR"

# Update the LATEST_VERSION file
log_info "更新 LATEST_VERSION 文件..."
echo "$VERSION" > "$PUBLISH_DIR/LATEST_VERSION"
if [[ "$NEED_CHOWN" == true ]]; then
    chown "$OWNER" "$PUBLISH_DIR/LATEST_VERSION"
fi

# Copy the DNS config file into the publish root (straight from the config directory)
if [[ -f "config/dns.conf" ]]; then
    log_info "复制 DNS 配置文件到发布目录根目录..."
    cp "config/dns.conf" "$PUBLISH_DIR/"
    if [[ "$NEED_CHOWN" == true ]]; then
        chown "$OWNER" "$PUBLISH_DIR/dns.conf"
    fi
    log_success "DNS 配置文件复制完成: $PUBLISH_DIR/dns.conf"
else
    log_warning "未找到 config/dns.conf 文件,跳过 DNS 配置文件复制"
@ -261,9 +187,6 @@ fi
if [[ -f "scripts/setup.sh" ]]; then
    log_info "复制 setup.sh 到发布目录..."
    cp "scripts/setup.sh" "$PUBLISH_DIR/"
    if [[ "$NEED_CHOWN" == true ]]; then
        chown "$OWNER" "$PUBLISH_DIR/setup.sh"
    fi
fi

# Show the publish result
@ -2,15 +2,6 @@

# This script checks each component's health status and restarts unhealthy components

# PID-file guard to prevent concurrent runs
PIDFILE="/var/run/restart_unhealthy.pid"
if [ -f "$PIDFILE" ] && kill -0 $(cat "$PIDFILE") 2>/dev/null; then
    echo "自动重启脚本已在运行中,跳过本次执行" >&2
    exit 0
fi
echo $$ > "$PIDFILE"
trap "rm -f $PIDFILE" EXIT

# Resolve the script's directory
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
INSTALL_RECORD_FILE="$SCRIPT_DIR/.install_record"
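Both guards above (here and in check_health.sh) use the PID-file pattern. An flock-based variant is a common, slightly sturdier alternative: the lock cannot be fooled by stale PID reuse and is released automatically when the process exits. A hedged sketch, not the project's actual implementation; flock(1) ships with util-linux on Ubuntu and the lock path is an example:

```bash
#!/bin/bash
# Sketch: flock-based single-instance guard (alternative to the PID file above).
exec 9>/var/lock/restart_unhealthy.lock
if ! flock -n 9; then
    echo "another instance is running; skipping" >&2
    exit 0
fi
# ... rest of the script runs while FD 9 holds the lock ...
```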
@ -1,143 +1,244 @@
|
||||
#!/bin/bash
|
||||
|
||||
# DNS 同步脚本
|
||||
# 比较 FTP 根目录的 dns.conf 和本地的 dns.conf,如果有变化则同步到 /etc/resolv.conf
|
||||
|
||||
set -e
|
||||
|
||||
# 颜色
|
||||
RED='\033[0;31m'; GREEN='\033[0;32m'; YELLOW='\033[1;33m'; BLUE='\033[0;34m'; NC='\033[0m'
|
||||
# 颜色定义
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# 日志函数
|
||||
log_info() { echo -e "${BLUE}[INFO]${NC} $1" >&2; }
|
||||
log_success() { echo -e "${GREEN}[SUCCESS]${NC} $1" >&2; }
|
||||
log_warning() { echo -e "${YELLOW}[WARNING]${NC} $1" >&2; }
|
||||
log_error() { echo -e "${RED}[ERROR]${NC} $1" >&2; }
|
||||
# 日志函数 - 输出到 stderr 避免影响函数返回值
|
||||
log_info() {
|
||||
echo -e "${BLUE}[INFO]${NC} $1" >&2
|
||||
}
|
||||
|
||||
log_success() {
|
||||
echo -e "${GREEN}[SUCCESS]${NC} $1" >&2
|
||||
}
|
||||
|
||||
log_warning() {
|
||||
echo -e "${YELLOW}[WARNING]${NC} $1" >&2
|
||||
}
|
||||
|
||||
log_error() {
|
||||
echo -e "${RED}[ERROR]${NC} $1" >&2
|
||||
}
|
||||
|
||||
# 获取脚本所在目录
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
LOCAL_DNS_CONF="/opt/argus-metric/dns.conf"
|
||||
RESOLV_CONF="/etc/resolv.conf"
|
||||
ALT_RESOLV_CONF="/run/resolv.conf"
|
||||
LOG_FILE="/opt/argus-metric/.dns_sync.log"
|
||||
REMOTE_DNS_CONF_URL=""
|
||||
RESOLV_CONF="/etc/resolv.conf"
|
||||
LOG_FILE="/opt/argus-metric/.dns_sync.log"
|
||||
|
||||
# 获取 FTP 配置
|
||||
# 从环境变量或配置文件获取 FTP 服务器信息
|
||||
get_ftp_config() {
|
||||
# 优先从环境变量获取配置
|
||||
log_info "获取 FTP 配置信息..."
|
||||
|
||||
# 如果环境变量中没有设置,则尝试从配置文件读取
|
||||
if [[ -z "$FTP_SERVER" || -z "$FTP_USER" || -z "$FTP_PASSWORD" ]]; then
|
||||
[[ -f "$SCRIPT_DIR/config.env" ]] && source "$SCRIPT_DIR/config.env"
|
||||
local config_file="$SCRIPT_DIR/config.env"
|
||||
if [[ -f "$config_file" ]]; then
|
||||
log_info "从配置文件读取 FTP 配置: $config_file"
|
||||
source "$config_file"
|
||||
fi
|
||||
else
|
||||
log_info "使用环境变量中的 FTP 配置"
|
||||
fi
|
||||
|
||||
# 设置默认值(如果环境变量和配置文件都没有设置)
|
||||
FTP_SERVER="${FTP_SERVER:-localhost}"
|
||||
FTP_USER="${FTP_USER:-ftpuser}"
|
||||
FTP_PASSWORD="${FTP_PASSWORD:-ZGClab1234!}"
|
||||
|
||||
# 构建远程 DNS 配置文件 URL
|
||||
REMOTE_DNS_CONF_URL="ftp://${FTP_USER}:${FTP_PASSWORD}@${FTP_SERVER}/dns.conf"
|
||||
|
||||
log_info "FTP 配置来源: ${FTP_CONFIG_SOURCE:-环境变量/配置文件}"
|
||||
}
|
||||
|
||||
# 下载远程 dns.conf
|
||||
# 下载远程 DNS 配置文件
|
||||
download_remote_dns_conf() {
|
||||
local tmp="/tmp/dns.remote.$$"
|
||||
local temp_file="/tmp/dns.conf.remote.$$"
|
||||
|
||||
log_info "从 FTP 服务器下载 DNS 配置文件..."
|
||||
log_info "远程地址: $REMOTE_DNS_CONF_URL"
|
||||
log_info "FTP 服务器: $FTP_SERVER"
|
||||
log_info "FTP 用户: $FTP_USER"
|
||||
|
||||
# 先测试 FTP 连接
|
||||
log_info "测试 FTP 连接..."
|
||||
if ! curl -u "${FTP_USER}:${FTP_PASSWORD}" -sfI "ftp://${FTP_SERVER}/" >/dev/null; then
|
||||
log_error "无法连接到 FTP 服务器: $FTP_SERVER"; return 1
|
||||
fi
|
||||
if ! curl -u "${FTP_USER}:${FTP_PASSWORD}" -sf "ftp://${FTP_SERVER}/dns.conf" -o "$tmp" 2>/dev/null; then
|
||||
log_error "下载 dns.conf 失败"; rm -f "$tmp"; return 1
|
||||
fi
|
||||
echo "$tmp"
|
||||
}
|
||||
|
||||
# 文件比较
|
||||
compare_files() { diff -q "$1" "$2" >/dev/null 2>&1; }
|
||||
|
||||
# 从 dns.conf 提取有效 IP
|
||||
get_dns_ips() {
|
||||
grep -Eo '^[0-9]{1,3}(\.[0-9]{1,3}){3}$' "$1" | sort -u
|
||||
}

# Safely update resolv.conf (preserves the symlink by writing in place)
update_resolv_conf() {
    local dns_conf="$1"
    local dns_ips
    mapfile -t dns_ips < <(get_dns_ips "$dns_conf")
    [[ ${#dns_ips[@]} -eq 0 ]] && { log_warning "No valid DNS entries detected"; return; }

    local target_file="$RESOLV_CONF"
    if [[ ! -w "$RESOLV_CONF" ]]; then
        log_warning "/etc/resolv.conf is not writable; falling back to $ALT_RESOLV_CONF"
        target_file="$ALT_RESOLV_CONF"
    fi

    local temp="/tmp/resolv.new.$$"
    cp "$target_file" "${target_file}.backup.$(date +%Y%m%d_%H%M%S)" 2>/dev/null || true
    log_info "Updating DNS config file: $target_file"

    # Write the new nameserver lines first
    for ip in "${dns_ips[@]}"; do
        echo "nameserver $ip"
    done >"$temp"

    # Append the original content (minus its nameserver lines), then dedupe
    grep -v '^nameserver' "$target_file" >>"$temp" 2>/dev/null || true
    awk '!a[$0]++' "$temp" >"${temp}.uniq"

    # Overwrite in place with cat instead of mv: mv fails with "device busy"
    # when /etc/resolv.conf is bind-mounted (e.g. inside containers)
    if cat "${temp}.uniq" >"$target_file" 2>/dev/null; then
        chmod 644 "$target_file"
        log_success "DNS update complete: ${dns_ips[*]}"
    else
        log_error "Cannot write to $target_file; it may be locked by the system"
        rm -f "$temp" "${temp}.uniq"
        return 1
    fi

    rm -f "$temp" "${temp}.uniq"
}

# Make sure resolv.conf contains every IP listed in dns.conf
ensure_dns_in_resolv() {
    local dns_conf="$1"
    local dns_ips
    mapfile -t dns_ips < <(get_dns_ips "$dns_conf")
    [[ ${#dns_ips[@]} -eq 0 ]] && return

    for ip in "${dns_ips[@]}"; do
        if ! grep -q "nameserver $ip" "$RESOLV_CONF" 2>/dev/null; then
            log_warning "/etc/resolv.conf is missing $ip; running fallback repair"
            update_resolv_conf "$dns_conf"
            return
        fi
    done
    log_info "/etc/resolv.conf already contains all DNS entries"
}

# Append a timestamped entry to the sync log
log_sync() {
    local message="$1"
    local timestamp
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$timestamp] $message" >>"$LOG_FILE"
}

# Main entry point
main() {
    log_info "Starting DNS sync check..."

    # Make sure the state directory exists (the sync log lives here)
    mkdir -p /opt/argus-metric

    log_sync "DNS sync check started"

    # Load the FTP configuration
    get_ftp_config

    local remote_file
    if ! remote_file=$(download_remote_dns_conf); then
        log_error "Download failed"; log_sync "Sync failed: could not download remote dns.conf"; exit 1
    fi

    # Initialize or sync the local DNS config file
    if [[ ! -f "$LOCAL_DNS_CONF" ]]; then
        log_info "Local dns.conf does not exist; initializing..."
        cp "$remote_file" "$LOCAL_DNS_CONF"
        update_resolv_conf "$LOCAL_DNS_CONF"
        log_sync "Initial sync complete"
    else
        if compare_files "$LOCAL_DNS_CONF" "$remote_file"; then
            log_info "dns.conf unchanged"
            ensure_dns_in_resolv "$LOCAL_DNS_CONF"
            log_sync "dns.conf unchanged; ran fallback check"
        else
            log_info "DNS configuration change detected; syncing..."
            cp "$remote_file" "$LOCAL_DNS_CONF"
            update_resolv_conf "$LOCAL_DNS_CONF"
            log_sync "DNS config sync complete"
        fi
    fi

    rm -f "$remote_file"
    log_success "DNS sync check complete"
    log_sync "DNS sync check complete"
}

# Script entry point
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
    main "$@"
fi
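
# A typical deployment runs this script periodically, e.g. from cron
# (schedule and install path are illustrative; adjust to your setup):
#
#   */5 * * * * /opt/argus-metric/check_dns_sync.sh >/dev/null 2>&1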
@ -1,59 +0,0 @@
# Client-Side Component Package Build and Publish Process

## Step 1: Configure the Version and Components

First, get the config files in place:

1. Rename `.checklist.example` to `checklist`
2. Rename `.VERSION.example` to `VERSION`

### checklist file format
```
# component-name  directory-path  version  [dependencies]  [install-order]
dcgm-exporter-installer /path/to/dcgm-exporter-installer 1.1.0
node-exporter-installer /path/to/node-exporter-installer 1.1.0
```

### VERSION file
Set the version number you want to publish, e.g. `1.29.0`.

> Tip: use `version-manager.sh` to manage the version.
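
For example, a session might look like this (the subcommand names are hypothetical; check the script's actual help output for its real interface):
```bash
./version-manager.sh show          # hypothetical: print the current version, e.g. 1.29.0
./version-manager.sh bump minor    # hypothetical: 1.29.0 -> 1.30.0
```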

## Step 2: Build the Package

Just run the script:
```bash
./package_artifact.sh
```

Build output lands in the `artifact/` directory, organized into per-version folders.

If the version already exists and you want to overwrite and rebuild:
```bash
./package_artifact.sh --force
```

Once built, you can test the package manually; see the sketch below.
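
A minimal smoke test might just inspect the build output before installing anywhere (the per-version directory layout follows from the description above; the tarball name pattern is an assumption):
```bash
# List what was produced for version 1.29.0
ls -l artifact/1.29.0/

# Peek inside the tarball without unpacking (file-name pattern assumed)
tar -tzf artifact/1.29.0/*.tar.gz | head
```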

## Step 3: Publish the Package

Publish with this script:
```bash
./publish_artifact.sh
```

The published output goes into the `publish/` directory and contains:
- the compressed install package
- a one-command install bash script

## Step 4: Deploy to the FTP Server

Upload the published content to the FTP server; clients can then install with a single command:

```bash
curl -fsSL http://your-ftp-server/install.sh | sh -

curl -fsSL "ftp://ftpuser:{PASSWD}!@10.211.55.4/share/setup.sh" | sudo bash -s -- --server 10.211.55.4 --user ftpuser --password {PASSWD}
```
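
The upload itself can be done with any FTP client; here is a sketch using curl's `-T` upload option (server address and credentials are placeholders, mirroring the examples above):
```bash
# Upload every published file to the FTP root
cd publish
for f in *; do
  curl -fsS -T "$f" "ftp://ftpuser:{PASSWD}@10.211.55.4/"
done
```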

With that, customers can download and install the components directly from the FTP server.
@ -1 +0,0 @@
1.29.0
@ -1,3 +0,0 @@
# component-name  directory-path  version  [dependencies]  [install-order]
dcgm-exporter-installer /Users/sundapeng/Project/nlp/aiops/client-plugins/dcgm-exporter-installer 1.1.0
node-exporter-installer /Users/sundapeng/Project/nlp/aiops/client-plugins/node-exporter-installer 1.1.0