diff --git a/specs/mvp/v2.5/README.md b/specs/mvp/v2.5/README.md
index 0703a10..ebd49fd 100644
--- a/specs/mvp/v2.5/README.md
+++ b/specs/mvp/v2.5/README.md
@@ -12,3 +12,4 @@ Core changes in v2.5:
 - `specs/mvp/v2.5/v2.5_api.md`: API design (users, tasks, queue, logs) and authentication conventions.
 - `specs/mvp/v2.5/v2.5_acceptance.md`: development/deployment/acceptance workflow and verifiable criteria.
 - `specs/mvp/v2.5/v2.5_summary.md`: summary of what v2.5 delivered (what this iteration did, acceptance results, known limitations).
+- `specs/mvp/v2.5/v2.5_container_design.md`: design and verification flow for baking the stateless pool into a single image (head/worker reuse + supervisor-managed daemons).
diff --git a/specs/mvp/v2.5/v2.5_container_design.md b/specs/mvp/v2.5/v2.5_container_design.md
new file mode 100644
index 0000000..653ea76
--- /dev/null
+++ b/specs/mvp/v2.5/v2.5_container_design.md
@@ -0,0 +1,202 @@
+# MVP v2.5 — Stateless Ray Node Pool Containerization Design
+
+Goal: bake the v2.5 **stateless pool (head discovery + worker watchdog)** capability into a reusable image, so we no longer depend on host-side scripts that `docker exec` into containers to start and supervise processes. **The same image serves both head and worker**, with the role selected via an environment variable.
+Constraint: **the API server code stays decoupled from the image**. In the short term we keep the status quo ("host code is volume-mounted into the head container, and the API is started inside the head container"); the API code is not baked into this image.
+
+---
+
+## 1. Background (current state and pain points)
+
+In the current `src/mvp/docker-compose.yaml`, head and worker are both based on `verlai/verl:sgl055.latest`; containers start with `command: sleep infinity`, and host-side scripts then take care of:
+- head: `ray start --head ...` + `head_publisher` (writes `head.json`)
+- worker: `worker_watchdog` (reads `head.json`, automatically joins/rejoins the ray cluster)
+
+Problems with the current setup:
+- startup depends on host scripts running `docker exec`, which is fragile with respect to permissions, paths, and manual operation;
+- the "supervision" is currently a bash while-loop, which makes troubleshooting expensive when something goes wrong;
+- in future production environments the containers may be launched by a compute platform and we can only manage them over SSH, which makes it even more important to move "self-start + self-heal" inside the container.
+
+---
+
+## 2. 
v2.5 Containerization Goals and Non-goals
+
+### 2.1 Goals
+- **One reusable image**: head and worker share a single image, distinguished by `ARGUS_ROLE=head|worker`.
+- **Supervisor-managed processes**: on both head and worker, `supervisord` supervises the key processes:
+  - watchdog crashes → supervisor automatically restarts the watchdog
+  - ray node crashes → the watchdog and/or supervisor triggers automatic recovery (see the process model in 3.2)
+- **Aligned with shared storage**: containers uniformly mount the root path `/private`; discovery files are written to shared storage.
+- **Minimal embedded code**: the image only embeds the stateless-pool python scripts (discovery/publisher/watchdog/entrypoint); the API service code is not baked in.
+- **Remote build**: the image must be built on the dev/runtime machine (e.g. `argus@h1`); the local machine is not required to have a `verlai/verl:*` base image.
+
+### 2.2 Non-goals (not in this iteration)
+- Do not package the API server into this image (a separate `argus-api` image can follow later).
+- Do not change the v2.5 TaskSpec constraints (still use shared resources under `/private/common/...`; user isolation only isolates jobs).
+- Do not introduce K8s/operator/autoscaler in this iteration; only bake in container self-start/self-heal.
+
+---
+
+## 3. Design
+
+### 3.1 Single-image architecture overview
+
+Add a new image (example name):
+- `argus/argus-ray-node:v2.5`
+
+The image:
+- `FROM verlai/verl:sgl055.latest` (base switchable via build-arg)
+- embeds:
+  - `argus_raypool` (or a reused subset of the existing `argus.ray.*`) scripts:
+    - `discovery.py`: reads/writes the head record (head.json)
+    - `head_publisher.py`: head writes head.json (with TTL/refresh)
+    - `worker_watchdog.py`: worker reads head.json, joins/rejoins automatically
+    - (optional) `head_watchdog.py`: wraps "ray head + publisher" into one recoverable watchdog
+  - `/usr/local/bin/argus-entrypoint.sh`: generates the supervisor config for the role and starts supervisor
+  - supervisor config template (or generated at runtime)
+
+### 3.2 Process model (ensuring recovery whether ray or the watchdog crashes)
+
+New requirement: on both head and worker, supervisor must supervise the watchdog; a ray node crash or a watchdog crash must both recover automatically.
+
+Recommended process layout (avoids a daemonized `ray start`, which supervisor cannot observe):
+
+#### A) Head container (ARGUS_ROLE=head)
+Supervisor starts **two programs**:
+1) `argus_head_watchdog` (recommended as python or bash; internally runs `ray start --head --block` in the foreground)
+  - key point: `ray start --head --block` keeps the Ray process blocking in the foreground, so the watchdog, as the parent process, observes the exit code
+  - ray crashes → `ray start --block` returns → watchdog exits non-zero → supervisor restarts the watchdog → ray restarts automatically
+2) `argus_head_publisher`
+  - periodically refreshes `head.json` (TTL/refresh)
+  - publisher crashes → supervisor restarts it automatically
+
+> Alternative: merge the publisher logic into `argus_head_watchdog` (one process running ray plus a publisher thread), reducing the number of supervisor programs; the split version, however, is easier to observe and debug.
+
+#### B) Worker container (ARGUS_ROLE=worker)
+Supervisor starts **one program**:
+1) 
`argus_worker_watchdog`
+  - polls `head.json` and joins the cluster via `ray start --address=
<head_ip>:6379 --block`
+  - whenever the ray process exits (ray crashed / was stopped), `--block` returns and the watchdog moves on to the next reconnect/restart round
+  - if the watchdog itself exits abnormally → supervisor restarts the watchdog automatically
+
+> Note: the `worker_watchdog.py` currently in the repo does "non-blocking `ray start` + restart only when the head addr changes". For containerization it should be upgraded to "`--block` + watch for ray exit"; otherwise supervisor can hardly track ray's lifecycle accurately.
+
+### 3.3 Configuration and environment variables (role-driven)
+
+The image entrypoint depends only on environment variables, not on host-script arguments.
+
+Suggested environment variables (with defaults):
+- `ARGUS_ROLE`: `head` / `worker` (required)
+- `ARGUS_SHARED_ROOT`: defaults to `/private`
+- `ARGUS_CLUSTER_NAME`: defaults to `argus-ray`
+- `ARGUS_HEAD_IP_FILE`: defaults to `${ARGUS_SHARED_ROOT}/ray/discovery/${ARGUS_CLUSTER_NAME}/head.json`
+- `ARGUS_RAY_PORT`: defaults to `6379`
+- `ARGUS_DASHBOARD_PORT`: defaults to `8265` (head)
+- `ARGUS_TTL_S`: defaults to `60` (head publisher)
+- `ARGUS_REFRESH_S`: defaults to `10` (head publisher)
+- `ARGUS_POLL_S`: defaults to `5` (worker watchdog)
+- `ARGUS_NODE_IP`: defaults to empty; when empty, the entrypoint auto-detects the container IP
+- `ARGUS_WORKER_RESOURCES_KV`: defaults to `worker_node=100` (custom resource used to force drivers onto workers)
+- `ARGUS_RAY_EXTRA_ARGS`: optional, extra `ray start` arguments
+- `ARGUS_LOG_DIR`: defaults to `${ARGUS_SHARED_ROOT}/common/logs` (written to the shared directory to ease troubleshooting)
+
+### 3.4 Dockerfile / entrypoint / supervisor design
+
+#### Dockerfile (suggested paths)
+New files in the repo (when implementing):
+- `src/mvp/images/argus-ray-node/Dockerfile`
+- `src/mvp/images/argus-ray-node/entrypoint.sh`
+- `src/mvp/images/argus-ray-node/supervisord.conf.tmpl` (optional)
+- `src/mvp/images/argus-ray-node/py/argus_raypool/*.py` (stateless-pool subset only)
+
+Key Dockerfile steps:
+- `FROM verlai/verl:sgl055.latest` (overridable via `ARG BASE_IMAGE=...`)
+- install supervisor:
+  - Debian/Ubuntu base: `apt-get update && apt-get install -y supervisor`
+  - set `CMD ["supervisord","-n","-c","/etc/supervisor/supervisord.conf"]`
+- copy the python scripts to `/opt/argus/raypool` and set `PYTHONPATH=/opt/argus`
+- copy the entrypoint to `/usr/local/bin/argus-entrypoint.sh`
+- `ENTRYPOINT ["/usr/local/bin/argus-entrypoint.sh"]`
+
+entrypoint.sh logic:
+- detect the container IP (e.g. `hostname -i` or `ip route get 1.1.1.1`)
+- generate the supervisor config according to `ARGUS_ROLE`:
+  - head: start `head_watchdog` + `head_publisher`
+  - worker: start `worker_watchdog`
+- configure supervisor:
+  - `autorestart=true`
+  - 
`startretries` set to a sensible value
+  - stdout/stderr pointed at `${ARGUS_LOG_DIR}/...` or left on stdout (convenient for `docker logs`)
+
+### 3.5 Relationship to the API server (kept decoupled)
+The API server stays as-is (short-term plan):
+- **The code lives on the host** and is volume-mounted into the head container (e.g. `/workspace/mvp`).
+- **The API is started inside the head container** (e.g. via a script: `docker exec argus-ray-head ... python3 /workspace/mvp/py/server.py`).
+- Key point: even though the API process runs inside the head container, it is still treated as "business code independent of the ray node image" and can later evolve into a separate `argus-api` image.
+- All the API needs is access to the Ray job server (usually `http://127.0.0.1:8265` from the head container's point of view).
+
+Later (not this iteration), the API server can become its own `argus-api` image, running against the same `/private` shared directory.
+
+---
+
+## 4. Suggested docker-compose changes (to be implemented)
+
+Conceptual changes to the current compose file:
+- `image: verlai/verl:sgl055.latest` → `image: argus/argus-ray-node:v2.5`
+- remove `command: sleep infinity` (the image ships its own entrypoint)
+- head service adds:
+  - `ARGUS_ROLE=head`
+  - keep the dashboard port exposed as `8265:8265`
+- worker service adds:
+  - `ARGUS_ROLE=worker`
+  - `ARGUS_WORKER_RESOURCES_KV=worker_node=100`
+- volumes still needed:
+  - `../../shared:/private` (shared storage)
+  - `../../verl:/workspace/verl` (verl code/dependencies as today)
+
+---
+
+## 5. Verification and regression flow (how to accept once landed)
+
+### 5.1 Build the image
+1) **Build on the remote `argus@h1`** (the local machine does not need the base image):
+   - `cd /home2/argus/infra/mvp/src/mvp`
+   - `docker build -t argus/argus-ray-node:v2.5 -f images/argus-ray-node/Dockerfile .`
+2) Alternatively use compose build (recommended; matches how it actually runs):
+   - `docker compose -f docker-compose.yaml build --no-cache`
+
+### 5.2 Basic connectivity (stateless pool verification)
+1) `docker compose up -d`
+2) Verify the head writes its record:
+   - `head.json` exists on the shared directory: `${ARGUS_SHARED_ROOT}/ray/discovery/${ARGUS_CLUSTER_NAME}/head.json`
+3) Verify workers join automatically:
+   - `ray status` inside the head container shows the worker nodes
+   - the Dashboard Nodes page shows head + workers
+
+### 5.3 Fault injection (supervisor self-healing verification)
+1) Watchdog crash:
+   - `pkill -f worker_watchdog` (or kill the corresponding PID)
+   - expected: supervisor relaunches the watchdog; the worker eventually rejoins the cluster
+2) Ray node crash (worker):
+   - `ray stop --force` or kill the raylet
+   - expected: the watchdog re-runs `ray start ... 
--block`; the worker recovers
+3) Ray node crash (head):
+   - kill the head's foreground ray process (the one started by the watchdog)
+   - expected: supervisor restarts head_watchdog; the head recovers and rewrites head.json; workers reconnect automatically
+
+### 5.4 End-to-end task regression (together with the v2.5 API)
+Reuse the existing v2.5 E2E:
+- `src/mvp/scripts/run_all_v25_api.sh`
+- `src/mvp/scripts/run_e2e_v25_cases.sh`
+
+Acceptance criteria:
+- PPO/GRPO/SFT all run on workers; the head runs no training
+- the API's task_id / submission_id carry the username as expected
+- on insufficient resources, tasks can move to `PENDING_RESOURCES` and retry on a schedule
+
+---
+
+## 6. Risks and mitigations
+
+- **Daemonized ray start**: if ray keeps being started in the background, supervisor cannot easily notice ray crashing. Mitigation: use `ray start --block` (recommended).
+- **Unstable IP detection**: different environments (compose/platform) expose the container IP differently. Mitigation: the entrypoint tries multiple detection strategies and allows an explicit override via `ARGUS_NODE_IP`.
+- **Log observability**: support writing both to `/private/common/logs` (shared) and to stdout (`docker logs`).
diff --git a/src/mvp/docker-compose.yaml b/src/mvp/docker-compose.yaml
index 7759941..a74c1a3 100644
--- a/src/mvp/docker-compose.yaml
+++ b/src/mvp/docker-compose.yaml
@@ -1,10 +1,12 @@
-version: "3.8"
-
 services:
   ray_head:
-    image: verlai/verl:sgl055.latest
+    build:
+      context: .
+      dockerfile: images/argus-ray-node/Dockerfile
+      args:
+        BASE_IMAGE: verlai/verl:sgl055.latest
+    image: argus/argus-ray-node:v2.5
     container_name: argus-ray-head
-    command: sleep infinity
     ports:
       - "8265:8265"
       - "8080:8080"
@@ -26,6 +28,10 @@ services:
     networks:
       - argus-ray-net
     environment:
+      ARGUS_ROLE: "head"
+      ARGUS_SHARED_ROOT: "/private"
+      ARGUS_CLUSTER_NAME: "argus-ray"
+      ARGUS_LOG_DIR: "/private/common/logs"
       HF_HOME: "/private/hf"
       HUGGINGFACE_HUB_CACHE: "/private/hf/hub"
       TRANSFORMERS_CACHE: "/private/hf/transformers"
@@ -33,9 +39,8 @@ services:
       PYTHONUNBUFFERED: "1"
 
   ray_worker_0:
-    image: verlai/verl:sgl055.latest
+    image: argus/argus-ray-node:v2.5
     container_name: argus-ray-worker-0
-    command: sleep infinity
     volumes:
       - ../../verl:/workspace/verl
       - ../../shared:/private
@@ -52,6 +57,11 @@ services:
       - argus-ray-net
     runtime: nvidia
     environment:
+      ARGUS_ROLE: "worker"
+      ARGUS_SHARED_ROOT: "/private"
+      ARGUS_CLUSTER_NAME: "argus-ray"
+      ARGUS_LOG_DIR: "/private/common/logs"
+      ARGUS_WORKER_RESOURCES_KV: "worker_node=100"
NVIDIA_VISIBLE_DEVICES: "0,1,2,3" NVIDIA_DRIVER_CAPABILITIES: "all" HF_HOME: "/private/hf" @@ -61,9 +71,8 @@ services: PYTHONUNBUFFERED: "1" ray_worker_1: - image: verlai/verl:sgl055.latest + image: argus/argus-ray-node:v2.5 container_name: argus-ray-worker-1 - command: sleep infinity volumes: - ../../verl:/workspace/verl - ../../shared:/private @@ -80,6 +89,11 @@ services: - argus-ray-net runtime: nvidia environment: + ARGUS_ROLE: "worker" + ARGUS_SHARED_ROOT: "/private" + ARGUS_CLUSTER_NAME: "argus-ray" + ARGUS_LOG_DIR: "/private/common/logs" + ARGUS_WORKER_RESOURCES_KV: "worker_node=100" NVIDIA_VISIBLE_DEVICES: "4,5,6,7" NVIDIA_DRIVER_CAPABILITIES: "all" HF_HOME: "/private/hf" diff --git a/src/mvp/images/argus-ray-node/Dockerfile b/src/mvp/images/argus-ray-node/Dockerfile new file mode 100644 index 0000000..f90eb30 --- /dev/null +++ b/src/mvp/images/argus-ray-node/Dockerfile @@ -0,0 +1,26 @@ +ARG BASE_IMAGE=verlai/verl:sgl055.latest +FROM ${BASE_IMAGE} + +SHELL ["/bin/bash", "-lc"] + +# Install supervisord (prefer pip to avoid relying on distro package manager). +RUN python3 -m pip install --no-cache-dir supervisor + +RUN mkdir -p /opt/argus/py/argus/ray + +# Minimal embedded code for stateless pool (API code is intentionally excluded). 
+COPY py/argus/__init__.py /opt/argus/py/argus/__init__.py +COPY py/argus/ray/__init__.py /opt/argus/py/argus/ray/__init__.py +COPY py/argus/ray/discovery.py /opt/argus/py/argus/ray/discovery.py +COPY py/argus/ray/head_publisher.py /opt/argus/py/argus/ray/head_publisher.py +COPY py/argus/ray/worker_watchdog.py /opt/argus/py/argus/ray/worker_watchdog.py + +COPY images/argus-ray-node/entrypoint.sh /usr/local/bin/argus-entrypoint.sh +COPY images/argus-ray-node/argus-head-ray.sh /usr/local/bin/argus-head-ray.sh +COPY images/argus-ray-node/argus-head-publisher.sh /usr/local/bin/argus-head-publisher.sh +COPY images/argus-ray-node/argus-worker-watchdog.sh /usr/local/bin/argus-worker-watchdog.sh +RUN chmod +x /usr/local/bin/argus-entrypoint.sh /usr/local/bin/argus-head-ray.sh /usr/local/bin/argus-head-publisher.sh /usr/local/bin/argus-worker-watchdog.sh + +ENV PYTHONUNBUFFERED=1 + +ENTRYPOINT ["/usr/local/bin/argus-entrypoint.sh"] diff --git a/src/mvp/images/argus-ray-node/argus-head-publisher.sh b/src/mvp/images/argus-ray-node/argus-head-publisher.sh new file mode 100644 index 0000000..26a90f7 --- /dev/null +++ b/src/mvp/images/argus-ray-node/argus-head-publisher.sh @@ -0,0 +1,20 @@ +#!/usr/bin/env bash +set -euo pipefail + +cluster_name="${ARGUS_CLUSTER_NAME:-argus-ray}" +head_ip_file="${ARGUS_HEAD_IP_FILE:?missing ARGUS_HEAD_IP_FILE}" +node_ip="${ARGUS_NODE_IP:?missing ARGUS_NODE_IP}" +ray_port="${ARGUS_RAY_PORT:-6379}" +dashboard_port="${ARGUS_DASHBOARD_PORT:-8265}" +ttl_s="${ARGUS_TTL_S:-60}" +refresh_s="${ARGUS_REFRESH_S:-10}" + +exec python3 -m argus.ray.head_publisher \ + --cluster-name "${cluster_name}" \ + --head-ip-file "${head_ip_file}" \ + --head-ip "${node_ip}" \ + --gcs-port "${ray_port}" \ + --dashboard-port "${dashboard_port}" \ + --ttl-s "${ttl_s}" \ + --refresh-s "${refresh_s}" + diff --git a/src/mvp/images/argus-ray-node/argus-head-ray.sh b/src/mvp/images/argus-ray-node/argus-head-ray.sh new file mode 100644 index 0000000..1e873dc --- /dev/null +++ 
b/src/mvp/images/argus-ray-node/argus-head-ray.sh @@ -0,0 +1,24 @@ +#!/usr/bin/env bash +set -euo pipefail + +node_ip="${ARGUS_NODE_IP:?missing ARGUS_NODE_IP}" +ray_port="${ARGUS_RAY_PORT:-6379}" +dashboard_port="${ARGUS_DASHBOARD_PORT:-8265}" + +export RAY_DISABLE_USAGE_STATS=1 + +ray stop --force || true + +# Head should not run training workloads. +exec ray start \ + --head \ + --node-ip-address="${node_ip}" \ + --port="${ray_port}" \ + --dashboard-host=0.0.0.0 \ + --dashboard-port="${dashboard_port}" \ + --num-cpus=0 \ + --num-gpus=0 \ + --disable-usage-stats \ + --block \ + ${ARGUS_RAY_EXTRA_ARGS:-} + diff --git a/src/mvp/images/argus-ray-node/argus-worker-watchdog.sh b/src/mvp/images/argus-ray-node/argus-worker-watchdog.sh new file mode 100644 index 0000000..cfd1688 --- /dev/null +++ b/src/mvp/images/argus-ray-node/argus-worker-watchdog.sh @@ -0,0 +1,24 @@ +#!/usr/bin/env bash +set -euo pipefail + +head_ip_file="${ARGUS_HEAD_IP_FILE:?missing ARGUS_HEAD_IP_FILE}" +node_ip="${ARGUS_NODE_IP:?missing ARGUS_NODE_IP}" +poll_s="${ARGUS_POLL_S:-5}" + +resources_kv="${ARGUS_WORKER_RESOURCES_KV:-worker_node=100}" + +args=() +IFS=',' read -r -a parts <<<"${resources_kv}" +for kv in "${parts[@]}"; do + kv="$(echo "${kv}" | xargs || true)" + [[ -z "${kv}" ]] && continue + args+=( "--resources-kv" "${kv}" ) +done + +exec python3 -m argus.ray.worker_watchdog \ + --head-ip-file "${head_ip_file}" \ + --node-ip "${node_ip}" \ + "${args[@]}" \ + --poll-s "${poll_s}" \ + --ray-start-block + diff --git a/src/mvp/images/argus-ray-node/entrypoint.sh b/src/mvp/images/argus-ray-node/entrypoint.sh new file mode 100644 index 0000000..1031f5c --- /dev/null +++ b/src/mvp/images/argus-ray-node/entrypoint.sh @@ -0,0 +1,93 @@ +#!/usr/bin/env bash +set -euo pipefail + +role="${ARGUS_ROLE:-}" +if [[ -z "${role}" ]]; then + echo "ERROR: ARGUS_ROLE must be set to head|worker" >&2 + exit 1 +fi +if [[ "${role}" != "head" && "${role}" != "worker" ]]; then + echo "ERROR: invalid ARGUS_ROLE=${role} 
(expected head|worker)" >&2 + exit 1 +fi + +shared_root="${ARGUS_SHARED_ROOT:-/private}" +cluster_name="${ARGUS_CLUSTER_NAME:-argus-ray}" +head_ip_file="${ARGUS_HEAD_IP_FILE:-${shared_root}/ray/discovery/${cluster_name}/head.json}" + +ray_port="${ARGUS_RAY_PORT:-6379}" +dashboard_port="${ARGUS_DASHBOARD_PORT:-8265}" + +ttl_s="${ARGUS_TTL_S:-60}" +refresh_s="${ARGUS_REFRESH_S:-10}" +poll_s="${ARGUS_POLL_S:-5}" + +log_dir="${ARGUS_LOG_DIR:-${shared_root}/common/logs}" +mkdir -p "${log_dir}" + +node_ip="${ARGUS_NODE_IP:-}" +if [[ -z "${node_ip}" ]]; then + # Prefer the first IP returned by `hostname -i`. + node_ip="$(hostname -i | awk '{print $1}')" +fi +if [[ -z "${node_ip}" ]]; then + echo "ERROR: failed to determine container IP; set ARGUS_NODE_IP" >&2 + exit 1 +fi + +export PYTHONPATH="/opt/argus/py:${PYTHONPATH:-}" +export ARGUS_SHARED_ROOT="${shared_root}" +export ARGUS_CLUSTER_NAME="${cluster_name}" +export ARGUS_HEAD_IP_FILE="${head_ip_file}" +export ARGUS_NODE_IP="${node_ip}" +export ARGUS_RAY_PORT="${ray_port}" +export ARGUS_DASHBOARD_PORT="${dashboard_port}" +export ARGUS_TTL_S="${ttl_s}" +export ARGUS_REFRESH_S="${refresh_s}" +export ARGUS_POLL_S="${poll_s}" +export ARGUS_LOG_DIR="${log_dir}" + +supervisor_conf="/tmp/supervisord.conf" +cat >"${supervisor_conf}" <<'CONF' +[supervisord] +nodaemon=true +pidfile=/tmp/supervisord.pid +logfile=/dev/null +childlogdir=/tmp +user=root + +[supervisorctl] +serverurl=unix:///tmp/supervisor.sock + +[unix_http_server] +file=/tmp/supervisor.sock +chmod=0700 +CONF + +append_program() { + local name="$1" + local cmd="$2" + local stdout="$3" + local stderr="$4" + cat >>"${supervisor_conf}" <