Compare commits

...

3 Commits

58 changed files with 6081 additions and 1 deletions

3
.gitignore vendored
View File

@ -1,3 +1,4 @@
verl/
skypilot-ssh-test/
ray_in_docker/
__pycache__/

34
specs/mvp/milestones.md Normal file
View File

@ -0,0 +1,34 @@
# milestones
The milestones below are used to break the work down, analyze it, and confirm feasibility. The end goal is a verl training platform built on a native Ray cluster (no Kubernetes underneath) that supports multiple users running all kinds of verl jobs, improves overall cluster resource utilization, and can be observed through a monitoring system for resource accounting, monitoring, and alerting. Once an operations SOP has taken shape, an ops agent can be plugged in to carry out automated operations.
- Workloads
  - PPO on Ray
  - GRPO on Ray
  - SFT on Ray (feasibility)
  - Model serving on Ray
  - Custom code: submit any verl example, or user-supplied code
  - Custom reward functions
  - Concurrent multi-version verl support: different Ray jobs run at the same time with different verl versions, including user-modified forks
- Ray Job management
  - Submit via the Python API rather than the Ray CLI
  - Job queueing: no priorities; among multiple pending jobs, whichever has its resource request satisfied first runs first
  - [Confirmed supported] Gang scheduling (all or nothing): `trainer.nnodes` and `trainer.n_gpus_per_node` are specified up front; if they cannot be satisfied the job stays pending
  - No quota management, fair scheduling, or similar features
  - Ray itself has no job timeout parameter; a separate job monitor must detect the timeout and stop the job
- Pipeline management [advanced; not implemented for now]
  - A further wrapper on top of Ray Jobs that chains multiple Ray Jobs to automatically complete training, model merging, and other chained steps
- Observability
  - Validate a locally deployed Weights & Biases server and how it integrates into the existing job flow
  - Deploy Prometheus & Grafana to monitor the Ray nodes
  - Job monitoring: which jobs used how many resources, for how long, whether utilization was adequate, and whether GPUs sat reserved but idle
- Data and model storage management
  - Shared dataset management: HF datasets shared by all users
  - HF model management: the base HF model library shared by all users
  - User dataset management: each user's own datasets
  - User model management: each user's own models, including trained checkpoints
  - Job data management: job artifacts and temporary-directory data
  - User management: a single interface where users manage their own dataset/model space and the temporary directories of the jobs they run, so they can organize task pipelines flexibly and browse files conveniently
- Networking
  - Confirm whether IB (H100 environment) and RoCEv2 (H20 environment) are supported, and how they need to be configured

348
specs/mvp/mvp_roadmap.md Normal file
View File

@ -0,0 +1,348 @@
# MVP Roadmap(V1 → V2 → … → 训练平台)
本文档在 `specs/mvp/milestones.md` 的草稿基础上做**扩展与细化**把目标拆成可迭代的版本MVP v1/v2/…),保证每个版本都能**独立运行、可验证验收**,并且在上一版本基础上演进。
> 总目标North Star产出一套**基于 Native Ray 集群(无 K8s 底座)**的训练平台,面向多用户,支持 `verl` 各类训练/评测/Serving 工作负载,提升集群利用率,并通过可观测系统实现资源统计、监控告警,最终形成运维 SOP 并可接入运维智能体做自动化运维。
---
## 0. 关键原则(贯穿所有版本)
1) **版本可独立运行**:每个版本都能从“空环境”按文档跑起来(不依赖未来能力)。
2) **验收可客观验证**:每个里程碑必须有明确的 DoDDefinition of Done与可复现步骤。
3) **强制产物落盘**:模型/数据/日志/ckpt 必须可追踪、可复用、可审计(基于共享存储/NFS
4) **Head 不参与计算**Head 只承担控制面GCS/Dashboard/Job server避免训练抢占控制面资源。
5) **按 submission id 组织作业**:作业输出目录与 Ray submission id 绑定,方便检索、回收、归档。
6) **“先把 RL 跑稳”,再扩 workload**:先 PPO已验证再 GRPO/SFT/Serving。
---
## 0.1 里程碑总览(建议交付顺序)
| 版本 | 定位 | 关键交付 | 核心验收点 |
|---|---|---|---|
| v1 | 可复现实验闭环 | Ray 集群 + PPO 跑通 + 持久化 | driver 不在 head产物落盘 |
| v1.1 | 实验工程化 | JobSpec 模板 + 新增 1 个 workload | 可回归、可定位、可扩展 |
| v2.0 | 服务化入口 | API + Ray Jobs SDK | API 提交/查询/停止可用 |
| v2.1 | 节点纳管 | SSH 注入 + 资源池/标签 | 节点上线/下线、gang 约束 |
| v3.0 | 平台雏形 | 队列 + 超时 + 最小多用户 | pending→running 自动调度 |
| v3.1 | 可扩展平台 | 自定义代码/reward + 多版本 | 多版本并存、插件可用 |
| v4.0 | 可运营平台 | Prom/Grafana + W&B | 资源核算/告警/归档 |
| v4.1 | 可交接平台 | SOP + 自动化运维接口 | 非开发可按 SOP 运维 |
| v5.0 | 长期形态 | Serving + Pipeline | 训练→发布推理闭环 |
## 1. 当前基线(MVP v1,已完成/已验证)
### 1.1 目标
在单机(或同一宿主机)用 3 个容器跑通:
- Ray head无 GPUCPU=0/GPU=0
- 2 个 Ray worker每个 4 GPU
- 通过 **head 上的 `ray job submit`** 提交 `verl` PPO`total_epochs=1`
- 通过 **entrypoint 自定义资源**强制 driver 在 worker 上
- 数据/模型/日志/ckpt 全部持久化
### 1.2 交付物repo 中已存在)
- 脚本与 compose`src/mvp/v1/`
- 行动与验收文档:`specs/mvp/v1/v1_action.md`
- 共享目录约定:`shared/datasets``shared/hf``shared/jobs` 等(与 NFS 对齐)
### 1.3 验收口径(摘要)
- `ray job list``driver_info.node_ip_address` ∈ worker IP且 ≠ head IP
- 训练输出落在 `/mnt/shared/jobs/<submission_id>/...`
- checkpoint 按 `save_freq` 产生(避免爆磁盘)
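A minimal sketch of the driver-placement check, using the Ray Jobs Python SDK (the `assert_driver_on_worker` helper is hypothetical, not part of the repo):
```python
from ray.job_submission import JobSubmissionClient

def assert_driver_on_worker(address: str, submission_id: str, head_ip: str) -> None:
    """Fail loudly if the job's driver ended up on the head node."""
    client = JobSubmissionClient(address)            # e.g. "http://127.0.0.1:8265"
    job = client.get_job_info(submission_id)
    driver_ip = job.driver_info.node_ip_address if job.driver_info else None
    assert driver_ip and driver_ip != head_ip, (
        f"driver ran on {driver_ip!r}, expected a worker node (head={head_ip})"
    )
```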
---
## 2. MVP v1.1(Hardening + 多 workload 可行性验证)
> 目标:把 v1 从“实验脚本”升级成“可长期回归的最小系统”,并验证更多 workload 的可行性边界。
### 2.1 主要能力
- Workload 扩展(可选顺序):
- PPO回归金标
- GRPO on Ray可运行验证
- SFT on Ray可运行验证`llamafactory``verl` 相关 SFT 路径)
- 作业模板化(最小实现):
- 统一 JobSpecYAML/JSON描述workload 类型、资源nnodes/n_gpus_per_node、数据、模型、输出目录、超时
- 仍然用 `ray job submit`,但把 entrypoint 组装逻辑标准化
- checkpoint 策略与磁盘保护:
- 默认 `save_freq` ≥ 10或按训练总 steps 的比例)
- 明确保留策略(至少提供“保留最后 N 个 ckpt”的配置建议/脚本)
- “失败可定位”:
- 统一收敛日志入口Ray job logs + hydra 日志目录 + 关键参数快照)
- 失败时能定位:是资源不足 / NCCL / 数据 / 模型 / 配置错误
### 2.2 验收DoD
- 同一套脚本在同一台机器能连续跑 3 次 PPO 回归,产物目录不互相覆盖
- 至少新增 1 个 workloadGRPO 或 SFT可以跑通 “启动→训练→落盘” 闭环
- 作业目录内包含:
- `config/submit_cmd.txt`(或 job spec 快照)
- `logs/`(可追踪)
- `checkpoints/`(按策略生成)
---
## 3. MVP v2.0(Control Plane 服务化:API + Ray Jobs SDK)
> 目标:从“人跑脚本”升级为“服务提交任务”。依然是 Native Ray 集群,但引入一个最小控制平面服务。
### 3.1 系统形态
- Control Plane建议部署在 head/CPU 机器):
- FastAPI 服务REST
- Job 管理:用 Ray Jobs **Python SDK** 提交/查询/停止(不再依赖 CLI 文本解析)
- 节点视图:读取 Ray statenodes, actors, placement groups
- Data Plane
- 仍然是预先启动的 worker 节点加入集群(先不做 SSH 动态纳管也可)
### 3.2 APIMVP 级别)
- `POST /v1/jobs`:提交 JobSpecppo/grpo/sft
- `GET /v1/jobs`:列表(含状态、资源、开始/结束时间)
- `GET /v1/jobs/{id}`详情含输出目录、driver node
- `POST /v1/jobs/{id}:stop`:停止作业
### 3.3 验收DoD
- API 提交 PPO返回 submission id输出目录为 `/mnt/shared/jobs/<submission_id>/...`
- API 查询 job 状态与 driver node必须是 worker
- 停止 job 后,资源释放、状态可见
---
## 4. MVP v2.1(SSH 纳管 + 资源池 + Gang 约束)
> 目标:对齐 `specs/mvp/milestones.md` 草稿中“SSH 纳管”的约束与需求:控制面能纳管 GPU 节点,形成可运营的资源池。
### 4.1 节点纳管(SSH Provisioner)
- 控制面保存 NodeSpecip/user/port/labels/gpu_count
- 通过 SSH 执行:
- `ray start --address=<head>:6379 --resources=...`
- `ray stop`drain/下线)
- 维护节点状态机:`pending → online → draining → offline`
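A minimal sketch of the SSH provisioning calls listed above (host/user/port come from the NodeSpec; helper names and error handling are assumptions):
```python
import shlex
import subprocess

def ssh_run(ip: str, user: str, port: int, command: str) -> subprocess.CompletedProcess:
    """Run one command inside a managed GPU container over SSH."""
    return subprocess.run(
        ["ssh", "-p", str(port), f"{user}@{ip}", command],
        capture_output=True, text=True, check=True,
    )

def bring_node_online(node: dict, head_addr: str) -> None:
    # pending -> online: join the existing cluster with the worker_node label
    resources = shlex.quote('{"worker_node": 100}')
    ssh_run(node["ip"], node["user"], node["port"],
            f"ray start --address={head_addr}:6379 --resources={resources}")

def drain_node(node: dict) -> None:
    # draining -> offline: stop Ray so the node no longer accepts new work
    ssh_run(node["ip"], node["user"], node["port"], "ray stop")
```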
### 4.2 资源池与 gang(All-or-nothing)
- 资源池最小模型:
- pool 标签(如 `pool_a``h20``ib_domain_1`
- 提交 job 时指定 pool 约束
- Gang 约束MVP 实现方式):
- job spec 明确 `trainer.nnodes` + `trainer.n_gpus_per_node`
- 提交前检查 Ray 可用资源是否满足,不满足则进入 pending 队列(见 v3.0
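A minimal sketch of the gang pre-check described above, using the cluster-wide total from the public `ray.available_resources()` (a per-node check would be stricter; see the v2.0 scheduler notes later in this document):
```python
import ray

def gang_satisfied(nnodes: int, n_gpus_per_node: int) -> bool:
    """Return True only if the cluster currently has enough free GPUs for the whole gang."""
    ray.init(address="auto", ignore_reinit_error=True)
    total_free_gpus = ray.available_resources().get("GPU", 0.0)
    return total_free_gpus >= nnodes * n_gpus_per_node
```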
### 4.3 验收DoD
- 通过 API 注册 2 个 workerSSH 注入 ray start`ray status` 可见节点上线
- 通过 API 下线节点,节点被标记不可调度且不再分配新 job
- gang 不满足时 job 不提交(或提交后一直 pending满足后可运行
---
## 5. MVP v3.0(调度与多用户:队列 + 超时 + 最小权限)
> 目标:平台开始“像个平台”:多用户、队列、超时、审计。仍然不做复杂配额/公平调度。
### 5.1 作业队列(简单但可用)
- FIFO 队列:无优先级
- “资源满足就调度”:谁先满足谁先跑(可接受非严格 FIFO
- job 超时Ray 原生不支持统一 timeout草稿已指出因此控制面需
- 记录 start_time
- 定期扫描超时 job → `stop`
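A minimal sketch of the timeout scan described above; the task fields (`start_time`, `timeout_s`, `ray_submission_id`) are assumed bookkeeping of the control plane, not existing code:
```python
import time
from ray.job_submission import JobSubmissionClient

def enforce_timeouts(client: JobSubmissionClient, tasks: list[dict]) -> None:
    """Stop any RUNNING task that has exceeded its configured timeout."""
    now = time.time()
    for task in tasks:                                    # e.g. rows loaded from the control-plane DB
        if task["state"] != "RUNNING":
            continue
        if now - task["start_time"] > task["timeout_s"]:
            client.stop_job(task["ray_submission_id"])    # Ray stops the job
            task["state"] = "TIMEOUT"                     # surfaced back to the user
```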
### 5.2 多用户最小闭环
- 认证MVPtoken 或 basic auth先不做复杂 RBAC
- 归属与隔离(文件层):
- `/mnt/shared/users/<user>/datasets/`
- `/mnt/shared/users/<user>/models/`
- `/mnt/shared/jobs/<submission_id>/` 记录 user/metadata
### 5.3 验收DoD
- 2 个用户可各自提交 job能看到自己的 job 列表与输出目录
- 超时策略可触发(模拟短 timeoutjob 被停止且状态标记为 timeout
- 队列在资源不足时保持 pending资源释放后自动运行
---
## 6. MVP v3.1(可扩展性:自定义代码/Reward、多版本 VERL)
> 目标:把“平台内置 workload”升级成“用户可提交自定义代码与 reward”并支持多版本并存。
### 6.1 自定义代码提交(最小实现)
两种方式二选一(建议先做 A):
- A`working_dir` 指向 NFS 上的代码快照目录(用户自己准备/上传)
- B上传 zip控制面落到 NFS 并解压为 code snapshot
### 6.2 多版本 VERL 并存
约束前提:**基础镜像保持同一个**(生产环境容器由算力平台创建时已固定镜像标签)。
目标:在同一 Ray 集群内,不同 job 可以使用不同版本的 `verl`(例如不同分支/commit 或用户魔改版)。
已确认优先方案A**必须通过 Ray Job 的 `runtime_env.env_vars` 透传 `PYTHONPATH`**,让 job 粒度优先 import 指定代码快照。
建议方案(以 NFS 为中心,最小可行实现):
- 在共享存储上以“不可变快照”的方式存放代码版本(推荐 commit hash 命名):
- `${SHARED_ROOT}/common/code/verl/<commit>/...`
- `${SHARED_ROOT}/users/<user>/code/verl/<commit>/...`(用户魔改版)
- JobSpec 增加 `code_path`(指向上述目录),控制面在提交 job 时注入(必须走 runtime_env
- `runtime_env.env_vars.PYTHONPATH = "<code_path>:$PYTHONPATH"`(把 code_path 放最前面,确保 import 优先级)
示例(概念性,实际以 `${SHARED_ROOT}` 为准):
```bash
CODE_PATH="${SHARED_ROOT}/common/code/verl/<commit>"
ray job submit \
--address="http://127.0.0.1:8265" \
--submission-id="<submission_id>" \
--runtime-env-json='{"env_vars": {"PYTHONPATH": "'"${CODE_PATH}"':$PYTHONPATH"}}' \
-- \
python3 -m verl.trainer.main_ppo ...
```
需要验证的关键点(作为 v3.1 的 DoD 之一):
- 同时运行两个 job
- jobA 使用 `<commitA>`jobB 使用 `<commitB>`
- 互不影响,且各自训练/日志/ckpt 正常
- job 粒度是否能做到“依赖隔离”(至少做到 `verl` 版本隔离;第三方依赖冲突可先假设镜像内一致)
> 备注:当前 v1 的做法是容器内全局 `pip install -e /workspace/verl`,这会让所有 job 默认使用同一份 `verl`。要实现多版本并存,必须让 job 的 import 优先使用 `code_path`(或为每个 job 单独创建 venv/安装 wheel后者更重建议后置
### 6.3 自定义 reward function
- JobSpec 支持 `reward_fn_path`Python 模块路径)
- `reward_fn_path` 可指向共享存储中用户自定义代码目录(例如 `${SHARED_ROOT}/users/<user>/code/...`
- 约束:代码必须在 job runtime 中可 import`working_dir`/`PYTHONPATH` 或 runtime_env 保障)
- 控制面校验模块可导入basic lint/安全白名单可后置)
### 6.4 验收DoD
- 同时运行两个 job使用不同的 `verl` 代码版本(或用户魔改版本),互不影响
- 用户可在 JobSpec 中替换 reward function 并跑通一个最小训练闭环
---
## 7. MVP v4.0(可观测性:Prometheus/Grafana + W&B 集成)
> 目标:平台可运营:能回答“谁在用多少资源、跑了多久、利用率如何、是否空占 GPU”。
### 7.1 指标与监控
- Ray 指标接入 Prometheus节点/任务/actor
- GPU 指标nvidia exporter 或 DCGM exporter
- DashboardGrafana至少 3 张核心面板)
- 集群总 GPU/CPU 使用率、空闲率
- 每 job 的 GPU 时间、峰值显存、运行时长
- 节点健康(心跳/掉线)与告警
### 7.2 W&B或等价集成验证
- 最小可行:单机 self-host W&B server 可用性验证
- JobSpec 支持启用/关闭 W&B并传入 project/run name
### 7.3 验收DoD
- Grafana 上能看到集群与 job 资源视图
- 某个 job GPU 利用率异常(模拟)能触发告警规则(邮件/IM/日志即可)
- W&B 指标能按 job 维度归档(至少 PPO 能上报)
---
## 8. MVP v4.1(运维化:SOP + 自动化运维接口)
> 目标:把平台变成“可交接”的系统:运维动作标准化,并为智能体留出接口。
### 8.1 SOP 与自动化入口
- SOP 文档:
- 节点上线/下线
- 故障定位Ray session、Ray job、NCCL、OOM
- 资源回收(停止 job、清理 ckpt
- 自动化接口(最小):
- `/v1/ops/drain_node`
- `/v1/ops/restart_ray_head`(谨慎:需要保护与权限)
- `/v1/ops/cleanup_job_artifacts`
### 8.2 验收DoD
- 按 SOP非开发人员可完成一次“节点上线→跑任务→下线→清理”
- 自动化接口至少能完成 1 个高频动作(如清理/停止/下线)
---
## 9. MVP v5.0(Serving 与 Pipeline,偏长期)
> 目标:训练-部署一体化:支持 model serving并在平台内串联训练→评测→发布。
### 9.1 Serving
- Ray Serve或等价部署模型推理服务
- Serving 与训练共用模型库与权限(按 user/project
### 9.2 Pipeline草稿里标为高级
- Pipeline 是对多个 job 的封装训练→merge→eval→publish
- 可先实现最小 DAG两步串联作为验证
### 9.3 验收DoD
- 训练产物一键发布为一个可访问的推理 endpoint
- Pipeline 能自动串联并产出最终 artifact可回滚/可追踪)
---
## 10. 并行技术验证(建议尽早做)
这些属于“跨版本”风险项,建议在 v1.1 ~ v2.0 期间尽早做:
### 10.1 网络IB / RoCEv2
- 确认环境是否支持 IBH100或 RoCEv2H20
- 跑最小 NCCL 通信验证all-reduce / bandwidth
- 将必要的 NCCL 环境变量注入到 job runtime_env
### 10.2 Ray + 多节点容器约束
- 多容器同宿主机时的 Ray node_ip/临时目录冲突规律(已踩坑,需固化规范)
- 端口范围与防火墙策略Ray worker 端口、dashboard、metrics
---
## 11. 已确认的约束与假设(来自讨论结论)
这些会直接影响 v2.1SSH 纳管)与后续多用户/存储设计:
1) **最终形态仍以“每节点容器”运行**(不是裸机 systemd
- H20 开发环境:我们可在宿主机用 `docker compose` 自建容器,并通过 SSH 进入容器调试/纳管。
- H100 生产环境:容器由算力平台创建/回收;平台侧控制面只能 **SSH 进入这些容器** 做纳管(执行 `ray start/stop`、注入 env 等)。
2) **认证**:内部 token 即可MVP 阶段不对接 SSO
3) **存储**:只考虑 NFS。
- 开发环境NFS/共享目录可通过宿主机 bind mount 提供给容器。
- 生产环境:所有容器挂载相同 NFS容器内共享根路径为 `/private/`(需要在实现时把“共享根路径”做成可配置项,而不是写死 `/mnt/shared`)。
4) **网络拓扑约束**:暂不做按 IB 域/机架/拓扑的强约束调度(第 10.1 仍需验证 IB/RoCE 是否可用与配置方式,但调度不引入拓扑维度)。
5) **共享目录分层**:在 `users/<user>/...` 之外增加一个可读写的 `common/` 目录用于共享数据/模型/代码:
- `${SHARED_ROOT}/common/datasets/`
- `${SHARED_ROOT}/common/models/`
- `${SHARED_ROOT}/common/code/`
- 权限MVP先默认“所有内部 token 用户可读写”,后续再细化只读/受控写。
---
## 12. 仍需你确认/讨论的问题(剩余不确定项)
1) `runtime_env.env_vars` 注入对“子进程/训练框架内部启动进程”的覆盖范围是否足够?
- 需要确认 `verl`/`sglang` 等子进程是否继承 driver 的环境变量(通常会继承,但建议在 v3.1 验收时明确验证)。
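A minimal sketch of a probe job that could answer this question: submit it as the entrypoint of a Ray Job with `runtime_env.env_vars` set, and compare the `PYTHONPATH` seen by the driver, by a plain subprocess, and by a Ray task (the script itself is an assumption, not existing code):
```python
import os
import subprocess
import ray

@ray.remote
def task_pythonpath() -> str:
    # runs in a Ray worker process; shows what runtime_env propagates there
    return os.environ.get("PYTHONPATH", "")

if __name__ == "__main__":
    ray.init(address="auto")
    print("driver     :", os.environ.get("PYTHONPATH", ""))
    sub = subprocess.run(
        ["python3", "-c", "import os; print(os.environ.get('PYTHONPATH', ''))"],
        capture_output=True, text=True,
    )
    print("subprocess :", sub.stdout.strip())
    print("ray task   :", ray.get(task_pythonpath.remote()))
```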

169
specs/mvp/v1.1/mvp_plan.md Normal file
View File

@ -0,0 +1,169 @@
# MVP v1.1 计划(Hardening + 多 Workload 可行性验证)
本目录是 `specs/mvp/v1/` 的下一步迭代:在 v1 已经跑通Ray head + 2 workerPPO on Ray持久化落盘的基础上把它升级为**可长期回归**的最小系统,并扩展至少一个新 workload 的可行性闭环。
> v1.1 的目标不是做平台服务化API/队列/多用户)——那是 v2/v3 的工作v1.1 聚焦“工程化 + 可行性边界验证 + 可观测/可排障基础”。
---
## 1. v1 基线回顾(已完成)
- 拓扑1 head无 GPUCPU/GPU=0+ 2 worker各 4 GPU
- 提交方式:必须用 head 上的 `ray job submit`
- driver 调度:通过 `worker_node` 自定义资源 + `--entrypoint-resources` 强制 driver 在 worker
- 输出:按 `submission_id` 组织到共享目录NFS
相关实现参考:
- 脚本:`src/mvp/v1/`
- 验收动作:`specs/mvp/v1/v1_action.md`
- Roadmap`specs/mvp/mvp_roadmap.md`
---
## 2. v1.1 目标(必须达成)
### 2.1 工程化Hardening
1) **JobSpec 标准化(最小)**
- 把“提交 job 需要的参数”收敛成结构化文件:
- Ray 基础配置YAMLcluster 地址、entrypoint 资源约束、runtime_env 等
- 训练 JobSpecYAMLworkload 语义与训练参数
- 至少覆盖:`submission_id`、workload 类型、资源需求、共享根路径、模型/数据路径、输出目录、超时、环境变量注入。
- v1.1 实现落点(已在 repo 里提供SDK 方式):
- RayConfig 示例:`src/mvp/v1.1/py/configs/dev.yaml`
- JobSpec 示例:`src/mvp/v1.1/py/jobspecs/{ppo,grpo,sft}.yaml`
- 提交入口:`src/mvp/v1.1/py/run.py`(在 head 容器内执行,使用 Ray Python SDK 提交)
- 设计文档:`specs/mvp/v1.1/sdk_submit_refactor.md`
2) **共享根路径抽象dev/prod 一致)**
- 引入 `SHARED_ROOT` 作为唯一共享根路径:
- dev建议也用 `/private`docker compose 把宿主机 shared 挂到容器内 `/private`,模拟生产)
- prod固定 `/private`(算力平台容器内 NFS
- 任何代码/脚本不得写死 `/mnt/shared`(允许兼容旧路径但不得作为主路径)。
3) **共享目录分层(新增 `common/``user/`**
- 在 `datasets/hf/jobs/outputs` 之外,新增一个所有用户可读写的共享区:
- `${SHARED_ROOT}/common/`:共享模型/数据/代码快照(多版本 verl / 公共数据)
- `${SHARED_ROOT}/user/`:用户自定义代码(例如 `reward_fn_path` 指向这里)
- v1.1 默认策略:先假设“所有用户可写”(后续 v3 再做权限与隔离)。
4) **可排障基础**
- 每个 job 目录必须有:
- `config/`提交命令、JobSpec 快照、关键 env_vars
- `logs/`Ray job logs + hydra logs如有
- `checkpoints/`:按 `save_freq` 控制频率(默认每 10 step
- 提供“失败快照”能力:收集 `ray status` / `ray job list` / `ray list nodes` / `ray list actors`(最少其中 2 项)写入 job 目录。
- v1.1 submitter 默认落盘:
- `${SHARED_ROOT}/jobs/<id>/config/job_spec.json`
- `${SHARED_ROOT}/jobs/<id>/config/runtime_env.json`
- `${SHARED_ROOT}/jobs/<id>/config/submit_cmd.txt`
- `${SHARED_ROOT}/jobs/<id>/logs/ray_job_submit.out`
- `${SHARED_ROOT}/jobs/<id>/debug/ray_status_{pre,post}.txt`
- `${SHARED_ROOT}/jobs/<id>/debug/ray_job_list_post.txt`
### 2.2 Workload 扩展(至少新增 1 个)
v1.1 需要新增并验收通过两个 workload都要跑通闭环
- **GRPO on Ray**(推荐优先,复用 PPO 入口,通过算法配置切换)
- 基于 `python -m verl.trainer.main_ppo`
- 通过配置覆盖:`algorithm.adv_estimator=grpo`(以及必要的 rollout 参数)
- **SFT on RayRay-native**
- 入口:`python -m verl.trainer.sft_trainer_ray`
- 参考实现:`verl/verl/trainer/sft_trainer_ray.py`(内部会 `ray.init()`
- 需要确保 `ray.init()` 连接已有集群:
- 优先:`runtime_env.env_vars.RAY_ADDRESS=auto`(配合 `ray job submit`
- 兜底:在 v1.1 的 launcher 脚本里显式 `ray.init(address="auto")` 再调用 trainer避免依赖 Ray 的 env var 行为差异)
- 重要细节Ray Job 的 entrypointdriver默认不分配 GPU因此 SFT driver 侧不要强依赖 CUDA
- 推荐:`trainer.device=cpu`driver 只做 orchestration训练由 Ray workers 占 GPU
---
## 3. v1.1 关键设计点
### 3.1 多版本代码与自定义逻辑(为 v3.1 铺路,但 v1.1 先做最小验证)
已确定优先方案A通过 **Ray Job 的 `runtime_env.env_vars`** 注入 `PYTHONPATH`
- `code_path`(例如 `${SHARED_ROOT}/common/code/verl/<commit>`
- 提交 job 时设置:
- `runtime_env.env_vars.PYTHONPATH = "<code_path>:$PYTHONPATH"`
并约定:
- `reward_fn_path` 可指向 `${SHARED_ROOT}/user/code/...` 下用户自定义代码
- 与 `code_path` 一样,必须通过 `runtime_env.env_vars` 确保该路径可被 import例如把 `${SHARED_ROOT}/user/code` 也加入 `PYTHONPATH`
v1.1 中至少做一次“代码覆盖验证”:
- 在 code_path 下放一个可识别的 `verl` 版本标识(例如 `verl.__version__` 打印差异)
- 提交 job 并在日志中确认 import 的是 code_path 的版本(而不是镜像内默认安装)
v1.1 的最小落地方式(已实现):
- 提供代码快照脚本:`src/mvp/v1.1/scripts/31_snapshot_verl_code.sh`
- 会把 `/workspace/verl`(挂载的 repo复制到 `${SHARED_ROOT}/common/code/verl/<code_id>/`
- 并写入 `${code_path}/mvp_marker.py`,用于在 Ray job logs 中验证“选用的是哪份 code_path”
- submitter 会在 entrypoint 前运行 preflight
- 打印 `verl.__file__``mvp_marker.MARKER`
- 由此确认 job 粒度的 PYTHONPATH 生效,且不同 job 可指向不同 `code_path`(多版本共存)
### 3.2 Checkpoint 策略(磁盘保护)
- 默认:`save_freq=10`(每 10 step 保存一次)
- 对于 step 数已知的短任务(例如 29 steps可以通过配置把 `save_freq` 调整为 10/15/29按需求权衡
- 作业目录按 `submission_id` 隔离,方便清理与归档
---
## 4. v1.1 交付物清单(代码 + 文档)
### 4.1 代码(建议落点)
`src/mvp/` 下新增 v1.1 级别的提交器与模板(或在 `src/mvp/v1` 原地演进但要保持 v1 可回归):
- `src/mvp/v1.1/`
- `docker-compose.yaml`(与 v1 互不干扰的容器名/网络名)
- `scripts/`Ray 启动/prepare 保留 bashsubmit 通过 SDK 工具执行)
- `py/`工程化提交层YAML + Ray Python SDK
- `py/configs/`Ray 基础配置)
- `py/jobspecs/`(训练 JobSpec
- `py/run.py`(入口)
此外,为了对齐 dev 环境约束(远程机固定目录):
- 远程机目录必须新增:`argus@h1:/home2/argus/infra/mvp/v1.1/`
- 该目录内需包含 v1.1 的全部内容compose + scripts + README可由本 repo 的 `src/mvp/v1.1/` 同步过去
### 4.2 文档
- `specs/mvp/v1.1/v1.1_action.md`:开发、部署、测试、验收流程(可复现)
- 更新 `specs/mvp/mvp_roadmap.md`:保持路线图与落地一致(按需)
---
## 5. v1.1 验收标准DoD
### 5.1 Hardening DoD
- [ ] 所有提交均由 head 执行 `ray job submit`,且显式 `--submission-id=<id>`
- [ ] 共享根路径由 `SHARED_ROOT` 控制dev/prod 可切换),脚本无硬编码
- [ ] 每个 job 的输出目录为:`${SHARED_ROOT}/jobs/<submission_id>/`
- [ ] checkpoint 不会“每 step 保存”导致爆盘:默认 `save_freq=10`
- [ ] job 失败时,`${SHARED_ROOT}/jobs/<id>/config/` 中有足够信息定位命令、env、ray 状态快照)
- [ ] v1.1 测试前会清理 v1 的遗留容器/进程避免端口、容器名、Ray session 干扰)
### 5.2 Workload DoDGRPO + SFT 都必须)
GRPO必须
- [ ] `algorithm.adv_estimator=grpo` 的 job 可提交并进入 RUNNING
- [ ] job 能跑完最小训练步数(可设 `total_epochs=1``total_training_steps`
- [ ] 输出目录内有日志与至少 1 次 checkpoint或明确不保存并说明原因
SFT必须
- [ ] `sft_trainer_ray` 可连接集群并跑到至少 1 个 step推荐最小训练步数/epoch
- [ ] 输出目录与 checkpoint 策略同 v1.1 规范(落盘到 `${SHARED_ROOT}/jobs/<id>/...`

View File

@ -0,0 +1,148 @@
# MVP v1.1 工程化重构方案(Ray Python SDK 提交层:YAML Config + YAML JobSpec)
本文档把 v1.1 的“代码工程化”目标落到一个明确的设计:**保留现有 scripts**Ray 集群构建、数据准备、模型准备、代码快照),将“任务提交机制”重构为 **Ray Python SDK**`ray.job_submission.JobSubmissionClient`)驱动的 Python 工具层。
> 约束(已确认)
> 1) 基础配置用 YAMLJobSpec 也用 YAML。
> 2) 工具必须在 **head 容器**执行(从 head 发起提交,满足“在 head 提交”的要求)。
> 3) 训练参数组织保持与现在一致:仍然使用 **Hydra overrides** 方式构造 entrypoint。
> 4) 不使用 `requests` 直连 HTTP API只用 Ray SDK
---
## 1. 当前 Ray SDK 能力验证(关键前提)
在 head 容器(`mvp11-ray-head`)中验证:
- Ray 版本:`2.51.1`
- `JobSubmissionClient.submit_job` 支持以下关键字段:
- `submission_id`
- `runtime_env`
- `entrypoint_num_cpus`
- `entrypoint_num_gpus`
- `entrypoint_resources`(用于强制 driver 落 worker
因此 v1.1 可以“纯 SDK”完成提交不需要 `requests` fallback。
---
## 2. 系统分层(不动 scripts只重构提交层
### 2.1 scripts保留
`src/mvp/v1.1/scripts/` 继续负责:
- 容器生命周期:`01_up.sh` / `02_down.sh`
- Ray 启动:`20_start_head.sh` / `21_start_workers.sh`
- 数据/模型准备:`30_prepare_data_and_model.sh`
- 代码快照:`31_snapshot_verl_code.sh`(生成 `${SHARED_ROOT}/common/code/verl/<code_id>/`
scripts 可以新增一个“薄封装”脚本,负责 `docker exec` 进 head 容器并运行 Python 提交器,但 scripts 不再拼 `ray job submit ...` CLI 字符串。
### 2.2 Python 工具层(新增)
`src/mvp/v1.1/py/` 新增提交工具层:
- 读取 Ray 基础配置YAML
- 读取训练 JobSpecYAML
- 用 Ray Python SDK 提交/查询/停止/拉日志
- 将 job 级别产物落盘到:`${SHARED_ROOT}/jobs/<submission_id>/...`
---
## 3. 输入定义:两份 YAML
### 3.1 Ray 基础配置RayConfig YAML
这份配置是“稳定可复用”的,描述 cluster 与 driver placement 等通用信息。
字段建议:
- `address`: `http://127.0.0.1:8265`(从 head 容器内部视角)
- `shared_root`: `/private`
- `entrypoint_num_cpus`: `1`
- `entrypoint_resources`: `{"worker_node": 1}`(强制 driver 使用 worker 才有的资源)
- `runtime_env.env_vars`: HF cache / endpoint 等通用环境变量
- `user_code_path`: `${shared_root}/user/code`(可选,默认值也可)
### 3.2 训练 JobSpecJobSpec YAML
这份配置是“一次训练”语义,描述 workload + 训练参数 + code_path 多版本等。
字段建议:
- `workload`: `ppo|grpo|sft`
- `submission_id`: 可选(不填则生成;但最终必须显式传给 SDK
- `code_path`: `${shared_root}/common/code/verl/<code_id>`(多版本关键字段)
- `model_id`
- 数据路径:`train_file` / `val_file`(按 workload
- 训练参数:`nnodes` / `n_gpus_per_node` / `total_training_steps` / `save_freq` / `test_freq`
注意SFT 的 driver 设备选择):
- Ray job 的 entrypointdriver默认不分配 GPU我们通常不设置 `entrypoint_num_gpus`)。
- `sft_trainer_ray.py` 的 driver 会用 `trainer.device` 做张量统计;若设置为 `cuda` 且 driver 无 GPU会报
- `RuntimeError: No CUDA GPUs are available`
- 因此 v1.1 的 SFT JobSpec 默认应设置:`trainer.device=cpu`(训练 workers 仍会占用 GPU
---
## 4. Python 提交器的职责tool class
建议实现 `RayJobTool`(或类似命名),能力:
### 4.1 submit核心
输入:`RayConfig + JobSpec`
输出:`submission_id`
实现要点:
- `client = JobSubmissionClient(address)`
- 生成/确定 `submission_id`
- `runtime_env` 合并逻辑:
- 合并 config 与 jobspec 的 `env_vars`
- 强制注入多版本:
- `PYTHONPATH = "<code_path>:<user_code_path>:$PYTHONPATH"`
- 构造 entrypoint保持 hydra overrides 风格):
- PPO/GRPO`python3 -m verl.trainer.main_ppo ...`
- SFT`python3 -m verl.trainer.sft_trainer_ray ...`
- 强制 driver 落 worker
- `entrypoint_resources=config.entrypoint_resources`
- `entrypoint_num_cpus=config.entrypoint_num_cpus`
- 落盘产物:
- `${shared_root}/jobs/<id>/config/{ray_config.yaml,jobspec.yaml,submit_payload.json}`
- `${shared_root}/jobs/<id>/logs/submit.out`
- `${shared_root}/jobs/<id>/debug/{ray_status_pre,ray_job_list_post}.txt`(可用 SDK 或 `ray status` 采集)
### 4.2 status / stop / logs / list
- `status(submission_id)`
- `stop(submission_id)`
- `logs(submission_id)`(可支持 tail
- `list()`
---
## 5. `run.py` 入口(必须在 head 容器执行)
建议入口:
- `python3 /workspace/mvp/v1.1/py/run.py --config <ray_config.yaml> --jobspec <jobspec.yaml> --action submit`
- `--action` 支持:`submit|status|stop|logs|list`
host 侧执行方式(由 scripts 薄封装):
- `docker exec mvp11-ray-head python3 /workspace/mvp/v1.1/py/run.py ...`
---
## 6. 验收口径(工程化部分)
1) **SDK 提交**:不使用 `ray job submit` CLI改用 `JobSubmissionClient.submit_job`
2) **driver 仍强制在 worker**SDK 提交时 `entrypoint_resources={"worker_node":1}` 生效。
3) **多版本共存验证**
- 通过 `31_snapshot_verl_code.sh` 生成 `codeA/codeB` 两份 code_path
- 通过两份 JobSpec 分别指向不同 `code_path`
- 在 job logs 中看到不同的 marker例如 `mvp_marker.MARKER`

View File

@ -0,0 +1,333 @@
# MVP v1.1 行动文档(实施方案 / 部署测试 / 验收口径)
本文档面向“把 v1 跑通的实验脚本,升级为可长期回归的 v1.1 最小系统”,并给出**开发改造 → 部署测试 → 验收**的可复现流程。
> v1.1 的核心约束(来自讨论结论)
> - 仍然必须通过 **head 节点执行 `ray job submit`** 提交任务。
> - 训练/driver **必须落在 worker**head 不跑训练)。
> - 多版本 `verl` 共存:同一镜像不变,必须通过 **Ray Job `runtime_env.env_vars` 注入 `PYTHONPATH`** 让 job 粒度选择代码版本。
> - 存储只考虑 NFSdev 环境我们自己 mount生产环境容器内统一看到 `/private/`
---
## 1. 目标与非目标
### 1.1 目标v1.1 必须做到)
1) **可回归**:同一环境连续跑多次 PPO 回归,不互相覆盖,输出按 submission id 归档。
2) **可扩展**:新增并验收通过 2 个 workload**GRPO + SFT**)并跑通闭环。
3) **可排障**:每个 job 目录包含完整的提交快照、关键 env、Ray 状态快照与日志入口。
4) **可多版本共存**:同一 Ray 集群内,不同 job 通过 `PYTHONPATH` 选择不同 `verl` 代码快照。
### 1.2 非目标v1.1 不做)
- 不做平台 API/队列/多租户/RBAC这是 v2/v3
- 不做复杂调度拓扑、IB 域、NUMA、Gang 等自动化策略)。
---
## 2. 运行环境约定dev / prod 一致抽象)
### 2.1 拓扑(单机 3 容器)
- `mvp-ray-head`:无 GPU`ray start --head --num-cpus=0 --num-gpus=0`(控制面 only
- `mvp-ray-worker-0`4 GPU
- `mvp-ray-worker-1`4 GPU
### 2.2 “head 不跑训练”的硬约束实现(必须)
1) **head CPU=0**:从资源层面阻断默认 task/driver 落到 head。
2) **worker 自定义资源标签**worker 启动时带 `--resources='{"worker_node": 100}'`
3) **ray job submit 强制 entrypoint 落 worker**:提交时必须带:
- `--entrypoint-resources='{"worker_node": 1}'`
- `--entrypoint-num-cpus=1`(显式声明 driver 需要的 CPU
> 验证口径:`ray job list``driver_info.node_ip_address` 必须是 worker 的 IP而不是 head IP。
### 2.3 共享存储NFS与路径关键
- 生产环境:容器内共享根路径固定为 `/private/`(算力平台统一挂载 NFS
- 开发环境docker compose 也应把宿主机共享目录挂载到容器内的 `/private/`,从而做到 dev/prod 一致。
统一约定(容器内视角):
- `SHARED_ROOT=/private`
- Job 输出:`${SHARED_ROOT}/jobs/<submission_id>/`
建议的共享目录结构v1.1 新增 `common/``user/`
- `${SHARED_ROOT}/datasets/`:通用数据(例如 gsm8k parquet
- `${SHARED_ROOT}/hf/`HuggingFace cache模型/分词器/权重)
- `${SHARED_ROOT}/jobs/`:按 submission id 归档的作业目录(强制)
- `${SHARED_ROOT}/outputs/`:临时/非强约束输出(不建议长期依赖)
- `${SHARED_ROOT}/ray/`Ray 调试痕迹(可选,通常 Ray 默认写 `/tmp/ray`
- `${SHARED_ROOT}/common/`:所有用户可读写共享区(模型/数据/代码快照)
- `${SHARED_ROOT}/common/models/`:可复用基础模型(可用软链指向 hf cache 或 snapshot
- `${SHARED_ROOT}/common/datasets/`:共享数据(或与 `datasets/` 统一规划)
- `${SHARED_ROOT}/common/code/`:代码快照(多版本 `verl` / 自定义 reward
- `${SHARED_ROOT}/user/`:用户自定义内容(默认所有用户可写)
- `${SHARED_ROOT}/user/code/`reward_fn 等自定义 Python 代码
---
## 3. 开发实施方案(代码改造清单)
> v1.1 建议新增 `src/mvp/v1.1/`(保持 v1 可回归不被破坏)。
### 3.1 JobSpec最小标准化
v1.1 的工程化目标是把“提交机制”迁移到 Ray Python SDK因此输入拆为两份 YAML
1) Ray 基础配置YAMLaddress / entrypoint resources / runtime_env 等
2) 训练 JobSpecYAMLworkload 语义与训练参数(仍由 Hydra overrides 组织)
训练 JobSpecYAML至少包含
- `submission_id`:可空;为空时由 submitter 生成(但最终必须显式传给 `ray job submit --submission-id`
- `workload``ppo` / `grpo` / `sft`v1.1 必须 `ppo` + `grpo` + `sft`
- `shared_root`:默认 `/private`(容器内路径)
- `code_path``verl` 代码快照目录(用于多版本共存)
- `reward_fn_path`(可选):指向 `${shared_root}/user/code/...` 下的 Python 文件或模块入口
- `model` / `dataset`:必须指向共享存储的持久化路径(避免每次下载/生成)
- `ray``address=http://127.0.0.1:8265`(从 head 容器内部视角)
- `resources`
- `entrypoint_resources={"worker_node":1}`
- `entrypoint_num_cpus=1`
- `trainer_overrides`训练参数覆盖v1.1 默认 `total_epochs=1``save_freq=10`
- `env_vars`:会被透传到 `runtime_env.env_vars`(必须包含 `PYTHONPATH` 注入)
交付物v1.1 SDK 方式):
- `src/mvp/v1.1/py/configs/dev.yaml`Ray 基础配置示例)
- `src/mvp/v1.1/py/jobspecs/{ppo,grpo,sft}.yaml`(训练 JobSpec 示例)
- `src/mvp/v1.1/py/run.py`(入口:使用 Ray Python SDK 提交/查询/停止/拉日志)
- 设计文档:`specs/mvp/v1.1/sdk_submit_refactor.md`
### 3.2 多版本 `verl` 共存(必须)
原则:**镜像固定不变**job 粒度通过 `PYTHONPATH` 选择 `verl` 代码快照。
提交时必须注入runtime_env
- `PYTHONPATH="<CODE_PATH>:$PYTHONPATH"``CODE_PATH` 放最前面)
并要求 job 在日志中打印一行确认 import 来源,例如:
- `python -c "import verl,inspect; print(verl.__file__)"`(或训练入口启动时打印)
v1.1 具体实现(可复现):
- 先用 `src/mvp/v1.1/scripts/31_snapshot_verl_code.sh` 生成代码快照目录 `${SHARED_ROOT}/common/code/verl/<code_id>/`
- 该目录里会包含一个 `mvp_marker.py``MARKER=<code_id>`
- 提交 job 时让 `code_path` 指向该快照目录submitter 会在 entrypoint 前打印:
- `MVP_PRECHECK_VERL_FILE`(验证 import 来源)
- `MVP_PRECHECK_MARKER`(验证选择的 code_path
### 3.3 `submit_job` 工具(组装 ray job submit
新增一个提交器(建议 Python避免复杂 bash quoting
- 输入JobSpec JSON
- 产物:
- 生成/确定 `submission_id`
- 创建 `${SHARED_ROOT}/jobs/<id>/config/``logs/``checkpoints/`
- 写入 `config/job_spec.json`(原样快照)
- 写入 `config/runtime_env.json`(最终用于 submit 的 JSON
- 写入 `config/submit_cmd.txt`(最终命令行)
- 执行:在 **head 容器内**运行 `ray job submit ...`
### 3.4 可排障debug bundle强制落盘
在 job 生命周期的关键节点收集并落盘(至少 2 类):
- `ray status`
- `ray job list`
- `ray list nodes`
- `ray list actors`
建议落盘到:
- `${SHARED_ROOT}/jobs/<id>/debug/`(每次收集带时间戳文件名)
### 3.5 Workload 扩展GRPOv1.1 新增闭环)
优先用与 PPO 相同入口 `python -m verl.trainer.main_ppo`,仅通过配置切换算法:
- `algorithm.adv_estimator=grpo`
- 其余保持最小可跑:`total_epochs=1``save_freq=10`
### 3.6 Workload 扩展SFT on Rayv1.1 必须新增闭环)
#### 3.6.1 入口与参考实现
- 入口:`python -m verl.trainer.sft_trainer_ray`
- 参考代码:`verl/verl/trainer/sft_trainer.py`(非 Ray 版本)与 `verl/verl/trainer/sft_trainer_ray.py`Ray 版本)
> v1.1 要验收的是 “SFT on Ray”因此默认使用 `sft_trainer_ray.py`
#### 3.6.2 连接已有 Ray 集群(必须)
`sft_trainer_ray.py` 内部直接调用 `ray.init()`,为了确保它连接到**已有集群**head+workersv1.1 约定:
- 提交 job 时通过 `runtime_env.env_vars` 注入:`RAY_ADDRESS=auto`
如果发现 `ray.init()` 未按预期读取 `RAY_ADDRESS`Ray 版本差异风险v1.1 需要提供一个 launcher 兜底:
- 由 launcher 先显式 `ray.init(address="auto")`,再调用 SFT trainer 逻辑
#### 3.6.3 SFT 数据格式parquet schema
`sft_trainer_ray` 默认使用 `MultiTurnSFTDataset`parquet 中至少需要:
- `messages`list[dict]dict 至少含 `role`/`content`
v1.1 的 `prepare` 阶段需要生成并持久化 SFT 数据,例如:
- `${SHARED_ROOT}/datasets/gsm8k_sft/train.parquet`
- `${SHARED_ROOT}/datasets/gsm8k_sft/val.parquet`(可选)
单条样本的 `messages` 形态示例:
- `[{ "role": "user", "content": "<question>" }, { "role": "assistant", "content": "<answer>" }]`
> 注意:SFT parquet 不能直接复用 PPO/RL 的 parquet(schema 不同)。
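A minimal sketch of generating such a parquet (hypothetical prepare snippet; assumes pandas + pyarrow are available in the image and `${SHARED_ROOT}=/private`):
```python
import pandas as pd

rows = [
    {
        "messages": [
            {"role": "user", "content": "<question>"},
            {"role": "assistant", "content": "<answer>"},
        ]
    },
]
# writes the list[dict] messages column as a nested parquet field
pd.DataFrame(rows).to_parquet("/private/datasets/gsm8k_sft/train.parquet", index=False)
```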
#### 3.6.4 重要细节SFT Ray Driver 不应依赖 GPU
`ray job submit` 模式下,我们的 entrypointdriver默认 **不会分配 GPU**(我们只指定了 `--entrypoint-num-cpus=1`,没有指定 `--entrypoint-num-gpus`)。
`verl/verl/trainer/sft_trainer_ray.py` 的 driver 逻辑里会用 `trainer.device` 来创建 `torch.tensor(..., device=...)` 做统计,如果设置为 `cuda` 且 driver 没有 GPU会触发
- `RuntimeError: No CUDA GPUs are available`
因此 v1.1 的 SFT on Ray 验收默认要求:
- `trainer.device=cpu`driver 只做 orchestration真正训练仍由 Ray 的 TrainingWorker/资源池占用 GPU
### 3.7 v1.1 脚本化交付(必须独立完整)
`src/mvp/v1.1/` 需要像 v1 一样提供一套完整脚本,确保 v1.1 可独立运行、可回归:
- `src/mvp/v1.1/docker-compose.yaml`(容器名建议与 v1 区分,避免冲突)
- `src/mvp/v1.1/scripts/00_prereq_check.sh`(含 GPU/目录/NFS/verl 代码检查)
- `src/mvp/v1.1/scripts/01_up.sh` / `02_down.sh`(起停)
- `src/mvp/v1.1/scripts/20_start_head.sh` / `21_start_workers.sh`
- `src/mvp/v1.1/scripts/30_prepare_data_and_model.sh`(包含 PPO 数据 + SFT 数据)
- `src/mvp/v1.1/scripts/40_submit_ppo_epoch1.sh`
- `src/mvp/v1.1/scripts/41_submit_grpo_epoch1.sh`
- `src/mvp/v1.1/scripts/42_submit_sft_minimal.sh`
- `src/mvp/v1.1/scripts/50_status.sh`
- `src/mvp/v1.1/scripts/31_snapshot_verl_code.sh`(多版本 code snapshot
- `src/mvp/v1.1/scripts/43_submit_jobspec.sh`(通过 JobSpec 提交)
- `src/mvp/v1.1/scripts/12_install_py_deps.sh`(安装 PyYAML 等依赖)
- `src/mvp/v1.1/scripts/44_submit_sdk.sh`(通过 Ray Python SDK + YAML 提交)
---
## 4. 部署与测试流程dev 环境)
> dev 环境以远程机目录为例:`argus@h1:/home2/argus/infra/mvp`。v1.1 的所有内容要求放在:
>
> - `argus@h1:/home2/argus/infra/mvp/v1.1/`
>
> 并在该目录中通过脚本使用 `docker exec` 协调容器。
### 4.0 清理 v1 环境(必须先做)
v1 已在 `argus@h1` 部署过容器与 Ray。为保证 v1.1 的可重复测试,开始 v1.1 前必须清理 v1
1) 停止并删除 v1 容器(推荐用 v1 的 down 脚本)
2) 确认 `docker ps` 中不再有 v1 的 `mvp-ray-head/mvp-ray-worker-*`
v1.1 的脚本里也提供了一个 best-effort 清理脚本:`src/mvp/v1.1/scripts/03_cleanup_v1_legacy.sh`(远程目录中同名脚本)。
### 4.1 环境准备(一次性 / 幂等)
1) 目录检查(远程机):
- `${WORKDIR}/shared/` 存在并具备上述子目录(含 `common/``user/`
2) `verl` 代码目录检查:
- `${WORKDIR}/verl` 不存在则执行 `git clone https://github.com/volcengine/verl.git`
3) GPU 可用性检查:
- 设备存在(例如 0-7 可见),并按 worker 容器分配(每个 worker 4 GPU
4) 模型与数据持久化路径:
- 模型与数据必须落在 `${SHARED_ROOT}` 下;若已存在则跳过下载/生成
- SFT parquet 同样必须落在 `${SHARED_ROOT}` 下;若已存在则跳过生成
### 4.2 启动 Ray 集群(每次测试)
1) `docker compose up -d`
2) head`ray start --head --num-cpus=0 --num-gpus=0 ...`
3) workers`ray start --address=<head>:6379 --resources='{"worker_node":100}' ...`
4) 验证:`ray status` 显示 1 head + 2 worker且 head `CPU:0 GPU:0`
### 4.3 提交 PPO 回归(必须跑 2 次)
1) 生成 JobSpec可用模板 + 覆盖项)
2) 在 head 容器内执行 submitter或直接 `ray job submit`
3) 验证要点:
- `ray job list`driver node 是 worker
- `${SHARED_ROOT}/jobs/<id>/` 下存在 `config/``logs/``checkpoints/`
- checkpoint 每 10 step 产生(例如 `global_step_10`
### 4.4 提交 GRPO新增 workload 验收)
同 PPO但覆盖 `algorithm.adv_estimator=grpo`,确保能进入 RUNNING 并完成最小步数。
### 4.5 提交 SFT on Ray新增 workload 验收,必须)
1) 确认 `${SHARED_ROOT}/datasets/gsm8k_sft/train.parquet` 已存在(由 v1.1 prepare 生成)。
2) 通过 head 容器执行 `ray job submit` 提交 `python -m verl.trainer.sft_trainer_ray`
3) 关键约束:
- `runtime_env.env_vars.RAY_ADDRESS=auto`(连接已有集群)
- `--entrypoint-resources='{"worker_node": 1}'`driver 落 worker
- `PYTHONPATH=<code_path>:$PYTHONPATH`(多版本 verl
4) 最小化训练配置建议(避免 OOM/耗时过长):
- `trainer.total_epochs=1`
- `trainer.total_training_steps=10~30`
- `trainer.save_freq=10`
- `trainer.nnodes=2``trainer.n_gpus_per_node=4`(用满 8 卡做一次最小分布式验证)
- `data.train_files=${SHARED_ROOT}/datasets/gsm8k_sft/train.parquet`
- `trainer.default_local_dir=${SHARED_ROOT}/jobs/<id>/checkpoints`
### 4.6 工程化验证JobSpec + 多版本共存v1.1 必须)
1) 生成两个 code snapshot不同 `CODE_ID`
- `CODE_ID=codeA ./scripts/31_snapshot_verl_code.sh`
- `CODE_ID=codeB ./scripts/31_snapshot_verl_code.sh`
2) 分别修改/复制 JobSpec 模板,使 `code_path` 指向不同 snapshot
- `${SHARED_ROOT}/common/code/verl/codeA`
- `${SHARED_ROOT}/common/code/verl/codeB`
3) 用 JobSpec 提交(必须从 head
- `./scripts/43_submit_jobspec.sh /workspace/mvp/v1.1/templates/ppo.json`(示例)
4) 在 Ray job logs 中验证:
- `MVP_PRECHECK_MARKER` 打印为对应的 `codeA`/`codeB`
- `MVP_PRECHECK_VERL_FILE` 指向 `${SHARED_ROOT}/common/code/verl/...` 而不是镜像内 site-packages
---
## 5. 验收标准Definition of Done
### 5.1 Hardening DoD全部必选
- [ ] 提交必须来自 head能在 head 容器内看到 `ray job submit ...` 的提交记录
- [ ] driver 不在 head`ray job list``driver_info.node_ip_address` ∈ worker IP且 ≠ head IP
- [ ] 输出目录按 submission id 隔离:`${SHARED_ROOT}/jobs/<submission_id>/` 不复用、不覆盖
- [ ] 数据/模型持久化:再次提交时不重复下载/生成(有 “skip if exists” 的日志)
- [ ] checkpoint 策略有效:默认 `save_freq=10`,不会每 step 保存爆盘
- [ ] debug bundle 落盘:`${SHARED_ROOT}/jobs/<id>/debug/` 至少包含 2 类 Ray 状态快照
- [ ] 多版本共存验证通过:日志中能确认 `verl` import 来源来自 JobSpec 指定的 `code_path`
### 5.2 Workload DoDGRPO + SFT 都必须)
- [ ] GRPO job 能提交、RUNNING、完成最小训练步数
- [ ] GRPO job 产物目录满足与 PPO 相同的目录规范与 debug 规范
- [ ] SFT job 能提交、连接已有集群并跑到至少 1 个 step建议最小步数/epoch
- [ ] SFT job 产物目录满足与 PPO 相同的目录规范与 debug 规范
---
## 6. 生产环境部署注意事项v1.1 需要考虑但不强制在 dev 全量模拟)
- 容器由算力平台创建:我们只负责 SSH 进去纳管(启动 ray / 提交 job / 收集产物)。
- 容器内共享路径为 `/private`:所有脚本必须以 `SHARED_ROOT=/private` 工作,不得写死 `/mnt/shared`
- 认证仅内部 token在 submitter 中把 token 作为 env var 透传(不写入日志明文)。

194
specs/mvp/v2.0/v2_api.md Normal file
View File

@ -0,0 +1,194 @@
# MVP v2.0 API 设计(最小可用)
v2.0 的 API 目标是:把 v1.1 的“脚本提交”变成“服务化提交”,并在服务侧实现队列/重试/状态聚合。
约束:
- 内部 token 鉴权(简单即可)。
- Ray Job 提交必须使用 **Ray Python SDK**`JobSubmissionClient`),不使用 `requests` 手写 HTTP。
- 输出与状态必须落盘到 NFS容器内 `/private`)。
---
## 1. 鉴权
- Header`Authorization: Bearer <INTERNAL_TOKEN>`
- v2.0 不做用户体系与权限隔离token 只是“防误用”。
- 配置建议:复用 `src/mvp/v1.1/py/configs/dev.yaml` 并在 `v2.auth.token_env` 指定 token 环境变量名。
## 1.1 运行位置dev 示例)
- 服务进程运行在 **Ray head 容器**(便于访问 Ray Job server
- 宿主机侧用脚本控制(`docker exec`
- `src/mvp/v2.0/scripts/20_start_api.sh`
- `src/mvp/v2.0/scripts/21_stop_api.sh`
- `src/mvp/v2.0/scripts/22_status_api.sh`
- 远程机目录约定(示例):`argus@h1:/home2/argus/infra/mvp/v2/`,容器内挂载到 `/workspace/mvp/v2/`
---
## 2. 资源与 ID 约定
### 2.1 task_id服务层主 ID
- 格式建议:`mvp2-<workload>-<YYYYMMDD>-<HHMMSS>-<suffix>`
- 示例:`mvp2-ppo-20251223-143201-7f3a`
### 2.2 ray_submission_idattempt 级 ID
- 由 service 派生:`<task_id>--a<NN>`
- 示例:`mvp2-ppo-20251223-143201-7f3a--a01`
好处:
- Ray 的 submission id 自带 task_id可直接从 Ray dashboard 反查到服务侧任务。
- `/private/jobs/<ray_submission_id>/...` 目录天然隔离且可读。
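A minimal sketch of this ID derivation (the random-suffix generator is an assumption):
```python
import secrets
from datetime import datetime

def new_task_id(workload: str) -> str:
    ts = datetime.now().strftime("%Y%m%d-%H%M%S")
    return f"mvp2-{workload}-{ts}-{secrets.token_hex(2)}"   # e.g. mvp2-ppo-20251223-143201-7f3a

def ray_submission_id(task_id: str, attempt_no: int) -> str:
    return f"{task_id}--a{attempt_no:02d}"                  # e.g. mvp2-ppo-...-7f3a--a01
```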
---
## 3. JobSpec请求体
v2.0 **要求 JobSpec 使用 v1.1 同款 YAML**(字段与语义保持一致),服务端接收 YAML 文本并解析后入库(同时原样保存 `jobspec_yaml` 便于审计/复现)。
最小字段(示例 YAML
```yaml
workload: "ppo"
submission_id: "" # v2.0 服务端会忽略/覆盖(由 task_id 派生 ray_submission_id
code_path: "/private/common/code/verl/verl_repo"
model_id: "Qwen/Qwen2.5-0.5B-Instruct"
train_file: "/private/datasets/gsm8k/train.parquet"
val_file: "/private/datasets/gsm8k/test.parquet"
nnodes: 2
n_gpus_per_node: 4
total_epochs: 1
total_training_steps: 10
save_freq: 10
test_freq: -1
trainer_device: null # 仅 sft 使用(通常 "cpu"
```
说明:
- `trainer_device` 仅对 `sft` 生效(通常为 `cpu`,避免 driver 无 GPU
- `val_file` 可为 `null`(例如 SFT
---
## 4. API 端点
### 4.1 提交任务
`POST /api/v2/tasks`
Request body
- **raw JobSpec YAML**(与 v1.1 jobspec YAML 结构一致)
Headers
- `Content-Type: application/yaml`(或 `text/yaml`
Response
```json
{
"task_id": "mvp2-ppo-20251223-143201-7f3a",
"state": "QUEUED"
}
```
### 4.2 查询任务(聚合状态)
`GET /api/v2/tasks/{task_id}`
Response示例
```json
{
"task_id": "mvp2-ppo-20251223-143201-7f3a",
"workload": "ppo",
"state": "RUNNING",
"desired_resources": {"nnodes": 2, "n_gpus_per_node": 4, "total_gpus": 8},
"latest_attempt": {
"attempt_no": 1,
"ray_submission_id": "mvp2-ppo-20251223-143201-7f3a--a01",
"ray_status": "RUNNING",
"start_time": "2025-12-23T14:32:10+08:00"
},
"error_summary": null
}
```
### 4.3 列出 attempts
`GET /api/v2/tasks/{task_id}/attempts`
Response
```json
{
"task_id": "mvp2-ppo-20251223-143201-7f3a",
"attempts": [
{
"attempt_no": 1,
"ray_submission_id": "mvp2-ppo-20251223-143201-7f3a--a01",
"ray_status": "FAILED",
"failure_kind": "INSUFFICIENT_RESOURCES",
"message": "Total available GPUs 0 is less than total desired GPUs 8",
"start_time": "...",
"end_time": "..."
}
]
}
```
### 4.4 取消任务
`POST /api/v2/tasks/{task_id}:cancel`
行为:
- 若 task 处于 `SUBMITTED/RUNNING`:调用 Ray Jobs SDK `stop_job(ray_submission_id)` 并标记 `CANCELED`
- 若处于 `QUEUED/PENDING_RESOURCES`:直接标记 `CANCELED`(不提交)
Response
```json
{"task_id":"...","state":"CANCELED"}
```
### 4.5 获取日志
`GET /api/v2/tasks/{task_id}/logs?attempt=latest&tail=2000`
返回:
- `text/plain`(直接透传 Ray Job logs tail
说明:
- v2.0 先用 Ray SDK `get_job_logs()`
- 若需要更稳定的归档,可在 scheduler 定期抓取并落盘v2.1+)。
### 4.6 列出队列(运维/调试)
`GET /api/v2/queue`
Response
```json
{
"pending": [{"task_id":"...","state":"PENDING_RESOURCES","next_run_at":"..."}],
"running": [{"task_id":"...","ray_submission_id":"..."}]
}
```
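A minimal sketch of the submit endpoint, assuming FastAPI (helper names and token handling are assumptions; the real service reads the token via `v2.auth.token_env` and persists to SQLite):
```python
import os
import secrets
from datetime import datetime

import yaml
from fastapi import FastAPI, Header, HTTPException, Request

app = FastAPI()

def _new_task_id(workload: str) -> str:
    ts = datetime.now().strftime("%Y%m%d-%H%M%S")
    return f"mvp2-{workload}-{ts}-{secrets.token_hex(2)}"

@app.post("/api/v2/tasks")
async def submit_task(request: Request, authorization: str = Header(default="")) -> dict:
    if authorization != f"Bearer {os.environ.get('MVP_INTERNAL_TOKEN', '')}":
        raise HTTPException(status_code=401, detail="invalid token")
    try:
        jobspec = yaml.safe_load(await request.body())    # raw JobSpec YAML body
    except yaml.YAMLError as exc:
        raise HTTPException(status_code=400, detail=f"invalid jobspec yaml: {exc}")
    if not isinstance(jobspec, dict) or "workload" not in jobspec:
        raise HTTPException(status_code=400, detail="jobspec missing 'workload'")
    task_id = _new_task_id(jobspec["workload"])
    # here the real service would INSERT (task_id, 'QUEUED', raw yaml) into SQLite
    return {"task_id": task_id, "state": "QUEUED"}
```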
---
## 5. 错误码(最小)
- `400`jobspec 缺字段/非法
- `401`token 不正确
- `404`task 不存在
- `409`:状态冲突(例如已终态又 cancel
- `500`:服务内部错误
---
## 6. SQLite 持久化API 可见性)
v2.0 服务端使用 SQLite 持久化保存:
- tasks`task_id``state``jobspec_yaml``next_run_at``latest_attempt_no` 等)
- attempts`ray_submission_id``ray_status`、失败原因等)
因此:
- `GET /api/v2/tasks/{task_id}` 的数据来自 SQLite再叠加 Ray 状态同步的结果)。
- 进程重启后,队列可恢复,`PENDING_RESOURCES` 的任务会在 `next_run_at` 到期后继续尝试提交。

306
specs/mvp/v2.0/v2_plan.md Normal file
View File

@ -0,0 +1,306 @@
# MVP v2.0 开发计划(服务化入口 + 队列调度 + Ray Jobs SDK)
目标:在 v1.1(脚本 + Ray Jobs SDK已验收通过的基础上交付一个**可独立运行的最小“服务层”**
- 用户通过 **HTTP API** 提交训练任务PPO/GRPO/SFT
- 服务层分配一个**人类易读的任务 ID**`task_id`),并把任务放入队列。
- 后台调度器在资源满足时再向 Ray 集群提交 Ray Job并持续追踪 Ray Job 状态。
- 针对 `verl`**fail-fast 资源预检查**(资源不足直接 `ValueError` 失败)做“服务级重试/排队”,避免用户反复手工提交。
> 约束继承 v1.1head 不跑训练driver 必须落到 worker共享存储只考虑 NFS容器内 `/private`)。
---
## 1. 背景:为什么 v2.0 需要“服务层调度”
在 v1.1 中我们通过 Ray Job 提交 `verl` 训练任务。`verl` PPO/GRPO 在初始化 worker 时会创建资源池,并做一次 fail-fast 的资源检查:
- 触发点:`ResourcePoolManager.create_resource_pool()` 末尾调用 `_check_resource_available()`
- `_check_resource_available()` 使用 `ray._private.state.available_resources_per_node()` 统计“可用 GPU/NPU”如果不足则直接抛异常
- `ValueError: Total available GPUs 0 is less than total desired GPUs 8`
这是一种合理的选择(避免 Ray 层面无限 pending/卡死),但会带来一个平台侧问题:
- 当集群暂时没有足够资源时,用户提交会“立刻失败”,需要手动重试。
因此 v2.0 的服务层要提供:
- **队列 + gang 约束**:资源不满足则任务在服务层 pending不提交到 Ray
- **状态追踪**:一旦提交到 Ray持续获取 Ray Job 状态并回传给用户。
- **资源不足的“自动重试”**:即使发生 race提交时资源够、启动时被抢走也能识别该类失败并延迟重试。
---
## 2. v2.0 交付范围Scope
### 2.1 必做MVP v2.0
1) **HTTP API**(内部 token
- 提交任务、查询任务、取消任务、拉取日志(最小可用)。
2) **任务队列与调度器**
- FIFO先到先服务无配额/公平性(留给 v3+)。
- gang`nnodes` + `n_gpus_per_node` 的固定资源需求“全有才提交”。
3) **Ray Jobs SDK 集成**(不使用 `requests` 自己拼 HTTP
- 通过 `ray.job_submission.JobSubmissionClient` submit/status/stop/logs。
4) **可观测/可排障最小集**
- 每个 task/attempt 落盘配置、提交载荷、Ray 返回的 `submission_id`、关键日志。
5) **失败策略**
- 识别 “资源不足 fail-fast” 类失败 → 转为 `PENDING_RESOURCES` 并延迟重试。
- 其他失败保持 `FAILED`(不自动重试,避免掩盖错误)。
### 2.2 不做v2.0 不实现)
- 多租户/配额/优先级/公平性调度v3
- Pipeline多 job 串联v3+)。
- 完整 UIv3+v2.0 可只提供 OpenAPI/Swagger
- K8s 编排(明确不做,仍是 Native Ray
---
## 2.3 工程原则(开闭原则 / 复用 v1.1
v2.0 研发遵循开闭原则Open/Closed Principle
- **对扩展开放**新增“服务层API + scheduler + SQLite”能力以支持排队、重试、状态聚合。
- **对修改关闭**:尽量不改动 v1.1 已经稳定可用的 Ray Jobs SDK 提交链路代码。
落地方式:
- 将 `src/mvp/v1.1/py/mvp_v11/` 作为“成熟可用提交层”,原样拷贝到 `src/mvp/v2.0/py/mvp_v11/` 供 v2.0 复用。
- v2.0 的新增功能全部在新模块实现(例如 `src/mvp/v2.0/py/mvp_v2/`),通过组合/封装来调用 `mvp_v11`,避免在旧代码中掺杂平台逻辑。
---
## 3. 总体架构v2.0
### 3.1 组件
- **mvp-api**HTTP Server
- 接收 JobSpec结构化字段保持与 v1.1 一致的语义)
- 生成 `task_id` 并写入持久化
- 提供 query/cancel/logs
- **mvp-scheduler**(后台调度器,可与 api 同进程也可拆进程)
- 轮询队列:对 `PENDING_RESOURCES` 的任务做资源判断
- 资源满足 → 调用 Ray Jobs SDK 提交 → 记录 `ray_submission_id`
- 对 `SUBMITTED/RUNNING` 的任务持续同步 Ray Job 状态
- 如果 Ray Job 失败且命中资源不足模式 → 延迟重试
> 部署建议v2.0 先在 **head 容器**内运行该服务dev/prod 行为一致;生产环境只能 ssh 进入容器纳管)。
### 3.4 dev 环境目录约定(示例)
以当前远程开发机为例(`argus@h1`
- 宿主机目录:`/home2/argus/infra/mvp/v2/`
- 容器内挂载:`/workspace/mvp/v2/`
- 共享 NFS容器内统一为 `/private/`(与 v1.1 保持一致)
> 注意:服务脚本(`v2/scripts/*.sh`)应在**宿主机**执行,通过 `docker exec` 控制 head 容器;训练 driver 仍通过 Ray entrypoint_resources 强制落到 worker。
### 3.2 与 Ray/容器的关系
- 服务进程运行在 head或等价能访问 head 的 Job server 地址)。
- 提交时仍使用 v1.1 的强约束:
- head`--num-cpus=0 --num-gpus=0`
- worker`--resources='{\"worker_node\": 100}'`
- job entrypoint`entrypoint_resources={\"worker_node\": 1}` 强制 driver 落 worker
---
## 3.3 配置约定(复用 v1.1 dev.yaml 并扩展)
v2.0 的服务层API + scheduler建议复用 v1.1 已存在的 RayConfig 文件:
- `src/mvp/v1.1/py/configs/dev.yaml`
原因:
- 其中已包含 v1.1 运行所需的 Ray 基础配置Ray Job server address、entrypoint_resources、runtime_env 等v2.0 也需要同样的信息来提交 Ray Jobs。
扩展方式:
- 在该 YAML 中新增一个顶层 `v2:` section存放 v2 服务专属配置API 监听、SQLite 路径、scheduler 间隔等)。
- v1.1 submitter 只读取 `address/shared_root/entrypoint_* /runtime_env/user_code_path`,会忽略 `v2:` 之类的额外字段;因此不会破坏 v1.1。
最小新增项建议(示例):
- `v2.api.host` / `v2.api.port`
- `v2.auth.token_env`(内部 token 环境变量名)
- `v2.sqlite.db_path`(建议 `/private/common/db/mvp_v2.sqlite3`
- `v2.scheduler.tick_s` / `v2.scheduler.retry_interval_s` / `v2.scheduler.max_running_tasks`
---
## 4. 核心数据模型Task / Attempt
### 4.1 Task用户视角的任务
- `task_id`**人类易读**且唯一,例如:
- `mvp2-ppo-20251223-143201-7f3a`
- `workload``ppo|grpo|sft`
- `jobspec`:提交参数(**保持 v1.1 的 jobspec YAML 字段与语义**;服务端解析 YAML 后入库)
- `state`:见第 5 节状态机
- `created_at` / `updated_at`
- `latest_attempt`:指向当前 attempt
- `attempts[]`:历史尝试列表
- `error_summary`:面向用户的简短错误(最后一次失败原因)
### 4.2 Attempt一次真实的 Ray Job 提交)
- `attempt_no`:从 1 开始递增
- `ray_submission_id`:建议派生自 task_id
- `ray_submission_id = <task_id>--a01`
- 好处Ray 侧输出目录天然可读、可追溯
- `status`Ray Job 状态PENDING/RUNNING/SUCCEEDED/FAILED/STOPPED
- `start_time` / `end_time`
- `exit_code`(如可取)
- `failure_kind`(枚举):
- `INSUFFICIENT_RESOURCES`(匹配 “Total available GPUs … less than total desired …”)
- `USER_ERROR`(配置/数据路径错误等)
- `RUNTIME_ERROR`(代码异常)
- `UNKNOWN`
---
## 5. 状态机(服务侧)
建议最小状态集:
- `QUEUED`:已入队,尚未进行资源判断
- `PENDING_RESOURCES`:资源不足,等待(服务侧 pending不提交 Ray
- `SUBMITTING`:正在向 Ray 提交 attempt
- `SUBMITTED`Ray 已接受 submission拿到 `ray_submission_id`
- `RUNNING`Ray Job RUNNING
- `SUCCEEDED`:任务成功(终态)
- `FAILED`:任务失败(终态,除非命中“资源不足重试策略”)
- `CANCELED`:用户取消(终态)
关键转换:
- `QUEUED -> PENDING_RESOURCES`:资源不足
- `QUEUED/PENDING_RESOURCES -> SUBMITTING`:资源满足
- `SUBMITTING -> SUBMITTED`:提交成功
- `SUBMITTED -> RUNNING`:Ray 状态推进
- `SUBMITTED/RUNNING -> SUCCEEDED|FAILED`:Ray 终态
- `FAILED (INSUFFICIENT_RESOURCES) -> PENDING_RESOURCES`:进入延迟重试(attempt_no+1)
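A minimal sketch of enforcing these transitions (the transition table mirrors the list above; everything else is an assumption):
```python
ALLOWED: dict[str, set[str]] = {
    "QUEUED": {"PENDING_RESOURCES", "SUBMITTING", "CANCELED"},
    "PENDING_RESOURCES": {"SUBMITTING", "CANCELED"},
    "SUBMITTING": {"SUBMITTED", "FAILED"},
    "SUBMITTED": {"RUNNING", "SUCCEEDED", "FAILED", "CANCELED"},
    "RUNNING": {"SUCCEEDED", "FAILED", "CANCELED"},
    "FAILED": {"PENDING_RESOURCES"},   # only for INSUFFICIENT_RESOURCES retries
}

def transition(task: dict, new_state: str) -> None:
    """Apply a state change, rejecting anything the state machine does not allow."""
    if new_state not in ALLOWED.get(task["state"], set()):
        raise ValueError(f"illegal transition {task['state']} -> {new_state}")
    task["state"] = new_state
```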
---
## 6. 调度策略v2.0
### 6.1 资源计算(对齐 verl 的“可用资源”口径)
由于 verl 使用 `ray._private.state.available_resources_per_node()` 做“可用资源”统计,
v2.0 的 scheduler 应该尽量使用相同口径,避免:
- 我们认为够了 → 实际 verl 认为不够(仍 fail-fast
- 我们认为不够 → 实际够了(浪费)
策略(建议):
1) scheduler 周期性获取 per-node 可用 GPU
2) 计算 total_available_gpus = sum(node_gpu_available)
3) 任务需求 total_required_gpus = nnodes * n_gpus_per_node
4) 如果 `total_available_gpus < total_required_gpus`,则置为 `PENDING_RESOURCES`
注意:v2.0 先只做总量判断;节点级分配(保证每个 node 恰好 n_gpus_per_node)可作为 v2.1+(资源池/标签/节点纳管)的增强点。
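A minimal sketch of this resource check, using the same private accounting verl reads (with the public API as a fallback); it assumes the scheduler process has already attached to the cluster via `ray.init(address="auto")`:
```python
import ray

def total_available_gpus() -> float:
    """Sum free GPUs per node, matching verl's fail-fast accounting where possible."""
    try:
        # same (private) source verl's _check_resource_available() uses
        from ray._private.state import available_resources_per_node
        per_node = available_resources_per_node()
        return sum(node.get("GPU", 0.0) for node in per_node.values())
    except ImportError:
        return ray.available_resources().get("GPU", 0.0)

def can_run(nnodes: int, n_gpus_per_node: int) -> bool:
    return total_available_gpus() >= nnodes * n_gpus_per_node
```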
### 6.2 排队与并发
- 默认 FIFO。
- 并发度:允许同时跑多个任务,但必须保证资源足够。
- 简化实现:如果任务默认都吃满 8 卡,则 scheduler 实际上一次只能跑一个。
- 若未来支持小任务1*1、1*4可以自然并发。
### 6.3 重试策略(资源不足)
当出现下面模式时判定为 `INSUFFICIENT_RESOURCES`
- Ray Job `status=FAILED`
- `JobDetails.message``job logs` 中匹配:
- `Total available GPUs``less than total desired`
处理:
- 将 task 置为 `PENDING_RESOURCES`
- `next_run_at = now + 60s`(固定间隔,v2.1 可改指数退避)
- attempt_no++ 后重提(新 submission id)
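A minimal sketch of the failure classifier (the regex mirrors the verl message quoted in section 1; the function name is an assumption):
```python
import re

_INSUFFICIENT = re.compile(r"Total available (GPU|NPU)s? .* less than total desired", re.IGNORECASE)

def classify_failure(message: str, logs_tail: str) -> str:
    """Map a failed attempt to a failure_kind; only resource shortfalls are retried."""
    if _INSUFFICIENT.search(f"{message}\n{logs_tail}"):
        return "INSUFFICIENT_RESOURCES"   # task goes back to PENDING_RESOURCES
    return "UNKNOWN"                      # stays FAILED, no auto-retry
```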
---
## 7. SQLite 持久化(队列/状态/attempt
v2.0 引入一个**最小但可恢复的持久化层**:使用 SQLite 保存任务队列与状态,确保:
- api/scheduler 进程重启后,队列不丢;
- task/attempt 历史可追溯;
- 能实现“服务侧 pending + 延迟重试”的确定性行为。
### 7.1 存放位置
建议路径(容器内):
- `DB_PATH=/private/common/db/mvp_v2.sqlite3`
说明:
- v2.0 默认单实例服务(单 writerSQLite 足够。
- 生产环境若 NFS 上的 SQLite 有锁/性能风险v2.1+ 再演进到 Postgres/Redisv2.0 先以“可回放/可恢复”为第一目标。
### 7.2 表设计(建议最小集合)
- `tasks`
- `task_id` (PK)
- `workload`
- `state`(服务侧状态机)
- `jobspec_yaml`(原始 YAML 文本,原样落盘便于审计/复现)
- `created_at`, `updated_at`
- `next_run_at`(用于 `PENDING_RESOURCES` 的延迟重试)
- `error_summary`
- `latest_attempt_no`
- `attempts`
- `task_id` (FK)
- `attempt_no`
- `ray_submission_id`
- `ray_status`
- `failure_kind`
- `message`(截断后的关键信息)
- `start_time`, `end_time`
- `events`(可选,但非常利于排障)
- `id` (PK)
- `task_id`
- `ts`
- `event_type`STATE_TRANSITION / SUBMIT / RAY_STATUS_SYNC / RETRY_SCHEDULED 等)
- `payload_json`
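A minimal sketch of this schema as SQLite DDL (column names follow the bullet lists above; types and constraints are assumptions):
```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    task_id           TEXT PRIMARY KEY,
    workload          TEXT NOT NULL,
    state             TEXT NOT NULL,
    jobspec_yaml      TEXT NOT NULL,
    created_at        TEXT NOT NULL,
    updated_at        TEXT NOT NULL,
    next_run_at       TEXT,
    error_summary     TEXT,
    latest_attempt_no INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS attempts (
    task_id           TEXT NOT NULL REFERENCES tasks(task_id),
    attempt_no        INTEGER NOT NULL,
    ray_submission_id TEXT,
    ray_status        TEXT,
    failure_kind      TEXT,
    message           TEXT,
    start_time        TEXT,
    end_time          TEXT,
    PRIMARY KEY (task_id, attempt_no)
);
CREATE TABLE IF NOT EXISTS events (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id      TEXT NOT NULL,
    ts           TEXT NOT NULL,
    event_type   TEXT NOT NULL,
    payload_json TEXT
);
"""

def init_db(db_path: str = "/private/common/db/mvp_v2.sqlite3") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn
```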
### 7.3 调度循环(与 SQLite 的交互)
scheduler 每个 tick 做三件事:
1) **挑选可运行任务**FIFO + next_run_at
- `state IN ('QUEUED','PENDING_RESOURCES') AND next_run_at <= now`
2) **资源判断**(对齐 verl 的可用资源口径):
- 不满足:更新 `state='PENDING_RESOURCES'`,并写入 `next_run_at=now+60s`
3) **提交 Ray Job 并追踪**
- 提交成功:写入 `attempts` 并更新 `tasks.latest_attempt_no``state='SUBMITTED'`
- 周期性同步 Ray 状态:`SUBMITTED/RUNNING -> SUCCEEDED/FAILED`
- 若失败命中资源不足模式:`FAILED -> PENDING_RESOURCES` + 计划下次重试
---
## 8. 接口与验收DoD
### 8.1 API 能力(最小集合)
详见 `specs/mvp/v2.0/v2_api.md`
### 8.2 验收口径DoD
1) API 提交 PPO/GRPO/SFT返回 `task_id`,并在 NFS 上创建任务目录。
2) 当集群忙GPU 不足)时:
- task 状态为 `PENDING_RESOURCES`(不是 FAILED
- 一旦资源释放,任务自动变为 `SUBMITTED/RUNNING`
3) 当 race 导致触发 verl fail-fast
- attempt 标记为 `INSUFFICIENT_RESOURCES`
- task 回到 `PENDING_RESOURCES`,并在 60s 后自动重试
4) 通过 API 查询 task 能看到:
- 当前 state
- 最新 attempt 的 `ray_submission_id`
- attempt 历史(至少包含开始/结束/失败原因)
5) Cancel 能停止正在运行的 Ray Job调用 Ray Jobs SDK stop
---
## 9. v2.0 交付物建议(目录)
`specs/mvp/v2.0/`(本目录):
- `v2_plan.md`:总体设计与开发计划(本文件)
- `v2_api.md`API 详细定义(请求/响应/字段/错误码)
代码建议位置(后续实现时):
- `src/mvp/v2.0/`
- `py/`API server + scheduler
- `scripts/`:启动/停止/查看状态(仍沿用 v1.1 的 compose/cluster 逻辑)

65
src/mvp/v1.1/README.md Normal file
View File

@ -0,0 +1,65 @@
# MVP v1.1(GRPO + SFT on Ray)运行说明
本目录是一套**独立可运行**的 v1.1 交付:使用 1 个 Ray head不跑训练+ 2 个 Ray worker各 4 GPU在同一宿主机通过 `docker exec` 协调容器,并通过 **head 上的 `ray job submit`** 提交作业,同时强制 driver 落到 worker。
> 远程 dev 环境推荐目录布局:
>
> - `/home2/argus/infra/mvp/`
> - `shared/`持久化datasets/hf/jobs/...
> - `verl/`(代码仓库,用于 prepare / snapshot
> - `v1.1/`本目录内容compose + scripts
---
## 快速开始(远程机 argus@h1
`/home2/argus/infra/mvp/v1.1/` 下执行:
```bash
./scripts/run_all.sh
```
说明:
- `run_all.sh` 会按顺序提交 `ppo -> grpo -> sft`,并等待每个 job 结束后再提交下一个(避免 8 卡集群并发提交导致 “available GPUs 0” 直接失败)。
等价的“分步执行”:
```bash
./scripts/00_prereq_check.sh
./scripts/03_cleanup_v1_legacy.sh
./scripts/05_ensure_verl_repo.sh
./scripts/01_up.sh
./scripts/20_start_head.sh
./scripts/21_start_workers.sh
./scripts/30_prepare_data_and_model.sh
./scripts/12_install_py_deps.sh
./scripts/44_submit_sdk.sh /workspace/mvp/v1.1/py/configs/dev.yaml /workspace/mvp/v1.1/py/jobspecs/ppo.yaml # no-wait
./scripts/44_submit_sdk.sh /workspace/mvp/v1.1/py/configs/dev.yaml /workspace/mvp/v1.1/py/jobspecs/grpo.yaml # no-wait
./scripts/44_submit_sdk.sh /workspace/mvp/v1.1/py/configs/dev.yaml /workspace/mvp/v1.1/py/jobspecs/sft.yaml # no-wait
./scripts/50_status.sh
```
停止并清理:
```bash
./scripts/02_down.sh
```
---
## 关键约束(必须满足)
- **必须通过 head 执行 `ray job submit`** 提交任务(满足“从 head 提交”要求)。
- **head 不跑训练**head 以 `--num-cpus=0 --num-gpus=0` 启动worker 具备自定义资源 `worker_node`;提交时 `--entrypoint-resources='{"worker_node": 1}'` 强制 driver 落 worker。
- **共享路径统一为 `/private`(容器内)**compose 将宿主机 `../shared` 挂载到容器内 `/private`,对齐生产环境。
- **job 级别 code_path**:训练 JobSpec 中的 `code_path` 指向 `/private/common/code/verl/verl_repo`(由 `scripts/30_prepare_data_and_model.sh` 准备)。
---
## 共享目录(容器内 /private
- `/private/datasets/`数据PPO 的 gsm8k RL parquet、SFT parquet
- `/private/hf/`HF 缓存(模型持久化,避免重复下载)
- `/private/jobs/<submission_id>/`:每个 Ray Job 的输出目录logs/config/debug/checkpoints
- `/private/common/`:共享区(模型/数据/代码快照)
- `/private/user/`:用户自定义代码(例如 reward_fn

View File

@ -0,0 +1,91 @@
version: "3.8"
services:
ray_head:
image: verlai/verl:sgl055.latest
container_name: mvp11-ray-head
command: sleep infinity
ports:
- "8265:8265"
- "8080:8080"
volumes:
- ../verl:/workspace/verl
- ../shared:/private
- .:/workspace/mvp/v1.1
- ../v2:/workspace/mvp/v2
shm_size: "10g"
ulimits:
nofile:
soft: 65536
hard: 65536
cap_add:
- SYS_ADMIN
- SYS_PTRACE
networks:
- mvp11-ray-net
environment:
HF_HOME: "/private/hf"
HUGGINGFACE_HUB_CACHE: "/private/hf/hub"
TRANSFORMERS_CACHE: "/private/hf/transformers"
HF_ENDPOINT: "https://hf-mirror.com"
PYTHONUNBUFFERED: "1"
ray_worker_0:
image: verlai/verl:sgl055.latest
container_name: mvp11-ray-worker-0
command: sleep infinity
volumes:
- ../verl:/workspace/verl
- ../shared:/private
- .:/workspace/mvp/v1.1
shm_size: "10g"
ulimits:
nofile:
soft: 65536
hard: 65536
cap_add:
- SYS_ADMIN
- SYS_PTRACE
networks:
- mvp11-ray-net
runtime: nvidia
environment:
NVIDIA_VISIBLE_DEVICES: "0,1,2,3"
NVIDIA_DRIVER_CAPABILITIES: "all"
HF_HOME: "/private/hf"
HUGGINGFACE_HUB_CACHE: "/private/hf/hub"
TRANSFORMERS_CACHE: "/private/hf/transformers"
HF_ENDPOINT: "https://hf-mirror.com"
PYTHONUNBUFFERED: "1"
ray_worker_1:
image: verlai/verl:sgl055.latest
container_name: mvp11-ray-worker-1
command: sleep infinity
volumes:
- ../verl:/workspace/verl
- ../shared:/private
- .:/workspace/mvp/v1.1
shm_size: "10g"
ulimits:
nofile:
soft: 65536
hard: 65536
cap_add:
- SYS_ADMIN
- SYS_PTRACE
networks:
- mvp11-ray-net
runtime: nvidia
environment:
NVIDIA_VISIBLE_DEVICES: "4,5,6,7"
NVIDIA_DRIVER_CAPABILITIES: "all"
HF_HOME: "/private/hf"
HUGGINGFACE_HUB_CACHE: "/private/hf/hub"
TRANSFORMERS_CACHE: "/private/hf/transformers"
HF_ENDPOINT: "https://hf-mirror.com"
PYTHONUNBUFFERED: "1"
networks:
mvp11-ray-net:
driver: bridge

View File

@ -0,0 +1,38 @@
# Ray 基础配置dev 环境 / head 容器内视角)
#
# 说明:
# - v1.1 的 SDK submitter 会读取本文件作为 RayConfig。
# - v2.0 的 API 服务/调度器也复用本文件作为“基础 RayConfig”并在其上扩展 v2 专属配置项(见 v2:)。
address: "http://127.0.0.1:8265"
# 容器内共享根路径(对齐生产 /private
shared_root: "/private"
# 强制 driver 落 workerhead 不跑训练)
entrypoint_num_cpus: 1
entrypoint_resources:
worker_node: 1
# 运行时环境变量(所有 job 通用)
runtime_env:
env_vars:
HF_ENDPOINT: "https://hf-mirror.com"
PYTHONUNBUFFERED: "1"
# 用户自定义代码目录(可被 PYTHONPATH 注入)
user_code_path: "/private/user/code"
# v2.0 服务层配置v1.1 submitter 会忽略这些字段v2.0 服务会读取)
v2:
api:
host: "0.0.0.0"
port: 8080
auth:
# 内部 token 建议通过环境变量提供,避免写入配置文件
token_env: "MVP_INTERNAL_TOKEN"
sqlite:
db_path: "/private/common/db/mvp_v2.sqlite3"
scheduler:
tick_s: 5
retry_interval_s: 60
max_running_tasks: 1

View File

@ -0,0 +1,20 @@
workload: "grpo"
submission_id: ""
code_path: "/private/common/code/verl/verl_repo"
model_id: "Qwen/Qwen2.5-0.5B-Instruct"
train_file: "/private/datasets/gsm8k/train.parquet"
val_file: "/private/datasets/gsm8k/test.parquet"
nnodes: 2
n_gpus_per_node: 4
total_epochs: 1
total_training_steps: 10
save_freq: 10
test_freq: -1

View File

@ -0,0 +1,22 @@
workload: "ppo"
# 可选:不填则 submitter 自动生成
submission_id: ""
# 多版本:指向 code snapshot由 scripts/31_snapshot_verl_code.sh 生成)
code_path: "/private/common/code/verl/verl_repo"
model_id: "Qwen/Qwen2.5-0.5B-Instruct"
train_file: "/private/datasets/gsm8k/train.parquet"
val_file: "/private/datasets/gsm8k/test.parquet"
nnodes: 2
n_gpus_per_node: 4
total_epochs: 1
total_training_steps: 10
save_freq: 10
test_freq: -1

View File

@ -0,0 +1,22 @@
workload: "sft"
submission_id: ""
code_path: "/private/common/code/verl/verl_repo"
model_id: "Qwen/Qwen2.5-0.5B-Instruct"
train_file: "/private/datasets/gsm8k_sft/train.parquet"
val_file: null
nnodes: 2
n_gpus_per_node: 4
total_epochs: 1
total_training_steps: 10
save_freq: 10
# SFT driver 默认不分配 GPUray job entrypoint 不指定 entrypoint_num_gpus因此 driver 侧不要依赖 CUDA
trainer_device: "cpu"

View File

@ -0,0 +1 @@

View File

@ -0,0 +1,96 @@
from __future__ import annotations
from dataclasses import dataclass
from .models import JobSpec
@dataclass(frozen=True)
class BuiltCommand:
argv: list[str]
def build_training_argv(spec: JobSpec, submission_id: str, job_dir: str) -> BuiltCommand:
"""
Returns argv for the actual training process (Hydra overrides preserved).
This argv is executed by a lightweight Python driver entrypoint.
"""
if spec.workload in ("ppo", "grpo"):
algo_overrides: list[str] = []
if spec.workload == "grpo":
algo_overrides.append("algorithm.adv_estimator=grpo")
test_freq = spec.test_freq if spec.test_freq is not None else -1
val_file = spec.val_file if spec.val_file is not None else "null"
argv = [
"python3",
"-m",
"verl.trainer.main_ppo",
f"data.train_files={spec.train_file}",
f"data.val_files={val_file}",
"data.train_batch_size=256",
"data.max_prompt_length=512",
"data.max_response_length=512",
f"actor_rollout_ref.model.path={spec.model_id}",
"actor_rollout_ref.actor.optim.lr=1e-6",
"actor_rollout_ref.actor.ppo_mini_batch_size=64",
"actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4",
"actor_rollout_ref.rollout.name=sglang",
"actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8",
"actor_rollout_ref.rollout.tensor_model_parallel_size=1",
"actor_rollout_ref.rollout.gpu_memory_utilization=0.4",
"actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4",
"critic.optim.lr=1e-5",
f"critic.model.path={spec.model_id}",
"critic.ppo_micro_batch_size_per_gpu=4",
"algorithm.kl_ctrl.kl_coef=0.001",
*algo_overrides,
"trainer.logger=console",
"trainer.val_before_train=False",
f"trainer.n_gpus_per_node={spec.n_gpus_per_node}",
f"trainer.nnodes={spec.nnodes}",
f"trainer.save_freq={spec.save_freq}",
f"trainer.test_freq={test_freq}",
f"trainer.total_epochs={spec.total_epochs}",
f"trainer.total_training_steps={spec.total_training_steps}",
"trainer.resume_mode=disable",
f"trainer.default_local_dir={job_dir}/checkpoints",
"+ray_kwargs.ray_init.address=auto",
f"hydra.run.dir={job_dir}/logs/hydra",
]
return BuiltCommand(argv=argv)
if spec.workload == "sft":
val_override = "null" if spec.val_file is None else spec.val_file
trainer_device = spec.trainer_device or "cpu"
argv = [
"python3",
"-m",
"verl.trainer.sft_trainer_ray",
f"model.path={spec.model_id}",
f"data.train_files={spec.train_file}",
f"data.val_files={val_override}",
"data.train_batch_size=64",
"data.micro_batch_size_per_gpu=1",
"data.max_token_len_per_gpu=2048",
"data.max_length=1024",
"trainer.logger=console",
"trainer.project_name=mvp11-sft",
f"trainer.experiment_name={submission_id}",
f"trainer.total_epochs={spec.total_epochs}",
f"trainer.total_training_steps={spec.total_training_steps}",
f"trainer.save_freq={spec.save_freq}",
"trainer.test_freq=-1",
"trainer.resume_mode=disable",
f"trainer.device={trainer_device}",
f"trainer.default_local_dir={job_dir}/checkpoints",
f"trainer.nnodes={spec.nnodes}",
f"trainer.n_gpus_per_node={spec.n_gpus_per_node}",
f"hydra.run.dir={job_dir}/logs/hydra",
]
return BuiltCommand(argv=argv)
raise ValueError(f"unsupported workload: {spec.workload}")

View File

@ -0,0 +1,63 @@
from __future__ import annotations
import argparse
import os
import shlex
import subprocess
import sys
from pathlib import Path
def _preflight() -> None:
print("MVP_PRECHECK_PYTHON:", sys.executable, flush=True)
print("MVP_PRECHECK_PYTHONPATH:", os.environ.get("PYTHONPATH"), flush=True)
print("MVP_PRECHECK_MVP_CODE_PATH:", os.environ.get("MVP_CODE_PATH"), flush=True)
try:
import verl # type: ignore
print("MVP_PRECHECK_VERL_FILE:", getattr(verl, "__file__", None), flush=True)
except Exception as e:
print("MVP_PRECHECK_VERL_IMPORT_ERROR:", repr(e), flush=True)
try:
import mvp_marker # type: ignore
print("MVP_PRECHECK_MARKER:", getattr(mvp_marker, "MARKER", None), flush=True)
except Exception as e:
print("MVP_PRECHECK_MARKER_MISSING:", repr(e), flush=True)
def main() -> int:
parser = argparse.ArgumentParser()
parser.add_argument("--job-dir", required=True)
parser.add_argument("cmd", nargs=argparse.REMAINDER)
args = parser.parse_args()
job_dir = Path(args.job_dir)
job_dir.mkdir(parents=True, exist_ok=True)
_preflight()
if not args.cmd:
print("no command provided", file=sys.stderr)
return 2
# argparse includes the leading "--" if the caller uses it; strip it.
cmd = list(args.cmd)
if cmd and cmd[0] == "--":
cmd = cmd[1:]
if not cmd:
print("no command provided", file=sys.stderr)
return 2
# Execute training command as a subprocess so that logs are captured by Ray job logs.
cmd_str = " ".join(shlex.quote(x) for x in cmd)
print("MVP_DRIVER_EXEC:", cmd_str, flush=True)
proc = subprocess.run(cmd, check=False)
print("MVP_DRIVER_EXIT_CODE:", proc.returncode, flush=True)
return proc.returncode
if __name__ == "__main__":
raise SystemExit(main())

View File

@ -0,0 +1,121 @@
from __future__ import annotations
from dataclasses import dataclass
from typing import Any
def _require(d: dict[str, Any], key: str) -> Any:
if key not in d or d[key] in (None, ""):
raise ValueError(f"missing required field: {key}")
return d[key]
@dataclass(frozen=True)
class RayConfig:
address: str
shared_root: str
entrypoint_num_cpus: float
entrypoint_resources: dict[str, float]
runtime_env_env_vars: dict[str, str]
user_code_path: str
@staticmethod
def from_dict(d: dict[str, Any]) -> "RayConfig":
runtime_env = d.get("runtime_env") or {}
env_vars = (runtime_env.get("env_vars") or {}) if isinstance(runtime_env, dict) else {}
if not isinstance(env_vars, dict):
raise ValueError("runtime_env.env_vars must be a mapping")
entrypoint_resources = d.get("entrypoint_resources") or {}
if not isinstance(entrypoint_resources, dict):
raise ValueError("entrypoint_resources must be a mapping")
return RayConfig(
address=str(_require(d, "address")),
shared_root=str(_require(d, "shared_root")),
entrypoint_num_cpus=float(d.get("entrypoint_num_cpus", 1)),
entrypoint_resources={str(k): float(v) for k, v in entrypoint_resources.items()},
runtime_env_env_vars={str(k): str(v) for k, v in env_vars.items()},
user_code_path=str(d.get("user_code_path", f"{_require(d, 'shared_root')}/user/code")),
)
def to_public_dict(self) -> dict[str, Any]:
return {
"address": self.address,
"shared_root": self.shared_root,
"entrypoint_num_cpus": self.entrypoint_num_cpus,
"entrypoint_resources": self.entrypoint_resources,
"runtime_env": {"env_vars": self.runtime_env_env_vars},
"user_code_path": self.user_code_path,
}
@dataclass(frozen=True)
class JobSpec:
workload: str # ppo|grpo|sft
submission_id: str | None
code_path: str
model_id: str
train_file: str
val_file: str | None
nnodes: int
n_gpus_per_node: int
total_epochs: int
total_training_steps: int
save_freq: int
test_freq: int | None
trainer_device: str | None # only for sft (driver-side device)
@staticmethod
def from_dict(d: dict[str, Any]) -> "JobSpec":
workload = str(_require(d, "workload"))
if workload not in ("ppo", "grpo", "sft"):
raise ValueError(f"unsupported workload: {workload}")
val_file = d.get("val_file", None)
if val_file in ("", "null"):
val_file = None
test_freq = d.get("test_freq", None)
if test_freq in ("", "null"):
test_freq = None
return JobSpec(
workload=workload,
submission_id=(str(d["submission_id"]) if d.get("submission_id") else None),
code_path=str(_require(d, "code_path")),
model_id=str(_require(d, "model_id")),
train_file=str(_require(d, "train_file")),
val_file=(str(val_file) if val_file is not None else None),
nnodes=int(d.get("nnodes", 2)),
n_gpus_per_node=int(d.get("n_gpus_per_node", 4)),
total_epochs=int(d.get("total_epochs", 1)),
total_training_steps=int(d.get("total_training_steps", 10)),
save_freq=int(d.get("save_freq", 10)),
test_freq=(int(test_freq) if test_freq is not None else None),
trainer_device=(str(d.get("trainer_device")) if d.get("trainer_device") else None),
)
def to_public_dict(self) -> dict[str, Any]:
out: dict[str, Any] = {
"workload": self.workload,
"submission_id": self.submission_id or "",
"code_path": self.code_path,
"model_id": self.model_id,
"train_file": self.train_file,
"val_file": self.val_file,
"nnodes": self.nnodes,
"n_gpus_per_node": self.n_gpus_per_node,
"total_epochs": self.total_epochs,
"total_training_steps": self.total_training_steps,
"save_freq": self.save_freq,
"test_freq": self.test_freq,
}
if self.workload == "sft":
out["trainer_device"] = self.trainer_device or "cpu"
return out

View File

@ -0,0 +1,171 @@
from __future__ import annotations
import json
import os
import shlex
from datetime import datetime
from pathlib import Path
from typing import Any
import ray
from ray.job_submission import JobSubmissionClient
from .builders import build_training_argv
from .models import JobSpec, RayConfig
from .yaml_io import dump_yaml
def _ts() -> str:
return datetime.now().strftime("%Y%m%d_%H%M%S")
def _mkdir(p: Path) -> None:
p.mkdir(parents=True, exist_ok=True)
def _write_text(p: Path, content: str) -> None:
_mkdir(p.parent)
p.write_text(content, encoding="utf-8")
def _write_json(p: Path, obj: Any) -> None:
_write_text(p, json.dumps(obj, indent=2, ensure_ascii=False) + "\n")
def _safe_basename(path: str) -> str:
return path.rstrip("/").split("/")[-1]
class RayJobTool:
def __init__(self, cfg: RayConfig):
self.cfg = cfg
self.client = JobSubmissionClient(cfg.address)
def _job_dir(self, submission_id: str) -> str:
return f"{self.cfg.shared_root}/jobs/{submission_id}"
def _runtime_env(self, spec: JobSpec) -> dict[str, Any]:
env_vars = dict(self.cfg.runtime_env_env_vars)
# Default HF cache
env_vars.setdefault("HF_HOME", f"{self.cfg.shared_root}/hf")
env_vars.setdefault("HUGGINGFACE_HUB_CACHE", f"{self.cfg.shared_root}/hf/hub")
env_vars.setdefault("TRANSFORMERS_CACHE", f"{self.cfg.shared_root}/hf/transformers")
env_vars.setdefault("PYTHONUNBUFFERED", "1")
# Tool code path must be importable on workers (compose mounts v1.1 into all containers).
# Place it before verl code to avoid interfering with verl import priority.
tool_code_path = os.environ.get("MVP_TOOL_CODE_PATH", "/workspace/mvp/v1.1/py")
user_code_path = self.cfg.user_code_path
code_path = spec.code_path
existing = env_vars.get("PYTHONPATH", "")
prefix = f"{tool_code_path}:{code_path}:{user_code_path}"
env_vars["PYTHONPATH"] = f"{prefix}:{existing}" if existing else prefix
# For debugging / log visibility
env_vars["MVP_CODE_PATH"] = code_path
# SFT: ensure ray.init() connects to the cluster
if spec.workload == "sft":
env_vars.setdefault("RAY_ADDRESS", "auto")
return {"env_vars": env_vars}
def submit(self, spec: JobSpec, no_wait: bool) -> str:
submission_id = spec.submission_id or f"mvp11_{spec.workload}_{_ts()}_{os.getpid()}"
job_dir = self._job_dir(submission_id)
built = build_training_argv(spec, submission_id=submission_id, job_dir=job_dir)
entrypoint_argv = [
"python3",
"-m",
"mvp_v11.driver_entrypoint",
"--job-dir",
job_dir,
*built.argv,
]
entrypoint = " ".join(shlex.quote(x) for x in entrypoint_argv)
runtime_env = self._runtime_env(spec)
# Prepare job artifacts directory
job_root = Path(job_dir)
_mkdir(job_root / "config")
_mkdir(job_root / "logs")
_mkdir(job_root / "debug")
_mkdir(job_root / "checkpoints")
_write_text(job_root / "config" / "ray_config.yaml", dump_yaml(self.cfg.to_public_dict()))
_write_text(job_root / "config" / "jobspec.yaml", dump_yaml(spec.to_public_dict()))
_write_json(job_root / "config" / "submit_payload.json", {
"submission_id": submission_id,
"address": self.cfg.address,
"entrypoint": entrypoint,
"entrypoint_num_cpus": self.cfg.entrypoint_num_cpus,
"entrypoint_resources": self.cfg.entrypoint_resources,
"runtime_env": runtime_env,
})
# Pre-submit debug snapshot (ray cluster resources via ray.init)
try:
ray.init(address="auto", ignore_reinit_error=True, log_to_driver=False)
_write_json(job_root / "debug" / "ray_cluster_resources_pre.json", ray.cluster_resources())
_write_json(job_root / "debug" / "ray_available_resources_pre.json", ray.available_resources())
except Exception as e:
_write_text(job_root / "debug" / "ray_resources_pre.error.txt", repr(e) + "\n")
try:
submitted = self.client.submit_job(
entrypoint=entrypoint,
submission_id=submission_id,
runtime_env=runtime_env,
entrypoint_num_cpus=self.cfg.entrypoint_num_cpus,
entrypoint_resources=self.cfg.entrypoint_resources,
)
except Exception as e:
_write_text(job_root / "logs" / "submit.error.txt", repr(e) + "\n")
raise
_write_text(job_root / "config" / "ray_submission_id.txt", submitted + "\n")
# Post-submit debug snapshot via SDK
try:
jobs = self.client.list_jobs()
_write_text(
job_root / "debug" / "ray_job_list_post.json",
json.dumps([_job_details_to_dict(j) for j in jobs], indent=2) + "\n",
)
except Exception as e:
_write_text(job_root / "debug" / "ray_job_list_post.error.txt", repr(e) + "\n")
if not no_wait:
# caller can separately wait; keep submit non-blocking by default in scripts
pass
return submitted
def status(self, submission_id: str) -> str:
return str(self.client.get_job_status(submission_id))
def stop(self, submission_id: str) -> bool:
return bool(self.client.stop_job(submission_id))
def logs(self, submission_id: str) -> str:
return self.client.get_job_logs(submission_id)
def list(self) -> list[dict[str, Any]]:
return [_job_details_to_dict(j) for j in self.client.list_jobs()]
def _job_details_to_dict(obj: Any) -> dict[str, Any]:
# Ray uses pydantic models internally, but depending on bundled pydantic version
# we might get `.model_dump()` (v2) or `.dict()` (v1).
if hasattr(obj, "model_dump"):
return obj.model_dump() # type: ignore[no-any-return]
if hasattr(obj, "dict"):
return obj.dict() # type: ignore[no-any-return]
if hasattr(obj, "__dict__"):
return dict(obj.__dict__)
return {"repr": repr(obj)}

View File

@ -0,0 +1,21 @@
from __future__ import annotations
from pathlib import Path
from typing import Any
import yaml
def load_yaml(path: str) -> dict[str, Any]:
p = Path(path)
data = yaml.safe_load(p.read_text(encoding="utf-8"))
if data is None:
return {}
if not isinstance(data, dict):
raise ValueError(f"yaml root must be a mapping: {path}")
return data
def dump_yaml(data: dict[str, Any]) -> str:
return yaml.safe_dump(data, sort_keys=False, allow_unicode=True)

View File

@ -0,0 +1,2 @@
PyYAML>=6.0.1

69
src/mvp/v1.1/py/run.py Normal file
View File

@ -0,0 +1,69 @@
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import json
import os
import sys
def _ensure_import_path() -> None:
# Allow `python3 /workspace/.../py/run.py` to import `mvp_v11.*`
here = os.path.dirname(os.path.abspath(__file__))
if here not in sys.path:
sys.path.insert(0, here)
def main() -> int:
_ensure_import_path()
from mvp_v11.models import JobSpec, RayConfig
from mvp_v11.ray_job_tool import RayJobTool
from mvp_v11.yaml_io import load_yaml
parser = argparse.ArgumentParser()
parser.add_argument("--config", required=True, help="Ray base config yaml")
parser.add_argument("--jobspec", help="Training jobspec yaml (required for submit)")
parser.add_argument("--action", required=True, choices=["submit", "status", "stop", "logs", "list"])
parser.add_argument("--submission-id", help="For status/stop/logs")
parser.add_argument("--no-wait", action="store_true", help="Submit and return immediately")
parser.add_argument("--tail", type=int, default=0, help="Tail N lines for logs")
args = parser.parse_args()
cfg = RayConfig.from_dict(load_yaml(args.config))
tool = RayJobTool(cfg)
if args.action == "submit":
if not args.jobspec:
raise SystemExit("--jobspec is required for submit")
spec = JobSpec.from_dict(load_yaml(args.jobspec))
submitted = tool.submit(spec, no_wait=args.no_wait)
print(submitted)
return 0
if args.action in ("status", "stop", "logs"):
sid = args.submission_id or ""
if not sid:
raise SystemExit("--submission-id is required for status/stop/logs")
if args.action == "status":
print(tool.status(sid))
return 0
if args.action == "stop":
print(tool.stop(sid))
return 0
logs = tool.logs(sid)
if args.tail and args.tail > 0:
lines = logs.splitlines()
logs = "\n".join(lines[-args.tail :]) + ("\n" if lines else "")
print(logs, end="")
return 0
if args.action == "list":
print(json.dumps(tool.list(), indent=2))
return 0
raise SystemExit(f"unknown action: {args.action}")
if __name__ == "__main__":
raise SystemExit(main())

View File

@ -0,0 +1,42 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
echo "[host] prereq check"
require_cmd docker
require_cmd bash
if ! docker info >/dev/null 2>&1; then
echo "docker is not available (docker info failed)" >&2
exit 1
fi
if ! docker compose version >/dev/null 2>&1; then
echo "docker compose is not available" >&2
exit 1
fi
if ! command -v nvidia-smi >/dev/null 2>&1; then
echo "WARN: nvidia-smi not found on host; GPU validation skipped"
else
echo "[host] GPU summary"
nvidia-smi -L || true
fi
echo "[host] ensure shared dirs exist under ../shared"
mkdir -p "${ROOT_DIR}/../shared"/{datasets,hf,jobs,outputs,ray,common,user}
mkdir -p "${ROOT_DIR}/../shared/common"/{code,datasets,models}
mkdir -p "${ROOT_DIR}/../shared/user"/{code}
echo "[host] ensure verl repo exists under ../verl (required by prepare scripts)"
if [[ ! -d "${ROOT_DIR}/../verl" ]]; then
echo "missing ../verl. On remote, ensure /home2/argus/infra/mvp/verl exists (git clone volcengine/verl)." >&2
exit 1
fi
echo "ok"

View File

@ -0,0 +1,16 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
if [[ "${SKIP_CLEANUP_V1:-0}" != "1" ]]; then
"${SCRIPT_DIR}/03_cleanup_v1_legacy.sh" || true
fi
echo "[host] docker compose up -d (v1.1)"
dc up -d
echo "[host] containers:"
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}' | (head -n 1 && grep -E "mvp11-ray-") || true

View File

@ -0,0 +1,12 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
echo "[host] docker compose down (v1.1)"
dc down -v || true
echo "[host] done"

View File

@ -0,0 +1,16 @@
#!/usr/bin/env bash
set -euo pipefail
echo "[host] cleanup v1 legacy containers (best-effort)"
LEGACY=(mvp-ray-head mvp-ray-worker-0 mvp-ray-worker-1)
for c in "${LEGACY[@]}"; do
if docker ps -a --format '{{.Names}}' | grep -qx "${c}"; then
echo "[host] removing legacy container: ${c}"
docker rm -f "${c}" >/dev/null 2>&1 || true
fi
done
echo "[host] legacy cleanup done"

View File

@ -0,0 +1,23 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
VERL_DIR="${ROOT_DIR}/../verl"
echo "[host] ensure verl repo exists at: ${VERL_DIR}"
if [[ -d "${VERL_DIR}/.git" ]]; then
echo "verl_repo_exists: skip"
exit 0
fi
if [[ -d "${VERL_DIR}" && ! -d "${VERL_DIR}/.git" ]]; then
echo "ERROR: ${VERL_DIR} exists but is not a git repo; please fix manually." >&2
exit 1
fi
echo "cloning volcengine/verl -> ${VERL_DIR}"
git clone https://github.com/volcengine/verl.git "${VERL_DIR}"

View File

@ -0,0 +1,10 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
echo "[head] install python deps for v1.1 SDK submitter (PyYAML)"
dexec "${HEAD_CONTAINER}" bash -lc "pip install -r /workspace/mvp/v1.1/py/requirements.txt"

View File

@ -0,0 +1,18 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
HEAD_IP="$(container_ip "${HEAD_CONTAINER}")"
echo "[head] ray stop (best-effort)"
dexec "${HEAD_CONTAINER}" bash -lc "ray stop --force || true"
echo "[head] start ray head (CPU=0 GPU=0): ${HEAD_IP}"
dexec "${HEAD_CONTAINER}" bash -lc "ray start --head --node-ip-address='${HEAD_IP}' --port=6379 --dashboard-host=0.0.0.0 --dashboard-port=8265 --num-cpus=0 --num-gpus=0 --disable-usage-stats"
echo "[head] ray status"
dexec "${HEAD_CONTAINER}" bash -lc "ray status || true"

View File

@ -0,0 +1,26 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
HEAD_IP="$(container_ip "${HEAD_CONTAINER}")"
HEAD_ADDR="${HEAD_IP}:6379"
start_one() {
local worker="$1"
local ip
ip="$(container_ip "${worker}")"
echo "[${worker}] ray stop (best-effort)"
dexec "${worker}" bash -lc "ray stop --force || true"
echo "[${worker}] start ray worker -> head ${HEAD_ADDR}"
dexec "${worker}" bash -lc "ray start --address='${HEAD_ADDR}' --node-ip-address='${ip}' --resources='{\"worker_node\": 100}' --disable-usage-stats"
}
start_one "${WORKER0_CONTAINER}"
start_one "${WORKER1_CONTAINER}"
echo "[head] ray status"
dexec "${HEAD_CONTAINER}" bash -lc "ray status || true"

View File

@ -0,0 +1,86 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
MODEL_ID="${MODEL_ID:-Qwen/Qwen2.5-0.5B-Instruct}"
PPO_DATA_DIR="${SHARED_ROOT}/datasets/gsm8k"
SFT_DATA_DIR="${SHARED_ROOT}/datasets/gsm8k_sft"
CODE_SNAPSHOT_DIR="${SHARED_ROOT}/common/code/verl/verl_repo"
echo "[head] ensure dataset dirs exist"
dexec "${HEAD_CONTAINER}" bash -lc "mkdir -p '${PPO_DATA_DIR}' '${SFT_DATA_DIR}'"
echo "[head] prepare PPO dataset (gsm8k RL parquet) -> ${PPO_DATA_DIR}"
dexec "${HEAD_CONTAINER}" bash -lc "if [[ -f '${PPO_DATA_DIR}/train.parquet' && -f '${PPO_DATA_DIR}/test.parquet' ]]; then echo 'ppo_dataset_exists: skip'; else python3 /workspace/verl/examples/data_preprocess/gsm8k.py --local_save_dir '${PPO_DATA_DIR}'; fi"
echo "[head] prepare SFT dataset (gsm8k messages parquet) -> ${SFT_DATA_DIR}"
if dexec "${HEAD_CONTAINER}" bash -lc "test -f '${SFT_DATA_DIR}/train.parquet'"; then
echo "[head] sft_dataset_exists: skip"
else
SFT_PY_CODE="$(cat <<'PY'
import os
import pandas as pd
from datasets import load_dataset
out_dir = os.environ["SFT_DATA_DIR"]
os.makedirs(out_dir, exist_ok=True)
ds = load_dataset("openai/gsm8k", "main")
instruction = "Let's think step by step and output the final answer after \"####\"."
def to_messages(example):
q = example["question"].strip() + " " + instruction
a = example["answer"]
return {
"messages": [
{"role": "user", "content": q},
{"role": "assistant", "content": a},
]
}
train = ds["train"].map(to_messages, remove_columns=ds["train"].column_names)
test = ds["test"].map(to_messages, remove_columns=ds["test"].column_names)
pd.DataFrame(train).to_parquet(os.path.join(out_dir, "train.parquet"), index=False)
pd.DataFrame(test).to_parquet(os.path.join(out_dir, "test.parquet"), index=False)
print("sft_dataset_written_ok:", out_dir)
PY
)"
printf "%s\n" "${SFT_PY_CODE}" | dexec "${HEAD_CONTAINER}" bash -lc "SFT_DATA_DIR='${SFT_DATA_DIR}' python3 -"
fi
echo "[head] ensure model cached to persistent HF_HOME (idempotent) -> ${MODEL_ID}"
PY_CODE="$(cat <<'PY'
import os
model_id = os.environ["MODEL_ID"]
hf_home = os.environ.get("HF_HOME", "/private/hf")
os.environ.setdefault("HF_HOME", hf_home)
os.environ.setdefault("HUGGINGFACE_HUB_CACHE", os.path.join(hf_home, "hub"))
os.environ.setdefault("TRANSFORMERS_CACHE", os.path.join(hf_home, "transformers"))
from huggingface_hub import snapshot_download
try:
snapshot_download(repo_id=model_id, local_files_only=True)
print("model_cache_exists: skip", model_id)
except Exception:
print("model_cache_missing: downloading", model_id)
snapshot_download(repo_id=model_id)
print("model_cached_ok:", model_id)
PY
)"
printf "%s\n" "${PY_CODE}" | dexec "${HEAD_CONTAINER}" bash -lc "MODEL_ID='${MODEL_ID}' python3 -"
echo "[head] snapshot verl repo into shared common code path (idempotent best-effort) -> ${CODE_SNAPSHOT_DIR}"
dexec "${HEAD_CONTAINER}" bash -lc "mkdir -p '${CODE_SNAPSHOT_DIR}' && if command -v rsync >/dev/null 2>&1; then rsync -a --delete /workspace/verl/ '${CODE_SNAPSHOT_DIR}/'; else rm -rf '${CODE_SNAPSHOT_DIR:?}/'* && cp -a /workspace/verl/. '${CODE_SNAPSHOT_DIR}/'; fi && echo 'code_snapshot_ok'"

View File

@ -0,0 +1,42 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
# Create an immutable-ish code snapshot under:
# ${SHARED_ROOT}/common/code/verl/<code_id>
#
# By default, code_id is the git commit hash of /workspace/verl (mounted from ../verl).
#
# This enables job-level multi-version coexistence via runtime_env PYTHONPATH injection.
CODE_ID="${CODE_ID:-}"
if [[ -z "${CODE_ID}" ]]; then
CODE_ID="$(dexec "${HEAD_CONTAINER}" bash -lc "git config --global --add safe.directory /workspace/verl >/dev/null 2>&1 || true; git -C /workspace/verl rev-parse HEAD")"
fi
DEST_DIR="${SHARED_ROOT}/common/code/verl/${CODE_ID}"
echo "[head] snapshot verl repo -> ${DEST_DIR}"
if dexec "${HEAD_CONTAINER}" bash -lc "test -d '${DEST_DIR}'"; then
echo "[head] code_snapshot_exists: skip"
exit 0
fi
dexec "${HEAD_CONTAINER}" bash -lc "mkdir -p '${DEST_DIR}'"
# Copy code (no .git needed for runtime)
if dexec "${HEAD_CONTAINER}" bash -lc "command -v rsync >/dev/null 2>&1"; then
dexec "${HEAD_CONTAINER}" bash -lc "rsync -a --delete --exclude='.git' /workspace/verl/ '${DEST_DIR}/'"
else
dexec "${HEAD_CONTAINER}" bash -lc "tar -C /workspace/verl -cf - --exclude='.git' . | tar -C '${DEST_DIR}' -xf -"
fi
# Add a tiny marker module for multi-version validation in Ray job logs.
dexec "${HEAD_CONTAINER}" bash -lc "printf \"%s\\n\" \"MARKER = '${CODE_ID}'\" > '${DEST_DIR}/mvp_marker.py'"
echo "[head] code_snapshot_ok: ${CODE_ID}"

View File

@ -0,0 +1,19 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
CONFIG_PATH="${1:-/workspace/mvp/v1.1/py/configs/dev.yaml}"
JOBSPEC_PATH="${2:-}"
if [[ -z "${JOBSPEC_PATH}" ]]; then
echo "usage: $0 <ray_config_yaml_in_container> <jobspec_yaml_in_container>" >&2
echo "example: $0 /workspace/mvp/v1.1/py/configs/dev.yaml /workspace/mvp/v1.1/py/jobspecs/ppo.yaml" >&2
exit 1
fi
echo "[head] submit via Ray Python SDK"
dexec "${HEAD_CONTAINER}" bash -lc "python3 /workspace/mvp/v1.1/py/run.py --config '${CONFIG_PATH}' --jobspec '${JOBSPEC_PATH}' --action submit --no-wait"

View File

@ -0,0 +1,13 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
echo "[head] ray status"
dexec "${HEAD_CONTAINER}" bash -lc "ray status || true"
echo "[head] ray job list"
dexec "${HEAD_CONTAINER}" bash -lc "ray job list || true"

View File

@ -0,0 +1,52 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
COMPOSE_FILE="${ROOT_DIR}/docker-compose.yaml"
HEAD_CONTAINER="mvp11-ray-head"
WORKER0_CONTAINER="mvp11-ray-worker-0"
WORKER1_CONTAINER="mvp11-ray-worker-1"
SHARED_ROOT="${SHARED_ROOT:-/private}"
RAY_DASHBOARD_ADDR="${RAY_DASHBOARD_ADDR:-http://127.0.0.1:8265}"
dc() {
docker compose --project-directory "${ROOT_DIR}" -f "${COMPOSE_FILE}" "$@"
}
require_cmd() {
local cmd="$1"
command -v "${cmd}" >/dev/null 2>&1 || {
echo "missing required command: ${cmd}" >&2
exit 1
}
}
ensure_container_running() {
local name="$1"
if ! docker ps --format '{{.Names}}' | grep -qx "${name}"; then
echo "container not running: ${name}" >&2
exit 1
fi
}
dexec() {
local name="$1"
shift
ensure_container_running "${name}"
docker exec -i "${name}" "$@"
}
container_ip() {
local name="$1"
ensure_container_running "${name}"
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "${name}"
}
timestamp() {
date +"%Y%m%d_%H%M%S"
}

View File

@ -0,0 +1,56 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
submit_and_wait() {
local jobspec_in_container="$1"
local sid
local out
echo "[host] submit via SDK: ${jobspec_in_container}"
out="$(dexec "${HEAD_CONTAINER}" bash -lc "python3 /workspace/mvp/v1.1/py/run.py --config /workspace/mvp/v1.1/py/configs/dev.yaml --jobspec '${jobspec_in_container}' --action submit --no-wait" | tr -d '\r')"
sid="$(printf '%s\n' "${out}" | tail -n 1)"
if [[ -z "${sid}" ]]; then
echo "[host] failed to parse submission id from output:" >&2
printf '%s\n' "${out}" >&2
exit 1
fi
echo "[host] submitted: ${sid}"
while true; do
st="$(dexec "${HEAD_CONTAINER}" bash -lc "python3 /workspace/mvp/v1.1/py/run.py --config /workspace/mvp/v1.1/py/configs/dev.yaml --action status --submission-id '${sid}'" | tr -d '\r' | tail -n 1)"
echo "[host] status: ${sid} -> ${st}"
case "${st}" in
*SUCCEEDED*)
return 0
;;
*FAILED*|*STOPPED*)
echo "[host] job failed: ${sid} (${st})" >&2
echo "[host] last logs:" >&2
dexec "${HEAD_CONTAINER}" bash -lc "python3 /workspace/mvp/v1.1/py/run.py --config /workspace/mvp/v1.1/py/configs/dev.yaml --action logs --submission-id '${sid}' --tail 200" >&2 || true
return 1
;;
*)
sleep 10
;;
esac
done
}
"${SCRIPT_DIR}/00_prereq_check.sh"
"${SCRIPT_DIR}/03_cleanup_v1_legacy.sh"
"${SCRIPT_DIR}/05_ensure_verl_repo.sh"
"${SCRIPT_DIR}/01_up.sh"
"${SCRIPT_DIR}/20_start_head.sh"
"${SCRIPT_DIR}/21_start_workers.sh"
"${SCRIPT_DIR}/30_prepare_data_and_model.sh"
"${SCRIPT_DIR}/12_install_py_deps.sh"
submit_and_wait /workspace/mvp/v1.1/py/jobspecs/ppo.yaml
submit_and_wait /workspace/mvp/v1.1/py/jobspecs/grpo.yaml
submit_and_wait /workspace/mvp/v1.1/py/jobspecs/sft.yaml
"${SCRIPT_DIR}/50_status.sh"

1877
src/mvp/v1/arch.excalidraw Normal file

File diff suppressed because it is too large

104
src/mvp/v2.0/README.md Normal file
View File

@ -0,0 +1,104 @@
# MVP v2.0 (service entrypoint)
v2.0 adds a **service layer** on top of v1.1 (the Ray Jobs SDK submission path):
- Submit tasks (PPO/GRPO/SFT) via an HTTP API
- Service-side queue + gang resource check
- Detect `verl`'s fail-fast "insufficient resources" failures and retry automatically
- Persist queue/state/attempts in SQLite (on NFS: `/private`)
Design docs:
- `specs/mvp/v2.0/v2_plan.md`
- `specs/mvp/v2.0/v2_api.md`
## 快速开始dev 示例)
约定:
- Ray 容器仍由 v1.1 的 `docker-compose.yaml` 启动head+2 workers
- v2 代码在宿主机:`/home2/argus/infra/mvp/v2/`(容器内挂载 `/workspace/mvp/v2/`
- v2 配置复用 v1.1`/workspace/mvp/v1.1/py/configs/dev.yaml`(扩展了 `v2:` 段)
宿主机执行:
```bash
export MVP_INTERNAL_TOKEN=... # internal token
cd /home2/argus/infra/mvp/v2/scripts
./12_install_v2_deps.sh
./20_start_api.sh
./22_status_api.sh
```
API smoke test (from the host):
```bash
curl -H "Authorization: Bearer ${MVP_INTERNAL_TOKEN}" http://127.0.0.1:8080/api/v2/queue
```
> The process log and pid file (container paths) default to `/private/common/logs/` and `/private/common/run/`.
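If the API does not respond, a quick way to inspect the service from the host (paths are the defaults used by `20_start_api.sh`; the head container name comes from the v1.1 compose setup):
```bash
docker exec -i mvp11-ray-head bash -lc \
  'cat /private/common/run/mvp_v2_api.pid; tail -n 50 /private/common/logs/mvp_v2_api.log'
```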
## Submit / query / stop tasks
Conventions:
- API address (host view): `http://127.0.0.1:8080`
- Auth: `Authorization: Bearer ${MVP_INTERNAL_TOKEN}`
- Request body: **raw YAML** (a JobSpec, same format as the v1.1 jobspec)
### 1) Submit a task (POST /api/v2/tasks)
Prepare a jobspec (example: PPO):
```yaml
workload: "ppo"
submission_id: "" # v2 会忽略/覆盖,自动生成 task_id 并派生 ray_submission_id
code_path: "/private/common/code/verl/verl_repo"
model_id: "Qwen/Qwen2.5-0.5B-Instruct"
train_file: "/private/datasets/gsm8k/train.parquet"
val_file: "/private/datasets/gsm8k/test.parquet"
nnodes: 2
n_gpus_per_node: 4
total_epochs: 1
total_training_steps: 10
save_freq: 10
test_freq: -1
trainer_device: null
```
Submit:
```bash
curl -sS \
-H "Authorization: Bearer ${MVP_INTERNAL_TOKEN}" \
-H "Content-Type: application/yaml" \
--data-binary @jobspec.yaml \
http://127.0.0.1:8080/api/v2/tasks
```
Example response:
```json
{"task_id":"mvp2-ppo-20251223-082813-6426","state":"QUEUED"}
```
### 2) Query a task (GET /api/v2/tasks/{task_id})
```bash
curl -sS \
-H "Authorization: Bearer ${MVP_INTERNAL_TOKEN}" \
http://127.0.0.1:8080/api/v2/tasks/<task_id> | python3 -m json.tool
```
Optional:
- List attempts: `GET /api/v2/tasks/{task_id}/attempts`
- Fetch logs (latest attempt): `GET /api/v2/tasks/{task_id}/logs?tail=2000`
- View the queue: `GET /api/v2/queue` (an illustrative response is sketched below)
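A sketch of the queue response shape (field names follow `Db.list_queue`; task ids and timestamps are illustrative):
```json
{
  "pending": [
    {"task_id": "mvp2-sft-20251223-090000-ab12", "workload": "sft", "state": "QUEUED",
     "nnodes": 2, "n_gpus_per_node": 4, "next_run_at": null,
     "created_at": "2025-12-23T09:00:00Z", "updated_at": "2025-12-23T09:00:00Z"}
  ],
  "running": [
    {"task_id": "mvp2-ppo-20251223-082813-6426", "workload": "ppo", "state": "RUNNING",
     "nnodes": 2, "n_gpus_per_node": 4, "latest_attempt_no": 1,
     "created_at": "2025-12-23T08:28:13Z", "updated_at": "2025-12-23T08:30:01Z"}
  ]
}
```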
### 3) Stop / cancel a task (POST /api/v2/tasks/{task_id}:cancel)
```bash
curl -sS -X POST \
-H "Authorization: Bearer ${MVP_INTERNAL_TOKEN}" \
http://127.0.0.1:8080/api/v2/tasks/<task_id>:cancel
```
Notes:
- If the task has already been submitted to Ray (`SUBMITTED`/`RUNNING`), the service calls the Ray Jobs SDK `stop_job(ray_submission_id)`;
- If the task is still queued (`QUEUED`/`PENDING_RESOURCES`), the service marks it `CANCELED` directly (no attempt is created)
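In both cases the endpoint returns the task id and the terminal state:
```json
{"task_id":"mvp2-ppo-20251223-082813-6426","state":"CANCELED"}
```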

View File

@ -0,0 +1 @@

View File

@ -0,0 +1,96 @@
from __future__ import annotations
from dataclasses import dataclass
from .models import JobSpec
@dataclass(frozen=True)
class BuiltCommand:
argv: list[str]
def build_training_argv(spec: JobSpec, submission_id: str, job_dir: str) -> BuiltCommand:
"""
Returns argv for the actual training process (Hydra overrides preserved).
This argv is executed by a lightweight Python driver entrypoint.
"""
if spec.workload in ("ppo", "grpo"):
algo_overrides: list[str] = []
if spec.workload == "grpo":
algo_overrides.append("algorithm.adv_estimator=grpo")
test_freq = spec.test_freq if spec.test_freq is not None else -1
val_file = spec.val_file if spec.val_file is not None else "null"
argv = [
"python3",
"-m",
"verl.trainer.main_ppo",
f"data.train_files={spec.train_file}",
f"data.val_files={val_file}",
"data.train_batch_size=256",
"data.max_prompt_length=512",
"data.max_response_length=512",
f"actor_rollout_ref.model.path={spec.model_id}",
"actor_rollout_ref.actor.optim.lr=1e-6",
"actor_rollout_ref.actor.ppo_mini_batch_size=64",
"actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4",
"actor_rollout_ref.rollout.name=sglang",
"actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8",
"actor_rollout_ref.rollout.tensor_model_parallel_size=1",
"actor_rollout_ref.rollout.gpu_memory_utilization=0.4",
"actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4",
"critic.optim.lr=1e-5",
f"critic.model.path={spec.model_id}",
"critic.ppo_micro_batch_size_per_gpu=4",
"algorithm.kl_ctrl.kl_coef=0.001",
*algo_overrides,
"trainer.logger=console",
"trainer.val_before_train=False",
f"trainer.n_gpus_per_node={spec.n_gpus_per_node}",
f"trainer.nnodes={spec.nnodes}",
f"trainer.save_freq={spec.save_freq}",
f"trainer.test_freq={test_freq}",
f"trainer.total_epochs={spec.total_epochs}",
f"trainer.total_training_steps={spec.total_training_steps}",
"trainer.resume_mode=disable",
f"trainer.default_local_dir={job_dir}/checkpoints",
"+ray_kwargs.ray_init.address=auto",
f"hydra.run.dir={job_dir}/logs/hydra",
]
return BuiltCommand(argv=argv)
if spec.workload == "sft":
val_override = "null" if spec.val_file is None else spec.val_file
trainer_device = spec.trainer_device or "cpu"
argv = [
"python3",
"-m",
"verl.trainer.sft_trainer_ray",
f"model.path={spec.model_id}",
f"data.train_files={spec.train_file}",
f"data.val_files={val_override}",
"data.train_batch_size=64",
"data.micro_batch_size_per_gpu=1",
"data.max_token_len_per_gpu=2048",
"data.max_length=1024",
"trainer.logger=console",
"trainer.project_name=mvp11-sft",
f"trainer.experiment_name={submission_id}",
f"trainer.total_epochs={spec.total_epochs}",
f"trainer.total_training_steps={spec.total_training_steps}",
f"trainer.save_freq={spec.save_freq}",
"trainer.test_freq=-1",
"trainer.resume_mode=disable",
f"trainer.device={trainer_device}",
f"trainer.default_local_dir={job_dir}/checkpoints",
f"trainer.nnodes={spec.nnodes}",
f"trainer.n_gpus_per_node={spec.n_gpus_per_node}",
f"hydra.run.dir={job_dir}/logs/hydra",
]
return BuiltCommand(argv=argv)
raise ValueError(f"unsupported workload: {spec.workload}")

View File

@ -0,0 +1,63 @@
from __future__ import annotations
import argparse
import os
import shlex
import subprocess
import sys
from pathlib import Path
def _preflight() -> None:
print("MVP_PRECHECK_PYTHON:", sys.executable, flush=True)
print("MVP_PRECHECK_PYTHONPATH:", os.environ.get("PYTHONPATH"), flush=True)
print("MVP_PRECHECK_MVP_CODE_PATH:", os.environ.get("MVP_CODE_PATH"), flush=True)
try:
import verl # type: ignore
print("MVP_PRECHECK_VERL_FILE:", getattr(verl, "__file__", None), flush=True)
except Exception as e:
print("MVP_PRECHECK_VERL_IMPORT_ERROR:", repr(e), flush=True)
try:
import mvp_marker # type: ignore
print("MVP_PRECHECK_MARKER:", getattr(mvp_marker, "MARKER", None), flush=True)
except Exception as e:
print("MVP_PRECHECK_MARKER_MISSING:", repr(e), flush=True)
def main() -> int:
parser = argparse.ArgumentParser()
parser.add_argument("--job-dir", required=True)
parser.add_argument("cmd", nargs=argparse.REMAINDER)
args = parser.parse_args()
job_dir = Path(args.job_dir)
job_dir.mkdir(parents=True, exist_ok=True)
_preflight()
if not args.cmd:
print("no command provided", file=sys.stderr)
return 2
# argparse includes the leading "--" if the caller uses it; strip it.
cmd = list(args.cmd)
if cmd and cmd[0] == "--":
cmd = cmd[1:]
if not cmd:
print("no command provided", file=sys.stderr)
return 2
# Execute training command as a subprocess so that logs are captured by Ray job logs.
cmd_str = " ".join(shlex.quote(x) for x in cmd)
print("MVP_DRIVER_EXEC:", cmd_str, flush=True)
proc = subprocess.run(cmd, check=False)
print("MVP_DRIVER_EXIT_CODE:", proc.returncode, flush=True)
return proc.returncode
if __name__ == "__main__":
raise SystemExit(main())

View File

@ -0,0 +1,121 @@
from __future__ import annotations
from dataclasses import dataclass
from typing import Any
def _require(d: dict[str, Any], key: str) -> Any:
if key not in d or d[key] in (None, ""):
raise ValueError(f"missing required field: {key}")
return d[key]
@dataclass(frozen=True)
class RayConfig:
address: str
shared_root: str
entrypoint_num_cpus: float
entrypoint_resources: dict[str, float]
runtime_env_env_vars: dict[str, str]
user_code_path: str
@staticmethod
def from_dict(d: dict[str, Any]) -> "RayConfig":
runtime_env = d.get("runtime_env") or {}
env_vars = (runtime_env.get("env_vars") or {}) if isinstance(runtime_env, dict) else {}
if not isinstance(env_vars, dict):
raise ValueError("runtime_env.env_vars must be a mapping")
entrypoint_resources = d.get("entrypoint_resources") or {}
if not isinstance(entrypoint_resources, dict):
raise ValueError("entrypoint_resources must be a mapping")
return RayConfig(
address=str(_require(d, "address")),
shared_root=str(_require(d, "shared_root")),
entrypoint_num_cpus=float(d.get("entrypoint_num_cpus", 1)),
entrypoint_resources={str(k): float(v) for k, v in entrypoint_resources.items()},
runtime_env_env_vars={str(k): str(v) for k, v in env_vars.items()},
user_code_path=str(d.get("user_code_path", f"{_require(d, 'shared_root')}/user/code")),
)
def to_public_dict(self) -> dict[str, Any]:
return {
"address": self.address,
"shared_root": self.shared_root,
"entrypoint_num_cpus": self.entrypoint_num_cpus,
"entrypoint_resources": self.entrypoint_resources,
"runtime_env": {"env_vars": self.runtime_env_env_vars},
"user_code_path": self.user_code_path,
}
@dataclass(frozen=True)
class JobSpec:
workload: str # ppo|grpo|sft
submission_id: str | None
code_path: str
model_id: str
train_file: str
val_file: str | None
nnodes: int
n_gpus_per_node: int
total_epochs: int
total_training_steps: int
save_freq: int
test_freq: int | None
trainer_device: str | None # only for sft (driver-side device)
@staticmethod
def from_dict(d: dict[str, Any]) -> "JobSpec":
workload = str(_require(d, "workload"))
if workload not in ("ppo", "grpo", "sft"):
raise ValueError(f"unsupported workload: {workload}")
val_file = d.get("val_file", None)
if val_file in ("", "null"):
val_file = None
test_freq = d.get("test_freq", None)
if test_freq in ("", "null"):
test_freq = None
return JobSpec(
workload=workload,
submission_id=(str(d["submission_id"]) if d.get("submission_id") else None),
code_path=str(_require(d, "code_path")),
model_id=str(_require(d, "model_id")),
train_file=str(_require(d, "train_file")),
val_file=(str(val_file) if val_file is not None else None),
nnodes=int(d.get("nnodes", 2)),
n_gpus_per_node=int(d.get("n_gpus_per_node", 4)),
total_epochs=int(d.get("total_epochs", 1)),
total_training_steps=int(d.get("total_training_steps", 10)),
save_freq=int(d.get("save_freq", 10)),
test_freq=(int(test_freq) if test_freq is not None else None),
trainer_device=(str(d.get("trainer_device")) if d.get("trainer_device") else None),
)
def to_public_dict(self) -> dict[str, Any]:
out: dict[str, Any] = {
"workload": self.workload,
"submission_id": self.submission_id or "",
"code_path": self.code_path,
"model_id": self.model_id,
"train_file": self.train_file,
"val_file": self.val_file,
"nnodes": self.nnodes,
"n_gpus_per_node": self.n_gpus_per_node,
"total_epochs": self.total_epochs,
"total_training_steps": self.total_training_steps,
"save_freq": self.save_freq,
"test_freq": self.test_freq,
}
if self.workload == "sft":
out["trainer_device"] = self.trainer_device or "cpu"
return out

View File

@ -0,0 +1,171 @@
from __future__ import annotations
import json
import os
import shlex
from datetime import datetime
from pathlib import Path
from typing import Any
import ray
from ray.job_submission import JobSubmissionClient
from .builders import build_training_argv
from .models import JobSpec, RayConfig
from .yaml_io import dump_yaml
def _ts() -> str:
return datetime.now().strftime("%Y%m%d_%H%M%S")
def _mkdir(p: Path) -> None:
p.mkdir(parents=True, exist_ok=True)
def _write_text(p: Path, content: str) -> None:
_mkdir(p.parent)
p.write_text(content, encoding="utf-8")
def _write_json(p: Path, obj: Any) -> None:
_write_text(p, json.dumps(obj, indent=2, ensure_ascii=False) + "\n")
def _safe_basename(path: str) -> str:
return path.rstrip("/").split("/")[-1]
class RayJobTool:
def __init__(self, cfg: RayConfig):
self.cfg = cfg
self.client = JobSubmissionClient(cfg.address)
def _job_dir(self, submission_id: str) -> str:
return f"{self.cfg.shared_root}/jobs/{submission_id}"
def _runtime_env(self, spec: JobSpec) -> dict[str, Any]:
env_vars = dict(self.cfg.runtime_env_env_vars)
# Default HF cache
env_vars.setdefault("HF_HOME", f"{self.cfg.shared_root}/hf")
env_vars.setdefault("HUGGINGFACE_HUB_CACHE", f"{self.cfg.shared_root}/hf/hub")
env_vars.setdefault("TRANSFORMERS_CACHE", f"{self.cfg.shared_root}/hf/transformers")
env_vars.setdefault("PYTHONUNBUFFERED", "1")
# Tool code path must be importable on workers (compose mounts v1.1 into all containers).
# Place it before verl code to avoid interfering with verl import priority.
tool_code_path = os.environ.get("MVP_TOOL_CODE_PATH", "/workspace/mvp/v1.1/py")
user_code_path = self.cfg.user_code_path
code_path = spec.code_path
existing = env_vars.get("PYTHONPATH", "")
prefix = f"{tool_code_path}:{code_path}:{user_code_path}"
env_vars["PYTHONPATH"] = f"{prefix}:{existing}" if existing else prefix
# For debugging / log visibility
env_vars["MVP_CODE_PATH"] = code_path
# SFT: ensure ray.init() connects to the cluster
if spec.workload == "sft":
env_vars.setdefault("RAY_ADDRESS", "auto")
return {"env_vars": env_vars}
def submit(self, spec: JobSpec, no_wait: bool) -> str:
submission_id = spec.submission_id or f"mvp11_{spec.workload}_{_ts()}_{os.getpid()}"
job_dir = self._job_dir(submission_id)
built = build_training_argv(spec, submission_id=submission_id, job_dir=job_dir)
entrypoint_argv = [
"python3",
"-m",
"mvp_v11.driver_entrypoint",
"--job-dir",
job_dir,
*built.argv,
]
entrypoint = " ".join(shlex.quote(x) for x in entrypoint_argv)
runtime_env = self._runtime_env(spec)
# Prepare job artifacts directory
job_root = Path(job_dir)
_mkdir(job_root / "config")
_mkdir(job_root / "logs")
_mkdir(job_root / "debug")
_mkdir(job_root / "checkpoints")
_write_text(job_root / "config" / "ray_config.yaml", dump_yaml(self.cfg.to_public_dict()))
_write_text(job_root / "config" / "jobspec.yaml", dump_yaml(spec.to_public_dict()))
_write_json(job_root / "config" / "submit_payload.json", {
"submission_id": submission_id,
"address": self.cfg.address,
"entrypoint": entrypoint,
"entrypoint_num_cpus": self.cfg.entrypoint_num_cpus,
"entrypoint_resources": self.cfg.entrypoint_resources,
"runtime_env": runtime_env,
})
# Pre-submit debug snapshot (ray cluster resources via ray.init)
try:
ray.init(address="auto", ignore_reinit_error=True, log_to_driver=False)
_write_json(job_root / "debug" / "ray_cluster_resources_pre.json", ray.cluster_resources())
_write_json(job_root / "debug" / "ray_available_resources_pre.json", ray.available_resources())
except Exception as e:
_write_text(job_root / "debug" / "ray_resources_pre.error.txt", repr(e) + "\n")
try:
submitted = self.client.submit_job(
entrypoint=entrypoint,
submission_id=submission_id,
runtime_env=runtime_env,
entrypoint_num_cpus=self.cfg.entrypoint_num_cpus,
entrypoint_resources=self.cfg.entrypoint_resources,
)
except Exception as e:
_write_text(job_root / "logs" / "submit.error.txt", repr(e) + "\n")
raise
_write_text(job_root / "config" / "ray_submission_id.txt", submitted + "\n")
# Post-submit debug snapshot via SDK
try:
jobs = self.client.list_jobs()
_write_text(
job_root / "debug" / "ray_job_list_post.json",
json.dumps([_job_details_to_dict(j) for j in jobs], indent=2) + "\n",
)
except Exception as e:
_write_text(job_root / "debug" / "ray_job_list_post.error.txt", repr(e) + "\n")
if not no_wait:
# caller can separately wait; keep submit non-blocking by default in scripts
pass
return submitted
def status(self, submission_id: str) -> str:
return str(self.client.get_job_status(submission_id))
def stop(self, submission_id: str) -> bool:
return bool(self.client.stop_job(submission_id))
def logs(self, submission_id: str) -> str:
return self.client.get_job_logs(submission_id)
def list(self) -> list[dict[str, Any]]:
return [_job_details_to_dict(j) for j in self.client.list_jobs()]
def _job_details_to_dict(obj: Any) -> dict[str, Any]:
# Ray uses pydantic models internally, but depending on bundled pydantic version
# we might get `.model_dump()` (v2) or `.dict()` (v1).
if hasattr(obj, "model_dump"):
return obj.model_dump() # type: ignore[no-any-return]
if hasattr(obj, "dict"):
return obj.dict() # type: ignore[no-any-return]
if hasattr(obj, "__dict__"):
return dict(obj.__dict__)
return {"repr": repr(obj)}

View File

@ -0,0 +1,21 @@
from __future__ import annotations
from pathlib import Path
from typing import Any
import yaml
def load_yaml(path: str) -> dict[str, Any]:
p = Path(path)
data = yaml.safe_load(p.read_text(encoding="utf-8"))
if data is None:
return {}
if not isinstance(data, dict):
raise ValueError(f"yaml root must be a mapping: {path}")
return data
def dump_yaml(data: dict[str, Any]) -> str:
return yaml.safe_dump(data, sort_keys=False, allow_unicode=True)

View File

@ -0,0 +1,2 @@
__all__ = []

View File

@ -0,0 +1,190 @@
from __future__ import annotations
import os
import threading
from typing import Any
import yaml
from fastapi import FastAPI, HTTPException, Request, Response
from mvp_v11.models import JobSpec, RayConfig
from .config import V2Config
from .db import Db
from .ids import new_task_id
from .scheduler import Scheduler
def _utc_now_iso() -> str:
from datetime import datetime
return datetime.utcnow().replace(microsecond=0).isoformat() + "Z"
def _load_yaml_file(path: str) -> dict[str, Any]:
with open(path, "r", encoding="utf-8") as f:
obj = yaml.safe_load(f) or {}
if not isinstance(obj, dict):
raise ValueError("config yaml must be a mapping")
return obj
def create_app(config_path: str) -> FastAPI:
root = _load_yaml_file(config_path)
ray_cfg = RayConfig.from_dict(root)
v2_cfg = V2Config.from_root_dict(root)
db = Db(v2_cfg.sqlite.db_path)
db.init()
scheduler = Scheduler(db=db, ray_cfg=ray_cfg, v2_cfg=v2_cfg)
stop_flag = threading.Event()
tool = scheduler.tool
app = FastAPI(title="mvp-v2", version="2.0")
def _require_token(req: Request) -> None:
token_env = v2_cfg.auth.token_env
expected = os.environ.get(token_env, "")
if not expected:
# Misconfigured service; treat as server error.
raise HTTPException(status_code=500, detail=f"missing token env: {token_env}")
auth = req.headers.get("authorization") or ""
if not auth.startswith("Bearer "):
raise HTTPException(status_code=401, detail="missing bearer token")
got = auth.removeprefix("Bearer ").strip()
if got != expected:
raise HTTPException(status_code=401, detail="invalid token")
@app.on_event("startup")
def _startup() -> None:
t = threading.Thread(target=scheduler.run_forever, args=(stop_flag,), daemon=True)
t.start()
@app.on_event("shutdown")
def _shutdown() -> None:
stop_flag.set()
@app.post("/api/v2/tasks")
async def submit_task(req: Request) -> dict[str, Any]:
_require_token(req)
body = (await req.body()).decode("utf-8")
obj = yaml.safe_load(body) or {}
if not isinstance(obj, dict):
raise HTTPException(status_code=400, detail="jobspec must be a YAML mapping")
try:
spec = JobSpec.from_dict(obj)
except Exception as e:
raise HTTPException(status_code=400, detail=f"invalid jobspec: {e!r}")
task_id = new_task_id(spec.workload)
db.create_task(
task_id=task_id,
workload=spec.workload,
jobspec_yaml=body,
nnodes=spec.nnodes,
n_gpus_per_node=spec.n_gpus_per_node,
)
return {"task_id": task_id, "state": "QUEUED"}
@app.get("/api/v2/tasks/{task_id}")
async def get_task(task_id: str, req: Request) -> dict[str, Any]:
_require_token(req)
row = db.get_task(task_id)
if not row:
raise HTTPException(status_code=404, detail="task not found")
attempts = db.list_attempts(task_id)
latest_attempt = attempts[-1] if attempts else None
desired = {
"nnodes": int(row["nnodes"]),
"n_gpus_per_node": int(row["n_gpus_per_node"]),
"total_gpus": int(row["nnodes"]) * int(row["n_gpus_per_node"]),
}
out: dict[str, Any] = {
"task_id": row["task_id"],
"workload": row["workload"],
"state": row["state"],
"created_at": row.get("created_at"),
"updated_at": row.get("updated_at"),
"desired_resources": desired,
"error_summary": row.get("error_summary"),
}
if latest_attempt:
out["latest_attempt"] = {
"attempt_no": latest_attempt["attempt_no"],
"ray_submission_id": latest_attempt["ray_submission_id"],
"ray_status": latest_attempt.get("ray_status"),
"start_time": latest_attempt.get("start_time"),
"end_time": latest_attempt.get("end_time"),
"failure_kind": latest_attempt.get("failure_kind"),
"message": latest_attempt.get("message"),
}
return out
@app.get("/api/v2/tasks/{task_id}/attempts")
async def get_attempts(task_id: str, req: Request) -> dict[str, Any]:
_require_token(req)
row = db.get_task(task_id)
if not row:
raise HTTPException(status_code=404, detail="task not found")
return {"task_id": task_id, "attempts": db.list_attempts(task_id)}
@app.post("/api/v2/tasks/{task_id}:cancel")
async def cancel(task_id: str, req: Request) -> dict[str, Any]:
_require_token(req)
row = db.get_task(task_id)
if not row:
raise HTTPException(status_code=404, detail="task not found")
state = str(row["state"])
if state in ("SUCCEEDED", "FAILED", "CANCELED"):
raise HTTPException(status_code=409, detail=f"task already terminal: {state}")
attempts = db.list_attempts(task_id)
if attempts:
ray_sid = str(attempts[-1]["ray_submission_id"])
try:
tool.stop(ray_sid)
except Exception:
pass
# Mark attempt as canceled on the service side so that API doesn't keep reporting RUNNING.
# Ray stop is async; we deliberately reflect the user's intent here.
db.update_attempt(
task_id=task_id,
attempt_no=int(attempts[-1]["attempt_no"]),
ray_status="STOPPED",
failure_kind="CANCELED",
message="Canceled by user via API (Ray stop requested).",
end_time=_utc_now_iso(),
)
db.set_task_state(task_id=task_id, state="CANCELED", event_type="CANCELED")
return {"task_id": task_id, "state": "CANCELED"}
@app.get("/api/v2/tasks/{task_id}/logs")
async def logs(task_id: str, req: Request, tail: int = 2000, attempt: str = "latest") -> Response:
_require_token(req)
row = db.get_task(task_id)
if not row:
raise HTTPException(status_code=404, detail="task not found")
attempts = db.list_attempts(task_id)
if not attempts:
raise HTTPException(status_code=404, detail="no attempts yet")
a = attempts[-1] if attempt == "latest" else None
if a is None:
raise HTTPException(status_code=400, detail="only attempt=latest supported in v2.0")
ray_sid = str(a["ray_submission_id"])
text = tool.logs(ray_sid)
if tail and tail > 0:
lines = text.splitlines()
text = "\n".join(lines[-tail:]) + ("\n" if lines else "")
return Response(content=text, media_type="text/plain")
@app.get("/api/v2/queue")
async def queue(req: Request) -> dict[str, Any]:
_require_token(req)
return db.list_queue()
return app

View File

@ -0,0 +1,68 @@
from __future__ import annotations
from dataclasses import dataclass
from typing import Any
@dataclass(frozen=True)
class V2ApiConfig:
host: str = "0.0.0.0"
port: int = 8080
@dataclass(frozen=True)
class V2AuthConfig:
token_env: str = "MVP_INTERNAL_TOKEN"
@dataclass(frozen=True)
class V2SqliteConfig:
db_path: str
@dataclass(frozen=True)
class V2SchedulerConfig:
tick_s: int = 5
retry_interval_s: int = 60
max_running_tasks: int = 1
@dataclass(frozen=True)
class V2Config:
api: V2ApiConfig
auth: V2AuthConfig
sqlite: V2SqliteConfig
scheduler: V2SchedulerConfig
@staticmethod
def from_root_dict(root: dict[str, Any]) -> "V2Config":
v2 = root.get("v2") or {}
if not isinstance(v2, dict):
raise ValueError("config.v2 must be a mapping")
api = v2.get("api") or {}
auth = v2.get("auth") or {}
sqlite = v2.get("sqlite") or {}
scheduler = v2.get("scheduler") or {}
if not isinstance(api, dict) or not isinstance(auth, dict) or not isinstance(sqlite, dict) or not isinstance(scheduler, dict):
raise ValueError("config.v2.{api,auth,sqlite,scheduler} must be mappings")
shared_root = str(root.get("shared_root") or "/private")
default_db_path = f"{shared_root}/common/db/mvp_v2.sqlite3"
db_path = str(sqlite.get("db_path") or default_db_path)
return V2Config(
api=V2ApiConfig(
host=str(api.get("host") or "0.0.0.0"),
port=int(api.get("port") or 8080),
),
auth=V2AuthConfig(token_env=str(auth.get("token_env") or "MVP_INTERNAL_TOKEN")),
sqlite=V2SqliteConfig(db_path=db_path),
scheduler=V2SchedulerConfig(
tick_s=int(scheduler.get("tick_s") or 5),
retry_interval_s=int(scheduler.get("retry_interval_s") or 60),
max_running_tasks=int(scheduler.get("max_running_tasks") or 1),
),
)
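For reference, a minimal sketch of the `v2:` section that `from_root_dict` reads from the shared config YAML (keys follow the loader above; the values shown are its defaults):
```yaml
v2:
  api:
    host: "0.0.0.0"
    port: 8080
  auth:
    token_env: "MVP_INTERNAL_TOKEN"
  sqlite:
    db_path: "/private/common/db/mvp_v2.sqlite3"
  scheduler:
    tick_s: 5
    retry_interval_s: 60
    max_running_tasks: 1
```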

View File

@ -0,0 +1,254 @@
from __future__ import annotations
import os
import sqlite3
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Iterator
def _utc_now_iso() -> str:
# Keep it simple; wall-clock ordering only.
import datetime as _dt
return _dt.datetime.utcnow().replace(microsecond=0).isoformat() + "Z"
@dataclass(frozen=True)
class Db:
db_path: str
def _connect(self) -> sqlite3.Connection:
os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
conn = sqlite3.connect(self.db_path, timeout=30, isolation_level=None)
conn.row_factory = sqlite3.Row
conn.execute("PRAGMA journal_mode=WAL;")
conn.execute("PRAGMA foreign_keys=ON;")
return conn
def init(self) -> None:
with self._connect() as conn:
conn.execute(
"""
CREATE TABLE IF NOT EXISTS tasks (
task_id TEXT PRIMARY KEY,
workload TEXT NOT NULL,
state TEXT NOT NULL,
jobspec_yaml TEXT NOT NULL,
nnodes INTEGER NOT NULL,
n_gpus_per_node INTEGER NOT NULL,
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL,
next_run_at TEXT,
error_summary TEXT,
latest_attempt_no INTEGER NOT NULL DEFAULT 0
)
"""
)
conn.execute(
"""
CREATE TABLE IF NOT EXISTS attempts (
task_id TEXT NOT NULL,
attempt_no INTEGER NOT NULL,
ray_submission_id TEXT NOT NULL UNIQUE,
ray_status TEXT,
failure_kind TEXT,
message TEXT,
start_time TEXT NOT NULL,
end_time TEXT,
PRIMARY KEY (task_id, attempt_no),
FOREIGN KEY (task_id) REFERENCES tasks(task_id) ON DELETE CASCADE
)
"""
)
conn.execute(
"""
CREATE TABLE IF NOT EXISTS events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
task_id TEXT,
ts TEXT NOT NULL,
event_type TEXT NOT NULL,
payload_json TEXT,
FOREIGN KEY (task_id) REFERENCES tasks(task_id) ON DELETE CASCADE
)
"""
)
@contextmanager
def tx(self) -> Iterator[sqlite3.Connection]:
conn = self._connect()
try:
conn.execute("BEGIN IMMEDIATE;")
yield conn
conn.execute("COMMIT;")
except Exception:
conn.execute("ROLLBACK;")
raise
finally:
conn.close()
def create_task(self, *, task_id: str, workload: str, jobspec_yaml: str, nnodes: int, n_gpus_per_node: int) -> dict[str, Any]:
now = _utc_now_iso()
with self.tx() as conn:
conn.execute(
"""
INSERT INTO tasks (task_id, workload, state, jobspec_yaml, nnodes, n_gpus_per_node, created_at, updated_at)
VALUES (?, ?, 'QUEUED', ?, ?, ?, ?, ?)
""",
(task_id, workload, jobspec_yaml, nnodes, n_gpus_per_node, now, now),
)
conn.execute(
"INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (?, ?, 'TASK_CREATED', ?)",
(task_id, now, None),
)
row = conn.execute("SELECT * FROM tasks WHERE task_id = ?", (task_id,)).fetchone()
return dict(row) if row else {}
def get_task(self, task_id: str) -> dict[str, Any] | None:
with self._connect() as conn:
row = conn.execute("SELECT * FROM tasks WHERE task_id = ?", (task_id,)).fetchone()
return dict(row) if row else None
def list_attempts(self, task_id: str) -> list[dict[str, Any]]:
with self._connect() as conn:
rows = conn.execute(
"SELECT * FROM attempts WHERE task_id = ? ORDER BY attempt_no ASC", (task_id,)
).fetchall()
return [dict(r) for r in rows]
def list_queue(self) -> dict[str, list[dict[str, Any]]]:
with self._connect() as conn:
pending = conn.execute(
"""
SELECT task_id, workload, state, nnodes, n_gpus_per_node, next_run_at, created_at, updated_at
FROM tasks
WHERE state IN ('QUEUED','PENDING_RESOURCES')
ORDER BY created_at ASC
LIMIT 200
"""
).fetchall()
running = conn.execute(
"""
SELECT task_id, workload, state, nnodes, n_gpus_per_node, latest_attempt_no, created_at, updated_at
FROM tasks
WHERE state IN ('SUBMITTING','SUBMITTED','RUNNING')
ORDER BY updated_at ASC
LIMIT 200
"""
).fetchall()
return {"pending": [dict(r) for r in pending], "running": [dict(r) for r in running]}
def count_running(self) -> int:
with self._connect() as conn:
row = conn.execute(
"SELECT COUNT(1) AS n FROM tasks WHERE state IN ('SUBMITTING','SUBMITTED','RUNNING')"
).fetchone()
return int(row["n"]) if row else 0
def list_active_tasks(self, limit: int = 50) -> list[dict[str, Any]]:
with self._connect() as conn:
rows = conn.execute(
"SELECT * FROM tasks WHERE state IN ('SUBMITTING','SUBMITTED','RUNNING') ORDER BY updated_at ASC LIMIT ?",
(int(limit),),
).fetchall()
return [dict(r) for r in rows]
def pick_next_runnable_task(self) -> dict[str, Any] | None:
now = _utc_now_iso()
with self._connect() as conn:
row = conn.execute(
"""
SELECT *
FROM tasks
WHERE state IN ('QUEUED','PENDING_RESOURCES')
AND (next_run_at IS NULL OR next_run_at <= ?)
ORDER BY created_at ASC
LIMIT 1
""",
(now,),
).fetchone()
return dict(row) if row else None
def set_task_state(
self,
*,
task_id: str,
state: str,
error_summary: str | None = None,
next_run_at: str | None = None,
latest_attempt_no: int | None = None,
event_type: str = "STATE_UPDATE",
payload_json: str | None = None,
) -> None:
now = _utc_now_iso()
with self.tx() as conn:
sets = ["state = ?", "updated_at = ?"]
params: list[Any] = [state, now]
if error_summary is not None:
sets.append("error_summary = ?")
params.append(error_summary)
if next_run_at is not None:
sets.append("next_run_at = ?")
params.append(next_run_at)
if latest_attempt_no is not None:
sets.append("latest_attempt_no = ?")
params.append(int(latest_attempt_no))
params.append(task_id)
conn.execute(f"UPDATE tasks SET {', '.join(sets)} WHERE task_id = ?", tuple(params))
conn.execute(
"INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (?, ?, ?, ?)",
(task_id, now, event_type, payload_json),
)
def create_attempt(self, *, task_id: str, attempt_no: int, ray_submission_id: str) -> None:
now = _utc_now_iso()
with self.tx() as conn:
conn.execute(
"""
INSERT INTO attempts (task_id, attempt_no, ray_submission_id, ray_status, start_time)
VALUES (?, ?, ?, ?, ?)
""",
(task_id, attempt_no, ray_submission_id, "SUBMITTING", now),
)
conn.execute(
"INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (?, ?, 'ATTEMPT_CREATED', ?)",
(task_id, now, None),
)
def update_attempt(
self,
*,
task_id: str,
attempt_no: int,
ray_status: str | None = None,
failure_kind: str | None = None,
message: str | None = None,
end_time: str | None = None,
) -> None:
now = _utc_now_iso()
with self.tx() as conn:
sets = []
params: list[Any] = []
if ray_status is not None:
sets.append("ray_status = ?")
params.append(ray_status)
if failure_kind is not None:
sets.append("failure_kind = ?")
params.append(failure_kind)
if message is not None:
sets.append("message = ?")
params.append(message)
if end_time is not None:
sets.append("end_time = ?")
params.append(end_time)
if not sets:
return
params.extend([task_id, attempt_no])
conn.execute(
f"UPDATE attempts SET {', '.join(sets)} WHERE task_id = ? AND attempt_no = ?",
tuple(params),
)
conn.execute(
"INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (?, ?, 'ATTEMPT_UPDATE', ?)",
(task_id, now, None),
)

View File

@ -0,0 +1,15 @@
from __future__ import annotations
import secrets
from datetime import datetime
def new_task_id(workload: str) -> str:
ts = datetime.now().strftime("%Y%m%d-%H%M%S")
suffix = secrets.token_hex(2)
return f"mvp2-{workload}-{ts}-{suffix}"
def attempt_submission_id(task_id: str, attempt_no: int) -> str:
return f"{task_id}--a{attempt_no:02d}"

View File

@ -0,0 +1,37 @@
from __future__ import annotations
from dataclasses import dataclass
import ray
@dataclass(frozen=True)
class ClusterAvailable:
total_available_gpus: float
total_available_npus: float
def get_cluster_available() -> ClusterAvailable:
# Align with verl's fail-fast check which uses ray._private.state.available_resources_per_node().
# This is a best-effort internal API and may change with Ray versions.
try:
import ray._private.state # type: ignore
per_node = ray._private.state.available_resources_per_node()
except Exception:
# If we cannot fetch per-node resources, conservatively return 0.
return ClusterAvailable(total_available_gpus=0.0, total_available_npus=0.0)
total_gpu = 0.0
total_npu = 0.0
for _, info in per_node.items():
if not isinstance(info, dict):
continue
total_gpu += float(info.get("GPU", 0) or 0)
total_npu += float(info.get("NPU", 0) or 0)
return ClusterAvailable(total_available_gpus=total_gpu, total_available_npus=total_npu)
def ensure_ray_connected() -> None:
ray.init(address="auto", ignore_reinit_error=True, log_to_driver=False)

View File

@ -0,0 +1,185 @@
from __future__ import annotations
import re
import time
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any
import yaml
from mvp_v11.models import JobSpec, RayConfig
from mvp_v11.ray_job_tool import RayJobTool
from .config import V2Config
from .db import Db
from .ids import attempt_submission_id
from .ray_resources import ensure_ray_connected, get_cluster_available
_INSUFFICIENT_RE = re.compile(r"Total available GPUs\s+\d+\s+is less than total desired GPUs\s+\d+")
def _utc_now_iso() -> str:
return datetime.utcnow().replace(microsecond=0).isoformat() + "Z"
def _utc_after_s(seconds: int) -> str:
return (datetime.utcnow() + timedelta(seconds=seconds)).replace(microsecond=0).isoformat() + "Z"
@dataclass
class Scheduler:
db: Db
ray_cfg: RayConfig
v2_cfg: V2Config
def __post_init__(self) -> None:
self.tool = RayJobTool(self.ray_cfg)
def _resources_sufficient(self, *, nnodes: int, n_gpus_per_node: int) -> bool:
avail = get_cluster_available()
required = float(nnodes * n_gpus_per_node)
return avail.total_available_gpus >= required
def _parse_jobspec(self, jobspec_yaml: str) -> JobSpec:
obj = yaml.safe_load(jobspec_yaml) or {}
if not isinstance(obj, dict):
raise ValueError("jobspec must be a YAML mapping")
return JobSpec.from_dict(obj)
def _submit_one(self, task_row: dict[str, Any]) -> None:
task_id = str(task_row["task_id"])
jobspec_yaml = str(task_row["jobspec_yaml"])
spec = self._parse_jobspec(jobspec_yaml)
attempt_no = int(task_row.get("latest_attempt_no", 0)) + 1
ray_sid = attempt_submission_id(task_id, attempt_no)
# Record attempt first so that we can surface it even if submit crashes.
self.db.create_attempt(task_id=task_id, attempt_no=attempt_no, ray_submission_id=ray_sid)
self.db.set_task_state(task_id=task_id, state="SUBMITTING", latest_attempt_no=attempt_no)
# Override submission_id in jobspec (v1.1 compatible)
d = spec.to_public_dict()
d["submission_id"] = ray_sid
spec2 = JobSpec.from_dict(d)
try:
submitted = self.tool.submit(spec2, no_wait=True)
# submitted should equal ray_sid; keep as source of truth.
self.db.update_attempt(task_id=task_id, attempt_no=attempt_no, ray_status="SUBMITTED")
self.db.set_task_state(task_id=task_id, state="SUBMITTED")
if submitted != ray_sid:
self.db.set_task_state(task_id=task_id, state="SUBMITTED", event_type="WARN_SUBMISSION_ID_MISMATCH")
except Exception as e:
msg = repr(e)
self.db.update_attempt(task_id=task_id, attempt_no=attempt_no, ray_status="FAILED", failure_kind="UNKNOWN", message=msg, end_time=_utc_now_iso())
self.db.set_task_state(task_id=task_id, state="FAILED", error_summary=msg)
def _sync_one_running(self, task_row: dict[str, Any]) -> None:
task_id = str(task_row["task_id"])
latest_attempt_no = int(task_row.get("latest_attempt_no", 0))
if latest_attempt_no <= 0:
return
# Look up ray_submission_id
attempts = self.db.list_attempts(task_id)
if not attempts:
return
ray_sid = str(attempts[-1]["ray_submission_id"])
try:
st = self.tool.status(ray_sid)
except Exception as e:
# Keep current state; transient failures should not flap tasks.
self.db.set_task_state(task_id=task_id, state=str(task_row["state"]), event_type="RAY_STATUS_ERROR", payload_json=repr(e))
return
st_s = str(st)
if st_s in ("PENDING", "RUNNING"):
self.db.update_attempt(task_id=task_id, attempt_no=latest_attempt_no, ray_status=st_s)
self.db.set_task_state(task_id=task_id, state=("RUNNING" if st_s == "RUNNING" else "SUBMITTED"))
return
if st_s in ("SUCCEEDED",):
self.db.update_attempt(task_id=task_id, attempt_no=latest_attempt_no, ray_status=st_s, end_time=_utc_now_iso())
self.db.set_task_state(task_id=task_id, state="SUCCEEDED")
return
if st_s in ("FAILED", "STOPPED"):
logs = ""
try:
logs = self.tool.logs(ray_sid)
except Exception:
logs = ""
failure_kind = "UNKNOWN"
msg = ""
if _INSUFFICIENT_RE.search(logs):
failure_kind = "INSUFFICIENT_RESOURCES"
msg = "Insufficient resources (verl fail-fast): " + (_INSUFFICIENT_RE.search(logs).group(0))
self.db.update_attempt(
task_id=task_id,
attempt_no=latest_attempt_no,
ray_status=st_s,
failure_kind=failure_kind,
message=msg,
end_time=_utc_now_iso(),
)
self.db.set_task_state(
task_id=task_id,
state="PENDING_RESOURCES",
error_summary=msg,
next_run_at=_utc_after_s(self.v2_cfg.scheduler.retry_interval_s),
event_type="RETRY_SCHEDULED",
)
return
msg = f"Ray job {st_s}"
self.db.update_attempt(
task_id=task_id,
attempt_no=latest_attempt_no,
ray_status=st_s,
failure_kind=failure_kind,
message=msg,
end_time=_utc_now_iso(),
)
self.db.set_task_state(task_id=task_id, state="FAILED", error_summary=msg)
    def tick(self) -> None:
        ensure_ray_connected()
        # Sync active tasks
        for row in self.db.list_active_tasks(limit=50):
            self._sync_one_running(row)
        # Submit new tasks if capacity allows
        if self.db.count_running() >= self.v2_cfg.scheduler.max_running_tasks:
            return
        row = self.db.pick_next_runnable_task()
        if not row:
            return
        nnodes = int(row["nnodes"])
        n_gpus_per_node = int(row["n_gpus_per_node"])
        if not self._resources_sufficient(nnodes=nnodes, n_gpus_per_node=n_gpus_per_node):
            self.db.set_task_state(
                task_id=str(row["task_id"]),
                state="PENDING_RESOURCES",
                next_run_at=_utc_after_s(self.v2_cfg.scheduler.retry_interval_s),
                event_type="PENDING_RESOURCES",
            )
            return
        self._submit_one(row)

    def run_forever(self, stop_flag: Any) -> None:
        while not stop_flag.is_set():
            try:
                self.tick()
            except Exception:
                # Best-effort: don't crash the scheduler loop
                pass
            time.sleep(max(1, int(self.v2_cfg.scheduler.tick_s)))
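
For reference, a minimal sketch of how `run_forever()` could be driven from the API process. It assumes a scheduler instance has already been constructed with its `db`, `tool`, and `v2_cfg` dependencies; the helper name and thread wiring below are illustrative, not part of this PR:

```python
import threading


def start_scheduler_thread(scheduler) -> tuple[threading.Thread, threading.Event]:
    """Run scheduler.run_forever() in a daemon thread; call stop.set() to shut it down."""
    stop = threading.Event()
    thread = threading.Thread(
        target=scheduler.run_forever,
        args=(stop,),
        name="mvp-v2-scheduler",
        daemon=True,
    )
    thread.start()
    return thread, stop


# Usage (illustrative):
#   thread, stop = start_scheduler_thread(scheduler)
#   ... serve API requests ...
#   stop.set()
#   thread.join(timeout=2 * scheduler.v2_cfg.scheduler.tick_s)
```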

View File

@ -0,0 +1,4 @@
fastapi==0.115.6
uvicorn==0.30.6
PyYAML==6.0.2

33
src/mvp/v2.0/py/server.py Normal file
View File

@ -0,0 +1,33 @@
#!/usr/bin/env python3
from __future__ import annotations

import argparse

import uvicorn

from mvp_v2.app import create_app
from mvp_v2.config import V2Config


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="Path to v1.1 RayConfig YAML (extended with v2:)")
    args = parser.parse_args()
    # Load app and read v2.api host/port from config.
    import yaml

    with open(args.config, "r", encoding="utf-8") as f:
        root = yaml.safe_load(f) or {}
    if not isinstance(root, dict):
        raise SystemExit("config yaml must be a mapping")
    v2 = V2Config.from_root_dict(root)
    app = create_app(args.config)
    uvicorn.run(app, host=v2.api.host, port=v2.api.port, log_level="info")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
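
For orientation, the sketch below shows the kind of `v2:` block the server and scheduler read on top of the v1.1 RayConfig YAML. The key names and nesting are inferred from the attributes referenced in this PR (`v2.api.host`/`port`, `v2.scheduler.tick_s`/`retry_interval_s`/`max_running_tasks`); the actual schema lives in `V2Config.from_root_dict`, and the concrete values here are placeholders:

```python
# Illustrative only: nesting/key names under "v2" are assumptions based on the
# attributes used by server.py and the scheduler, not a confirmed schema.
root = {
    # ...existing v1.1 RayConfig keys stay unchanged...
    "v2": {
        "api": {"host": "0.0.0.0", "port": 8100},  # read by server.py for uvicorn
        "scheduler": {
            "tick_s": 5,              # seconds between scheduler ticks
            "retry_interval_s": 60,   # delay before retrying PENDING_RESOURCES tasks
            "max_running_tasks": 2,   # cap on concurrently running Ray jobs
        },
    },
}
# v2 = V2Config.from_root_dict(root)  # parsed the same way server.py does
```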

View File

@ -0,0 +1,10 @@
#!/usr/bin/env bash
set -euo pipefail
# Install v2.0 API dependencies inside the head container (best-effort).
# Assumes v1.1 containers are already up and v2.0 code is mounted/available.
HEAD_CONTAINER="${HEAD_CONTAINER:-mvp11-ray-head}"
docker exec -i "${HEAD_CONTAINER}" bash -lc "python3 -m pip install -U pip >/dev/null 2>&1 || true"
docker exec -i "${HEAD_CONTAINER}" bash -lc "python3 -m pip install -r /workspace/mvp/v2/py/requirements.txt"

View File

@ -0,0 +1,28 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
CONFIG_IN_CONTAINER="${CONFIG_IN_CONTAINER:-/workspace/mvp/v1.1/py/configs/dev.yaml}"
LOG_PATH="${LOG_PATH:-/private/common/logs/mvp_v2_api.log}"
PID_PATH="${PID_PATH:-/private/common/run/mvp_v2_api.pid}"
echo "[host] starting mvp v2 api in head container: ${HEAD_CONTAINER}"
dexec bash -lc "mkdir -p \"$(dirname "${LOG_PATH}")\" \"$(dirname "${PID_PATH}")\""
# Note: requires the v2.0 code (src/mvp/v2.0/py) to be mounted or copied into the container at /workspace/mvp/v2/py.
# Escape $ so that the command substitution happens in the container, not on the host.
# Capture the container-side check on the host; an exit inside dexec cannot stop this host script.
already="$(dexec bash -lc "if test -f '${PID_PATH}'; then pid=\$(cat '${PID_PATH}'); if kill -0 \"\${pid}\" >/dev/null 2>&1; then echo 'already_running'; fi; fi")"
if [[ "${already}" == "already_running" ]]; then
  echo "[host] mvp v2 api already running (pid file: ${PID_PATH})"
  exit 0
fi
if [[ -z "${MVP_INTERNAL_TOKEN:-}" ]]; then
echo "ERROR: MVP_INTERNAL_TOKEN env var must be set on host (will be passed into container)" >&2
exit 1
fi
docker exec -d -e MVP_INTERNAL_TOKEN="${MVP_INTERNAL_TOKEN}" "${HEAD_CONTAINER}" bash -lc "nohup python3 /workspace/mvp/v2/py/server.py --config '${CONFIG_IN_CONTAINER}' >>'${LOG_PATH}' 2>&1 & echo \$! >'${PID_PATH}'"
echo "[host] started; pid stored in ${PID_PATH} (container path)"
echo "[host] logs: ${LOG_PATH} (container path)"

View File

@ -0,0 +1,12 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
PID_PATH="${PID_PATH:-/private/common/run/mvp_v2_api.pid}"
echo "[host] stopping mvp v2 api (pid file: ${PID_PATH})"
dexec bash -lc "if ! test -f '${PID_PATH}'; then echo 'not_running'; exit 0; fi; pid=\"\$(cat '${PID_PATH}')\"; if kill -0 \"\${pid}\" >/dev/null 2>&1; then kill \"\${pid}\"; fi; rm -f '${PID_PATH}'; echo 'stopped'"

View File

@ -0,0 +1,10 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
PID_PATH="${PID_PATH:-/private/common/run/mvp_v2_api.pid}"
dexec bash -lc "if ! test -f '${PID_PATH}'; then echo 'not_running'; exit 0; fi; pid=\"\$(cat '${PID_PATH}')\"; if kill -0 \"\${pid}\" >/dev/null 2>&1; then echo \"running pid=\${pid}\"; else echo \"stale pid=\${pid}\"; fi"

12
src/mvp/v2.0/scripts/lib.sh Executable file
View File

@ -0,0 +1,12 @@
#!/usr/bin/env bash
set -euo pipefail
# v2.0 scripts are intended to run on the host and control the existing Ray containers
# (same topology as v1.1). Adjust container names via env vars if needed.
HEAD_CONTAINER="${HEAD_CONTAINER:-mvp11-ray-head}"
dexec() {
docker exec -i "${HEAD_CONTAINER}" "$@"
}