# MVP v2.0 (Service Entry Point)

v2.0 adds a **service layer** on top of v1.1 (the Ray Jobs SDK submission path):

- Task submission over an HTTP API (PPO/GRPO/SFT)
- A service-side queue with gang-style resource checks
- Detection of `verl` fail-fast failures caused by insufficient resources, with automatic retry
- SQLite persistence for the queue, task state, and attempts (on NFS: `/private`)

Design documents:

- `specs/mvp/v2.0/v2_plan.md`
- `specs/mvp/v2.0/v2_api.md`

## Quick start (dev example)

Conventions:

- The Ray containers are still started by the v1.1 `docker-compose.yaml` (head + 2 workers)
- The v2 code lives on the host at `/home2/argus/infra/mvp/v2/` (mounted in the container at `/workspace/mvp/v2/`)
- The v2 configuration reuses the v1.1 file `/workspace/mvp/v1.1/py/configs/dev.yaml` (extended with a `v2:` section)

Run on the host:

```bash
export MVP_INTERNAL_TOKEN=... # internal token
cd /home2/argus/infra/mvp/v2/scripts
./12_install_v2_deps.sh
./20_start_api.sh
./22_status_api.sh
```

Test the API (from the host):

```bash
curl -H "Authorization: Bearer ${MVP_INTERNAL_TOKEN}" http://127.0.0.1:8080/api/v2/queue
```

> Process logs and pids (container paths) default to `/private/common/logs/` and `/private/common/run/`.

## Submitting, querying, and stopping tasks

Conventions:

- API address (as seen from the host): `http://127.0.0.1:8080`
- Authentication: `Authorization: Bearer ${MVP_INTERNAL_TOKEN}`
- Request body: **raw YAML** (a JobSpec, in the same format as the v1.1 jobspec)

### 1) Submit a task (POST /api/v2/tasks)

Prepare a jobspec (example: PPO):

```yaml
workload: "ppo"
submission_id: "" # ignored/overwritten by v2, which generates a task_id and derives ray_submission_id from it
code_path: "/private/common/code/verl/verl_repo"
model_id: "Qwen/Qwen2.5-0.5B-Instruct"
train_file: "/private/datasets/gsm8k/train.parquet"
val_file: "/private/datasets/gsm8k/test.parquet"
nnodes: 2
n_gpus_per_node: 4
total_epochs: 1
total_training_steps: 10
save_freq: 10
test_freq: -1
trainer_device: null
```

Submit it:

```bash
curl -sS \
  -H "Authorization: Bearer ${MVP_INTERNAL_TOKEN}" \
  -H "Content-Type: application/yaml" \
  --data-binary @jobspec.yaml \
  http://127.0.0.1:8080/api/v2/tasks
```

Example response:

```json
{"task_id":"mvp2-ppo-20251223-082813-6426","state":"QUEUED"}
```

### 2) Query a task (GET /api/v2/tasks/{task_id})

```bash
# Replace <task_id> with the task_id returned at submission.
curl -sS \
  -H "Authorization: Bearer ${MVP_INTERNAL_TOKEN}" \
  "http://127.0.0.1:8080/api/v2/tasks/<task_id>" | python3 -m json.tool
```

Optional endpoints:

- List attempts: `GET /api/v2/tasks/{task_id}/attempts`
- Fetch logs (latest attempt): `GET /api/v2/tasks/{task_id}/logs?tail=2000`
- View the queue: `GET /api/v2/queue`

### 3) Stop/cancel a task (POST /api/v2/tasks/{task_id}:cancel)

```bash
# Replace <task_id> with the task_id returned at submission.
curl -sS -X POST \
  -H "Authorization: Bearer ${MVP_INTERNAL_TOKEN}" \
  "http://127.0.0.1:8080/api/v2/tasks/<task_id>:cancel"
```

Notes:

- If the task has already been submitted to Ray (`SUBMITTED/RUNNING`), the service calls the Ray Jobs SDK `stop_job(ray_submission_id)`.
- If the task is still in the queue (`QUEUED/PENDING_RESOURCES`), the service marks it `CANCELED` directly (no attempt is created).
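
The two cancel cases can be illustrated with a minimal Python sketch. This is not the service's actual code: the `Task` record, the `cancel_task` function, and its return values are hypothetical names introduced for illustration, and the Ray call is passed in as a plain callable so the example runs without a cluster. The README does not say which state a task moves to after `stop_job`, so the sketch leaves that state untouched.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Only the states named in this README; the service may define others.
ON_RAY_STATES = {"SUBMITTED", "RUNNING"}
QUEUED_STATES = {"QUEUED", "PENDING_RESOURCES"}


@dataclass
class Task:
    """Hypothetical task record, not the service's real schema."""
    task_id: str
    state: str
    ray_submission_id: Optional[str] = None


def cancel_task(task: Task, stop_job: Callable[[str], bool]) -> str:
    """Apply the cancel rule and return a label for the action taken."""
    if task.state in ON_RAY_STATES:
        if task.ray_submission_id is None:
            raise ValueError(f"task {task.task_id} has no ray_submission_id")
        # Already on Ray: delegate to the Ray Jobs SDK stop call.
        # The follow-up state transition is not specified in the README.
        stop_job(task.ray_submission_id)
        return "stopped_on_ray"
    if task.state in QUEUED_STATES:
        # Still queued: mark CANCELED directly; no attempt is created.
        task.state = "CANCELED"
        return "canceled_in_queue"
    # Any other state (assumed here to be terminal): nothing to do.
    return "noop"
```

In a real deployment the callable could be the `stop_job` method of a `ray.job_submission.JobSubmissionClient` instance. Passing it in as a parameter keeps the branching logic testable on its own.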