
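MVP v2.0 (Service Entry Point)

v2.0 adds a service layer on top of v1.1 (the Ray Jobs SDK submission path). It provides:

  • Task submission over an HTTP API (PPO/GRPO/SFT)
  • A service-side queue with gang resource checks
  • Detection of verl fail-fast failures caused by insufficient resources, with automatic retry
  • SQLite persistence of the queue, task state, and attempts (on NFS: /private)

Design documents: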
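
The gang check in the second bullet can be pictured as an all-or-nothing fit test: a task is dispatched only when every node it asks for can be satisfied at the same moment. The following is a minimal sketch, assuming the service knows the number of free GPUs on each worker node; `fits_gang` and its inputs are hypothetical names chosen for illustration, not the service's actual code.

```python
from typing import List


def fits_gang(free_gpus_per_node: List[int], nnodes: int, n_gpus_per_node: int) -> bool:
    """Return True only if `nnodes` distinct nodes each have at least
    `n_gpus_per_node` free GPUs right now (all-or-nothing placement).

    Hypothetical helper: the per-node accounting is an assumption.
    """
    eligible = sum(1 for free in free_gpus_per_node if free >= n_gpus_per_node)
    return eligible >= nnodes


# Two workers with 4 free GPUs each can host a 2-node x 4-GPU job.
print(fits_gang([4, 4], nnodes=2, n_gpus_per_node=4))
# If one worker is partly busy, the whole job waits; nothing is placed partially.
print(fits_gang([4, 2], nnodes=2, n_gpus_per_node=4))
```

Keeping the task queued until the whole gang fits avoids submitting a job that would only fail fast on the Ray side for lack of resources.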

  • specs/mvp/v2.0/v2_plan.md
  • specs/mvp/v2.0/v2_api.md

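Quick Start (dev example)

Conventions:

  • The Ray containers are still started by v1.1's docker-compose.yaml (head + 2 workers)
  • The v2 code lives on the host at /home2/argus/infra/mvp/v2/ (mounted inside the container at /workspace/mvp/v2/)
  • v2 reuses the v1.1 config: /workspace/mvp/v1.1/py/configs/dev.yaml (extended with a v2: section)

Run on the host: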

export MVP_INTERNAL_TOKEN=...    # internal token
cd /home2/argus/infra/mvp/v2/scripts
./12_install_v2_deps.sh
./20_start_api.sh
./22_status_api.sh

API test (from the host):

curl -H "Authorization: Bearer ${MVP_INTERNAL_TOKEN}" http://127.0.0.1:8080/api/v2/queue

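Process logs and pid files default to /private/common/logs/ and /private/common/run/ (paths inside the container).

Submitting, Querying, and Stopping Tasks

Conventions:

  • API address (as seen from the host): http://127.0.0.1:8080
  • Authentication: Authorization: Bearer ${MVP_INTERNAL_TOKEN}
  • Request body: raw YAML (a JobSpec, in the same format as the v1.1 jobspec)

1) Submit a task: POST /api/v2/tasks

Prepare a jobspec (example: PPO):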

workload: "ppo"
submission_id: ""              # ignored/overwritten by v2, which auto-generates a task_id and derives ray_submission_id from it
code_path: "/private/common/code/verl/verl_repo"
model_id: "Qwen/Qwen2.5-0.5B-Instruct"
train_file: "/private/datasets/gsm8k/train.parquet"
val_file: "/private/datasets/gsm8k/test.parquet"
nnodes: 2
n_gpus_per_node: 4
total_epochs: 1
total_training_steps: 10
save_freq: 10
test_freq: -1
trainer_device: null

Submit:

curl -sS \
  -H "Authorization: Bearer ${MVP_INTERNAL_TOKEN}" \
  -H "Content-Type: application/yaml" \
  --data-binary @jobspec.yaml \
  http://127.0.0.1:8080/api/v2/tasks
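
The same submission can be made from Python using only the standard library. This is a sketch under the conventions listed earlier (host-side address, bearer token, raw YAML body); `build_submit_request` and `submit_task` are hypothetical helper names, not part of the v2 codebase.

```python
import urllib.request

API_BASE = "http://127.0.0.1:8080"  # host-side API address from the conventions


def build_submit_request(jobspec_yaml: bytes, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST /api/v2/tasks request."""
    return urllib.request.Request(
        url=f"{API_BASE}/api/v2/tasks",
        data=jobspec_yaml,  # raw YAML body, sent unmodified
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/yaml",
        },
    )


def submit_task(jobspec_path: str, token: str) -> str:
    """Read a jobspec file, submit it, and return the raw response body."""
    with open(jobspec_path, "rb") as f:
        req = build_submit_request(f.read(), token)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

With the API running, `submit_task("jobspec.yaml", os.environ["MVP_INTERNAL_TOKEN"])` (after `import os`) sends the same request as the curl command.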

Example response:

{"task_id":"mvp2-ppo-20251223-082813-6426","state":"QUEUED"}

2) Query a task: GET /api/v2/tasks/{task_id}

curl -sS \
  -H "Authorization: Bearer ${MVP_INTERNAL_TOKEN}" \
  http://127.0.0.1:8080/api/v2/tasks/<task_id> | python3 -m json.tool

Optional:

  • List attempts: GET /api/v2/tasks/{task_id}/attempts
  • Fetch logs (latest attempt): GET /api/v2/tasks/{task_id}/logs?tail=2000
  • View the queue: GET /api/v2/queue

3) Stop/cancel a task: POST /api/v2/tasks/{task_id}:cancel
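
For scripted polling, the per-task endpoints above can be assembled from one small helper. This is a sketch only: `task_url` and `fetch` are hypothetical names, and the response format of each endpoint is not assumed here, so `fetch` returns raw bytes.

```python
import urllib.request
from urllib.parse import quote, urlencode

API_BASE = "http://127.0.0.1:8080"  # host-side API address


def task_url(task_id: str, suffix: str = "", **query) -> str:
    """Build a /api/v2/tasks/{task_id} URL with an optional sub-path and query."""
    path = f"/api/v2/tasks/{quote(task_id, safe='')}{suffix}"
    qs = f"?{urlencode(query)}" if query else ""
    return f"{API_BASE}{path}{qs}"


def fetch(url: str, token: str) -> bytes:
    """GET a URL with the bearer token and return the raw response body."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()


task_id = "mvp2-ppo-20251223-082813-6426"  # task_id from the example response
print(task_url(task_id))                     # task status
print(task_url(task_id, "/attempts"))        # attempts
print(task_url(task_id, "/logs", tail=2000)) # logs of the latest attempt
```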

curl -sS -X POST \
  -H "Authorization: Bearer ${MVP_INTERNAL_TOKEN}" \
  http://127.0.0.1:8080/api/v2/tasks/<task_id>:cancel

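Notes:

  • If the task has already been submitted to Ray (SUBMITTED/RUNNING), the service calls the Ray Jobs SDK stop_job(ray_submission_id)
  • If the task is still in the queue (QUEUED/PENDING_RESOURCES), the service marks it CANCELED directly (no attempt is created)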
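
The two cancel paths amount to a dispatch on the task's current state. The sketch below illustrates that rule; `cancel_action` and its return labels are hypothetical, and the handling of any state not named in this README is an assumption.

```python
# States taken from the notes above.
ON_RAY_STATES = {"SUBMITTED", "RUNNING"}
QUEUED_STATES = {"QUEUED", "PENDING_RESOURCES"}


def cancel_action(state: str) -> str:
    """Return which cancel path applies for a task in `state`."""
    if state in ON_RAY_STATES:
        # The job exists on the Ray side, so it has to be stopped there.
        return "stop_ray_job"
    if state in QUEUED_STATES:
        # Nothing was submitted yet: only the stored state changes.
        return "mark_canceled"
    # Assumption: any other state (for example, already finished) needs no action.
    return "noop"


for s in ("QUEUED", "PENDING_RESOURCES", "SUBMITTED", "RUNNING"):
    print(s, "->", cancel_action(s))
```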