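# MVP v2.0 API Design (Minimum Viable)

The goal of the v2.0 API is to turn v1.1's script-based submission into service-based submission, with queuing, retries, and status aggregation implemented on the service side.

Constraints:

- Internal token authentication (keep it simple).
- Ray Jobs must be submitted through the **Ray Python SDK** (`JobSubmissionClient`), not through hand-written HTTP calls with `requests`.
- Outputs and state must be persisted to NFS (`/private` inside the container).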
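
To make the SDK constraint concrete, here is a minimal sketch of a submission wrapper. The function names `submit_via_sdk` and `make_client` and the address `http://127.0.0.1:8265` are assumptions for illustration; only `JobSubmissionClient` and its `submit_job` call come from the Ray SDK.

```python
def submit_via_sdk(client, entrypoint: str, submission_id: str) -> str:
    """Submit one Ray Job through an SDK client and return its submission ID.

    `client` is expected to be a ray.job_submission.JobSubmissionClient
    (or any object exposing the same `submit_job` keyword interface).
    """
    return client.submit_job(entrypoint=entrypoint, submission_id=submission_id)


def make_client(address: str = "http://127.0.0.1:8265"):
    """Create the real SDK client; the address is an assumed dev default."""
    # Imported lazily so this module can be loaded where Ray is not installed.
    from ray.job_submission import JobSubmissionClient

    return JobSubmissionClient(address)
```

Passing the client in as an argument keeps the service code testable with a stand-in object while the production path still goes through the SDK rather than raw HTTP.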

---

## 1. Authentication

- Header: `Authorization: Bearer <INTERNAL_TOKEN>`
- v2.0 has no user system or permission isolation; the token only guards against accidental misuse.
- Suggested configuration: reuse `src/mvp/v1.1/py/configs/dev.yaml` and name the environment variable that holds the token in `v2.auth.token_env`.
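
The check described above could look roughly like the following sketch. The function name `is_authorized` and the environment variable name `MVP_INTERNAL_TOKEN` are hypothetical; in practice the variable name would be read from `v2.auth.token_env`.

```python
import hmac
import os
from typing import Mapping


def is_authorized(headers: Mapping[str, str], token_env: str = "MVP_INTERNAL_TOKEN") -> bool:
    """Return True if the Authorization header carries the expected bearer token.

    Assumes the web framework has already normalized the header name to
    "Authorization"; the token_env default is a made-up example name.
    """
    expected = os.environ.get(token_env, "")
    if not expected:
        # Refuse all requests when no token is configured.
        return False
    scheme, _, presented = headers.get("Authorization", "").partition(" ")
    if scheme != "Bearer":
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented.encode(), expected.encode())
```

A request that fails this check would map to the `401` response listed in the error codes.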

## 1.1 Where the Service Runs (dev example)

- The service process runs in the **Ray head container**, which makes the Ray Job server easy to reach.
- The host controls it with scripts (via `docker exec`):
  - `src/mvp/v2.0/scripts/20_start_api.sh`
  - `src/mvp/v2.0/scripts/21_stop_api.sh`
  - `src/mvp/v2.0/scripts/22_status_api.sh`
- Remote directory convention (example): `argus@h1:/home2/argus/infra/mvp/v2/`, mounted inside the container at `/workspace/mvp/v2/`.

---

## 2. Resource and ID Conventions

### 2.1 task_id (primary ID at the service layer)

- Suggested format: `mvp2-<workload>-<YYYYMMDD>-<HHMMSS>-<suffix>`
- Example: `mvp2-ppo-20251223-143201-7f3a`

### 2.2 ray_submission_id (attempt-level ID)

- Derived by the service: `<task_id>--a<NN>`
- Example: `mvp2-ppo-20251223-143201-7f3a--a01`

Benefits:

- The Ray submission ID embeds the task_id, so a job seen in the Ray dashboard can be traced straight back to its service-side task.
- The `/private/jobs/<ray_submission_id>/...` directories are naturally isolated and human-readable.
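
The two ID formats can be generated and reversed with a few lines of Python. This is a sketch only: the helper names and the choice of a 4-character random hex suffix are assumptions based on the example IDs above.

```python
import secrets
from datetime import datetime
from typing import Optional


def new_task_id(workload: str, now: Optional[datetime] = None, suffix: Optional[str] = None) -> str:
    """Build mvp2-<workload>-<YYYYMMDD>-<HHMMSS>-<suffix>."""
    now = now or datetime.now()
    suffix = suffix or secrets.token_hex(2)  # 4 hex characters (assumed length)
    return f"mvp2-{workload}-{now:%Y%m%d}-{now:%H%M%S}-{suffix}"


def ray_submission_id(task_id: str, attempt_no: int) -> str:
    """Derive the attempt-level ID <task_id>--a<NN>."""
    return f"{task_id}--a{attempt_no:02d}"


def task_id_of(submission_id: str) -> str:
    """Recover the task_id from a ray_submission_id."""
    return submission_id.rpartition("--a")[0]
```

For instance, `ray_submission_id("mvp2-ppo-20251223-143201-7f3a", 1)` yields `mvp2-ppo-20251223-143201-7f3a--a01`, and `task_id_of` maps it back.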

---

## 3. JobSpec (request body)

v2.0 **requires the JobSpec to use the same YAML as v1.1** (same fields, same semantics). The server receives the YAML text, parses it, and stores it in the database, keeping the original text as `jobspec_yaml` for auditing and reproduction.

Minimal fields (example YAML):

```yaml
workload: "ppo"
submission_id: ""  # ignored/overwritten by the v2.0 server (ray_submission_id is derived from task_id)
code_path: "/private/common/code/verl/verl_repo"
model_id: "Qwen/Qwen2.5-0.5B-Instruct"
train_file: "/private/datasets/gsm8k/train.parquet"
val_file: "/private/datasets/gsm8k/test.parquet"
nnodes: 2
n_gpus_per_node: 4
total_epochs: 1
total_training_steps: 10
save_freq: 10
test_freq: -1
trainer_device: null  # used only by sft (usually "cpu")
```

Notes:

- `trainer_device` takes effect only for `sft` (usually `cpu`, to avoid problems when the driver has no GPU).
- `val_file` may be `null` (for example, for SFT).
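
A server-side check of the parsed JobSpec might look like the sketch below. It assumes the YAML has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`), and the set of required keys in `REQUIRED` is a guess for illustration, not something the v1.1 schema confirms.

```python
from typing import Any, Dict, List

# Assumed required keys; the authoritative list lives in the v1.1 jobspec schema.
REQUIRED = ("workload", "code_path", "model_id", "train_file", "nnodes", "n_gpus_per_node")


def validate_jobspec(spec: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for key in REQUIRED:
        if spec.get(key) in (None, ""):
            errors.append(f"missing field: {key}")
    for key in ("nnodes", "n_gpus_per_node"):
        value = spec.get(key)
        if value is None:
            continue  # already reported as missing
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            errors.append(f"{key} must be a positive integer")
    # val_file may be null and submission_id is overwritten, so neither is checked.
    return errors
```

A non-empty result would translate into a `400` response for the submit endpoint.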

---

## 4. API Endpoints

### 4.1 Submit a Task

`POST /api/v2/tasks`

Request body:

- **raw JobSpec YAML** (same structure as the v1.1 jobspec YAML)

Headers:

- `Content-Type: application/yaml` (or `text/yaml`)

Response:

```json
{
  "task_id": "mvp2-ppo-20251223-143201-7f3a",
  "state": "QUEUED"
}
```
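
From the client side, the submit call can be assembled with the standard library alone. The sketch below only builds the request; the helper name `build_submit_request` and any base URL passed to it are assumptions.

```python
import urllib.request


def build_submit_request(base_url: str, token: str, jobspec_yaml: str) -> urllib.request.Request:
    """Build the POST /api/v2/tasks request carrying raw JobSpec YAML."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v2/tasks",
        data=jobspec_yaml.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/yaml",
        },
    )
```

Sending it is then a matter of `urllib.request.urlopen(req)` and decoding the JSON body, which should contain the new `task_id` and the initial `QUEUED` state.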

### 4.2 Query a Task (aggregated state)

`GET /api/v2/tasks/{task_id}`

Response (example):

```json
{
  "task_id": "mvp2-ppo-20251223-143201-7f3a",
  "workload": "ppo",
  "state": "RUNNING",
  "desired_resources": {"nnodes": 2, "n_gpus_per_node": 4, "total_gpus": 8},
  "latest_attempt": {
    "attempt_no": 1,
    "ray_submission_id": "mvp2-ppo-20251223-143201-7f3a--a01",
    "ray_status": "RUNNING",
    "start_time": "2025-12-23T14:32:10+08:00"
  },
  "error_summary": null
}
```
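
One way to assemble this aggregated view from stored records is sketched below. The function `build_task_view` and the shape of its inputs are hypothetical, and it assumes `total_gpus` is simply `nnodes * n_gpus_per_node`, which matches the example (2 × 4 = 8).

```python
from typing import Any, Dict, List


def build_task_view(task: Dict[str, Any], attempts: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Combine a task record with its attempts into the query response."""
    # The latest attempt is the one with the highest attempt_no, if any exist.
    latest = max(attempts, key=lambda a: a["attempt_no"], default=None)
    nnodes = task["nnodes"]
    per_node = task["n_gpus_per_node"]
    return {
        "task_id": task["task_id"],
        "workload": task["workload"],
        "state": task["state"],
        "desired_resources": {
            "nnodes": nnodes,
            "n_gpus_per_node": per_node,
            "total_gpus": nnodes * per_node,
        },
        "latest_attempt": latest,
        "error_summary": task.get("error_summary"),
    }
```

A task that has never been submitted simply reports `"latest_attempt": null`.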

### 4.3 List Attempts

`GET /api/v2/tasks/{task_id}/attempts`

Response:

```json
{
  "task_id": "mvp2-ppo-20251223-143201-7f3a",
  "attempts": [
    {
      "attempt_no": 1,
      "ray_submission_id": "mvp2-ppo-20251223-143201-7f3a--a01",
      "ray_status": "FAILED",
      "failure_kind": "INSUFFICIENT_RESOURCES",
      "message": "Total available GPUs 0 is less than total desired GPUs 8",
      "start_time": "...",
      "end_time": "..."
    }
  ]
}
```
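
The `failure_kind` field implies some classification of the failure message. A minimal sketch, assuming classification is done by pattern-matching the message text shown in the example; the function name and the `UNKNOWN` fallback value are placeholders, not part of the design.

```python
import re

# Matches the example message from the response above.
_INSUFFICIENT_GPUS = re.compile(
    r"Total available GPUs (\d+) is less than total desired GPUs (\d+)"
)


def classify_failure(message: str) -> str:
    """Map a failure message to a failure_kind label."""
    if _INSUFFICIENT_GPUS.search(message or ""):
        return "INSUFFICIENT_RESOURCES"
    return "UNKNOWN"  # placeholder for anything not recognized
```

Separating resource shortages from other failures is what lets the scheduler decide whether a retry is worthwhile.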

### 4.4 Cancel a Task

`POST /api/v2/tasks/{task_id}:cancel`

Behavior:

- If the task is in `SUBMITTED/RUNNING`: call `stop_job(ray_submission_id)` from the Ray Jobs SDK and mark the task `CANCELED`.
- If the task is in `QUEUED/PENDING_RESOURCES`: mark it `CANCELED` directly (it is never submitted).

Response:

```json
{"task_id":"...","state":"CANCELED"}
```
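
The two cancel branches can be sketched as a small state transition. `cancel_task` and `StateConflict` are hypothetical names; `stop_job` is passed in as a callable so that the real `JobSubmissionClient.stop_job` can be supplied in production.

```python
from typing import Any, Callable, Dict


class StateConflict(Exception):
    """Raised when a task cannot be cancelled from its current state (HTTP 409)."""


def cancel_task(task: Dict[str, Any], stop_job: Callable[[str], Any]) -> Dict[str, str]:
    """Cancel a task in place and return the response body."""
    state = task["state"]
    if state in ("SUBMITTED", "RUNNING"):
        # The job exists in Ray, so ask Ray to stop it first.
        stop_job(task["ray_submission_id"])
    elif state not in ("QUEUED", "PENDING_RESOURCES"):
        # Any other state is treated as terminal here.
        raise StateConflict(f"cannot cancel task in state {state}")
    task["state"] = "CANCELED"
    return {"task_id": task["task_id"], "state": "CANCELED"}
```

Raising a dedicated exception keeps the mapping to the `409` status code in one place in the HTTP layer.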

### 4.5 Fetch Logs

`GET /api/v2/tasks/{task_id}/logs?attempt=latest&tail=2000`

Returns:

- `text/plain` (the tail of the Ray Job logs, passed through as-is)

Notes:

- v2.0 starts with the Ray SDK's `get_job_logs()`.
- If more durable archiving is needed, the scheduler can fetch logs periodically and persist them to disk (v2.1+).
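
Since `get_job_logs()` returns the log text as a single string, the `tail` query parameter has to be applied by the service. A minimal sketch of that trimming step, with `tail_lines` as a hypothetical helper name:

```python
def tail_lines(text: str, tail: int) -> str:
    """Return the last `tail` lines of `text`, preserving line endings."""
    if tail <= 0:
        return ""
    lines = text.splitlines(keepends=True)
    return "".join(lines[-tail:])
```

With the real client this would be used along the lines of `tail_lines(client.get_job_logs(ray_submission_id), 2000)`.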

### 4.6 List the Queue (ops/debugging)

`GET /api/v2/queue`

Response:

```json
{
  "pending": [{"task_id":"...","state":"PENDING_RESOURCES","next_run_at":"..."}],
  "running": [{"task_id":"...","ray_submission_id":"..."}]
}
```
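
---

## 5. Error Codes (minimal)

- `400`: the jobspec is missing fields or is invalid
- `401`: the token is incorrect
- `404`: the task does not exist
- `409`: state conflict (for example, cancelling a task that is already in a terminal state)
- `500`: internal server error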

---

## 6. SQLite Persistence (API visibility)

The v2.0 server persists the following in SQLite:

- tasks (`task_id`, `state`, `jobspec_yaml`, `next_run_at`, `latest_attempt_no`, and so on)
- attempts (`ray_submission_id`, `ray_status`, failure reason, and so on)

As a result:

- The data returned by `GET /api/v2/tasks/{task_id}` comes from SQLite, overlaid with the results of Ray status synchronization.
- After a process restart the queue can be recovered, and tasks in `PENDING_RESOURCES` resume their submission attempts once `next_run_at` is due.
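
A possible shape for the two tables and the restart-recovery query is sketched below. Column types, the extra `attempt_no`, `failure_kind`, and `message` columns, and the helper names are assumptions; the sketch also assumes `next_run_at` is stored as an ISO 8601 string in a single timezone so that string comparison orders correctly.

```python
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    task_id           TEXT PRIMARY KEY,
    state             TEXT NOT NULL,
    jobspec_yaml      TEXT NOT NULL,
    next_run_at       TEXT,
    latest_attempt_no INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS attempts (
    ray_submission_id TEXT PRIMARY KEY,
    task_id           TEXT NOT NULL REFERENCES tasks(task_id),
    attempt_no        INTEGER NOT NULL,
    ray_status        TEXT,
    failure_kind      TEXT,
    message           TEXT
);
"""


def open_db(path: str) -> sqlite3.Connection:
    """Open the database and create the tables if they do not exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def due_tasks(conn: sqlite3.Connection, now_iso: str) -> List[str]:
    """Return task_ids waiting for resources whose next_run_at has passed."""
    rows = conn.execute(
        "SELECT task_id FROM tasks "
        "WHERE state = 'PENDING_RESOURCES' AND next_run_at <= ? "
        "ORDER BY next_run_at",
        (now_iso,),
    )
    return [row[0] for row in rows]
```

After a restart, the scheduler would call something like `due_tasks` on each tick to pick up where it left off.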