argus-cluster/specs/mvp/v3.8/v3.8_api.md

# MVP v3.8 API Reference (Serving)

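> Note: this document covers the **Model Serving** API added in v3.8 (Ray Serve LLM / vLLM).
> Authentication: the Serving management API reuses the existing MVP API authentication scheme (`Authorization: Bearer <user_token>`).
> Inference: the public OpenAI endpoint is **not authenticated** (v3.8 convention).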

## 0. Basics

### 0.1 Base URLs

- MVP API server: `http://<host>:8080`
- Ray Serve OpenAI ingress (fixed port 8000): `http://<host>:8000/v1`

### 0.2 Authentication

All `/api/v2/serve/*` endpoints require:

```
Authorization: Bearer <user_token>
```

`user_token` is issued by an administrator via `/api/v2/users/<user_id>/tokens` (the existing mechanism).

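### 0.3 Naming rule: `model_id = user_id-YYYYMMDDHHMM-<suffix>`

- On submission, the user supplies `model_id`, which is treated as the suffix (for example `qwen-0.5b`).
- The platform generates the prefix:
  - `prefix = "<user_id>-<YYYYMMDDHHMM>"`
- The OpenAI model name that the platform actually exposes is:
  - `model_id = "<prefix>-<suffix>"`
  - Example: `alice-202601061235-qwen-0.5b`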
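The naming rule can be sketched as a small helper. This is only an illustration: the function name `build_model_id` is hypothetical, and the time zone used for the timestamp is an assumption, since this document does not say which one the platform uses.

```python
from datetime import datetime, timezone


def build_model_id(user_id: str, suffix: str, now: datetime) -> str:
    """Return "<user_id>-<YYYYMMDDHHMM>-<suffix>" as described in section 0.3."""
    # prefix = "<user_id>-<YYYYMMDDHHMM>"
    prefix = f"{user_id}-{now.strftime('%Y%m%d%H%M')}"
    # exposed model_id = "<prefix>-<suffix>"
    return f"{prefix}-{suffix}"


# Reproduces the sample name from this section (UTC is an assumption).
created = datetime(2026, 1, 6, 12, 35, tzinfo=timezone.utc)
print(build_model_id("alice", "qwen-0.5b", created))
# prints: alice-202601061235-qwen-0.5b
```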

## 1. Data structures

### 1.1 ServingSpec (YAML)

YAML is the recommended request body format (consistent with TaskSpec). Example:

```yaml
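model_id: qwen-0.5b                      # required: suffix (the platform prepends "<user_id>-<YYYYMMDDHHMM>-")
model_source: $HOME/common/hf/.../<sha>  # required: local path or repo id; the platform expands $HOME macros and validates the path
num_replicas: 1                          # optional, default 1
gpus_per_replica: 1                      # optional, default 1
# engine_kwargs:                         # optional: passed through to vLLM (allowlist/denylist is implementation-defined)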
#   max_model_len: 8192
#   gpu_memory_utilization: 0.9
```

Notes:
- `accelerator_type` is not exposed in ServingSpec; the platform configuration (`serving.llm.accelerator_type` in `dev.yaml`) injects it uniformly into `LLMConfig.accelerator_type` of Ray Serve LLM (dev/h1: `H20`).

#### Macro substitution

- `$HOME` → `/private/users/<user_id>`
- `$HOME/common/hf` → `/private/hf`
- `$HOME/common/datasets` → `/private/datasets` (serving does not strictly depend on it, but the semantics are kept consistent)

#### Path validation (v3.8 convention)

`model_source` is allowed under:

- `/private/hf/...` (common)
- `/private/users/<user_id>/...` (user)

Rejected:

- other users' directories
- paths outside `/private`
- empty paths, or suspicious paths containing `..`
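The macro substitution and path rules above could be combined roughly as follows. This is a sketch under stated assumptions, not the platform's implementation: `resolve_model_source` is a hypothetical name, the choice of exception types is illustrative, and repo-id values of `model_source` are not handled here. The sketch applies the more specific `$HOME/common/...` macros before plain `$HOME`, which the mappings above require in order to produce `/private/hf`.

```python
from pathlib import PurePosixPath


def resolve_model_source(raw: str, user_id: str) -> str:
    """Expand $HOME macros, then check the result against the allowed roots."""
    # Most specific macro first, so "$HOME/common/hf" is not consumed by "$HOME".
    macros = [
        ("$HOME/common/hf", "/private/hf"),
        ("$HOME/common/datasets", "/private/datasets"),
        ("$HOME", f"/private/users/{user_id}"),
    ]
    path = raw.strip()
    if not path:
        raise ValueError("empty model_source")
    for macro, target in macros:
        if path == macro or path.startswith(macro + "/"):
            path = target + path[len(macro):]
            break
    # Reject anything containing ".." (literal reading of the rule above).
    if ".." in path:
        raise ValueError("suspicious path: contains '..'")
    parts = PurePosixPath(path).parts
    allowed_roots = [
        ("/", "private", "hf"),              # common
        ("/", "private", "users", user_id),  # the caller's own directory
    ]
    if not any(parts[: len(root)] == root for root in allowed_roots):
        raise PermissionError(f"model_source not allowed: {path}")
    return path


print(resolve_model_source("$HOME/common/hf/my-model", "alice"))
# prints: /private/hf/my-model
```

In this sketch, `PermissionError` would correspond to a `403` response and `ValueError` to a `400`; that mapping is also an assumption.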

### 1.2 ServingModel (response body, JSON)

```json
{
  "model_key": "svc-alice-20260106-123000-abcd",
  "user_id": "alice",
  "model_id": "alice-202601061235-qwen-0.5b",
  "model_id_suffix": "qwen-0.5b",
  "model_id_prefix": "alice-202601061235",
  "model_source": "/private/hf/hub/models--.../snapshots/<sha>",
  "num_replicas": 1,
  "gpus_per_replica": 1,
  "total_gpus": 1,
  "state": "RUNNING",
  "endpoint": {
    "openai_base_url": "http://<host>:8000/v1",
    "model": "alice-202601061235-qwen-0.5b"
  },
  "error_summary": null,
  "created_at": "2026-01-06T12:30:00Z",
  "updated_at": "2026-01-06T12:31:02Z"
}
```

## 2. Management API (MVP API server)

### 2.1 Create / Upsert model

`POST /api/v2/serve/models`

#### Request

- Header: `Content-Type: application/yaml`
- Body: ServingSpec（YAML）

#### Response (202)

```json
{
  "model_key": "svc-alice-20260106-123000-abcd",
  "state": "QUEUED"
}
```

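Semantics:
- Creates a new model if the suffix does not exist yet.
- Otherwise updates the existing model (same user, same suffix): the replica/GPU settings are updated and the model enters `QUEUED` until the reconciler applies the change.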
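A minimal client-side sketch of the create call, using only Python's standard library. The helper name `build_create_request`, the host `localhost`, and the token value are illustrative placeholders; the `model_source` value is the elided path from section 1.1 and must be replaced with a real one.

```python
import urllib.request


def build_create_request(base_url: str, user_token: str, spec_yaml: str) -> urllib.request.Request:
    """Build (but do not send) POST /api/v2/serve/models with a YAML body."""
    return urllib.request.Request(
        url=f"{base_url}/api/v2/serve/models",
        data=spec_yaml.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {user_token}",
            "Content-Type": "application/yaml",
        },
    )


spec = """\
model_id: qwen-0.5b
model_source: $HOME/common/hf/.../<sha>
num_replicas: 1
gpus_per_replica: 1
"""
req = build_create_request("http://localhost:8080", "<user_token>", spec)

# Sending is left commented out so the sketch runs without a server:
# import json
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, json.load(resp))  # 202 and {"model_key": ..., "state": "QUEUED"} on success
```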

### 2.2 List models (current user)

`GET /api/v2/serve/models`

#### Response (200)

```json
{
  "items": [ ... ServingModel ... ],
  "openai_base_url": "http://<host>:8000/v1"
}
```

### 2.3 Get model detail

`GET /api/v2/serve/models/{model_key}`

#### Response (200)

```json
{
  "model": { ... ServingModel ... },
  "resolved_spec_yaml": "model_id: ...\nmodel_source: ...\n",
  "events": [
    { "event_type": "DEPLOY_REQUESTED", "created_at": "...", "payload": {...} }
  ],
  "serve_status": {
    "app_name": "argus_llm_app",
    "app_status": "RUNNING"
  }
}
```
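Because create and update requests return `QUEUED` and the reconciler applies them later, a client will typically poll this endpoint. The helper below is a sketch under assumptions: `wait_for_state` and `fetch_detail` are hypothetical names, the full set of states is not listed in this document, and treating a non-null `error_summary` as failure is an interpretation of section 4.

```python
import time
from typing import Callable


def wait_for_state(
    fetch_detail: Callable[[], dict],
    target: str = "RUNNING",
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll the detail endpoint until model.state equals target.

    fetch_detail must return the parsed JSON of
    GET /api/v2/serve/models/{model_key}.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        model = fetch_detail()["model"]
        if model["state"] == target:
            return model
        # Assumption: a populated error_summary means the deployment failed.
        if model.get("error_summary"):
            raise RuntimeError(f"deployment failed: {model['error_summary']}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"model still in state {model['state']!r}")
        sleep(interval_s)


# Usage with a stubbed fetcher (no server needed):
fake = iter([
    {"model": {"state": "QUEUED", "error_summary": None}},
    {"model": {"state": "RUNNING", "error_summary": None}},
])
print(wait_for_state(lambda: next(fake), sleep=lambda _: None)["state"])
# prints: RUNNING
```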

### 2.4 Scale replicas (PATCH)

`PATCH /api/v2/serve/models/{model_key}`

#### Request (JSON)

```json
{ "num_replicas": 2 }
```

#### Response (200)

```json
{ "model_key": "...", "state": "QUEUED" }
```

> v3.8 only supports changing `num_replicas` (and, optionally, `engine_kwargs`); changing `gpus_per_replica` may trigger a redeployment.

### 2.5 Delete / Undeploy model

`DELETE /api/v2/serve/models/{model_key}`

#### Response (200)

```json
{ "model_key": "...", "state": "DELETING" }
```

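Semantics: the model is removed from the declarative configuration; on its next tick the reconciler calls `serve.run(...)` to update the app configuration, after which the model is no longer visible.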

### 2.6 Admin: Serve cluster status (optional)

`GET /api/v2/serve/status`

#### Response (200)

Returns a summary of `serve.status()` (cluster level and app level).

> Accessible only with an admin token (reuses the v3.x admin gate).

## 3. Inference API (Ray Serve OpenAI ingress)

> v3.8 does not authenticate these endpoints: no `Authorization` header is required.

### 3.1 List models

`GET http://<host>:8000/v1/models`

Returns the list of available models, using the prefixed names (for example `alice-202601061235-qwen-0.5b`).

### 3.2 Chat completions

`POST http://<host>:8000/v1/chat/completions`

```json
{
  "model": "alice-202601061235-qwen-0.5b",
  "messages": [{"role":"user","content":"Hello"}],
  "stream": false
}
```
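The same request can be built from Python with the standard library. `build_chat_request` and the host `localhost` are illustrative placeholders; no `Authorization` header is set, matching the v3.8 convention.

```python
import json
import urllib.request


def build_chat_request(openai_base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat completion request."""
    body = {
        "model": model,  # full prefixed name, not just the suffix
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{openai_base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


chat_req = build_chat_request(
    "http://localhost:8000/v1", "alice-202601061235-qwen-0.5b", "Hello"
)

# Sending is left commented out so the sketch runs without a server.
# In the OpenAI chat completion format the reply text is normally found at
# choices[0].message.content:
# with urllib.request.urlopen(chat_req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```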

### 3.3 Completions / Embeddings

Provided to the extent supported by the Ray Serve LLM OpenAI ingress (v3.8 acceptance covers at least chat).

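## 4. Error code conventions (MVP API server)

- `400 invalid yaml/spec`: YAML parsing failed, a field is missing, or a value is invalid
- `403 forbidden`: path access violation (`model_source` points into another user's directory)
- `409 conflict`: `model_id_suffix` conflict (the same user creates the same suffix again and overwriting is not allowed; not returned if upsert is chosen)
- `422 unprocessable`: invalid resource parameters (replica/gpu <= 0)
- `500 internal`: reconciler/serve call failed (details are recorded in `serve_events` and written to `error_summary`)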
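To illustrate how the `400` and `422` cases might be told apart, here is a hedged sketch of a spec check. The function name `check_spec`, the order of the checks, and the requirement that replica and GPU counts be integers are all assumptions for illustration and are not taken from the platform's implementation.

```python
from typing import Optional, Tuple


def check_spec(spec: dict) -> Optional[Tuple[int, str]]:
    """Return (status, message) for the first problem found, or None if the spec passes."""
    # 400: required fields missing or empty.
    for field in ("model_id", "model_source"):
        value = spec.get(field)
        if not isinstance(value, str) or not value.strip():
            return 400, f"missing or empty required field: {field}"
    for field in ("num_replicas", "gpus_per_replica"):
        value = spec.get(field, 1)  # both default to 1 per section 1.1
        # 400: wrong type (bool is excluded because it is a subclass of int).
        if isinstance(value, bool) or not isinstance(value, int):
            return 400, f"{field} must be an integer"
        # 422: resource parameter <= 0.
        if value <= 0:
            return 422, f"{field} must be > 0"
    return None


print(check_spec({"model_id": "qwen-0.5b", "model_source": "/private/hf/m", "num_replicas": 0}))
# prints: (422, 'num_replicas must be > 0')
```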