argus-cluster/specs/mvp/v3.8/v3.8_api.md


# MVP v3.8 API Reference (Serving)
> Note: this section covers the **Model Serving** APIs added in v3.8 (Ray Serve LLM / vLLM).
> Authentication: the Serving management APIs reuse the existing MVP API authentication (`Authorization: Bearer <user_token>`).
> Inference: the public OpenAI endpoint performs **no authentication** (v3.8 convention).
## 0. Basics
### 0.1 Base URLs
- MVP API server: `http://<host>:8080`
- Ray Serve OpenAI ingress (fixed port 8000): `http://<host>:8000/v1`
### 0.2 Authentication
All `/api/v2/serve/*` endpoints require:
```
Authorization: Bearer <user_token>
```
The `user_token` is issued by an administrator via `/api/v2/users/<user_id>/tokens` (reusing the existing mechanism).
### 0.3 Naming rule: `model_id = <user_id>-YYYYMMDDHHMM-<suffix>`
- The user supplies `model_id` at submission time (semantically the suffix, e.g. `qwen-0.5b`)
- The platform generates the prefix:
  - `prefix = "<user_id>-<YYYYMMDDHHMM>"`
- The OpenAI model name actually exposed by the platform is:
  - `model_id = "<prefix>-<suffix>"`
- Example: `alice-202601061235-qwen-0.5b`
## 1. Data structures
### 1.1 ServingSpec (YAML)
The request body should use YAML (consistent with TaskSpec). Example:
```yaml
model_id: qwen-0.5b                      # required (suffix; the platform prepends the user_id- prefix)
model_source: $HOME/common/hf/.../<sha>  # required: local path or repo id (the platform applies $HOME macro substitution and path validation)
num_replicas: 1                          # optional, default 1
gpus_per_replica: 1                      # optional, default 1
# engine_kwargs:                         # optional: vLLM parameters passed through (allowlist/denylist decided by the implementation)
#   max_model_len: 8192
#   gpu_memory_utilization: 0.9
```
Notes:
- `accelerator_type` is not exposed in ServingSpec; it is injected uniformly from platform configuration (`dev.yaml`: `serving.llm.accelerator_type`) into Ray Serve LLM's `LLMConfig.accelerator_type` (dev/h1: `H20`).
#### Macro substitution
- `$HOME` → `/private/users/<user_id>`
- `$HOME/common/hf` → `/private/hf`
- `$HOME/common/datasets` → `/private/datasets` (not strictly required for serving, but kept for consistent semantics)
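The substitutions can be sketched as below (a minimal illustration; `expand_macros` is a hypothetical helper, and longest-prefix-first matching is an assumption of this sketch):

```python
def expand_macros(path: str, user_id: str) -> str:
    """Apply the v3.8 $HOME macro substitutions, longest prefix first."""
    rules = [
        ("$HOME/common/hf", "/private/hf"),
        ("$HOME/common/datasets", "/private/datasets"),
        ("$HOME", f"/private/users/{user_id}"),
    ]
    for macro, target in rules:
        if path == macro or path.startswith(macro + "/"):
            return target + path[len(macro):]
    return path

assert expand_macros("$HOME/common/hf/hub", "alice") == "/private/hf/hub"
assert expand_macros("$HOME/models/ckpt", "alice") == "/private/users/alice/models/ckpt"
```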
#### Path validation (v3.8 convention)
`model_source` allows:
- `/private/hf/...` (common)
- `/private/users/<user_id>/...` (the user's own directory)
Rejected:
- Other users' directories
- Paths not under `/private`
- Empty paths, or suspicious paths containing `..`
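The allow/reject rules can be sketched as below (`is_allowed_model_source` is a hypothetical helper; it assumes macros have already been expanded to absolute paths):

```python
def is_allowed_model_source(path: str, user_id: str) -> bool:
    """Validate model_source per the v3.8 rules (after macro substitution)."""
    # Reject empty paths and suspicious traversal.
    if not path or ".." in path:
        return False
    # Only the common HF cache and the user's own directory are allowed.
    allowed_roots = ("/private/hf/", f"/private/users/{user_id}/")
    return path.startswith(allowed_roots)

assert is_allowed_model_source("/private/hf/hub/m", "alice")
assert not is_allowed_model_source("/private/users/bob/ckpt", "alice")
assert not is_allowed_model_source("/tmp/model", "alice")
```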
### 1.2 ServingModel (response body, JSON)
```json
{
  "model_key": "svc-alice-20260106-123000-abcd",
  "user_id": "alice",
  "model_id": "alice-202601061235-qwen-0.5b",
  "model_id_suffix": "qwen-0.5b",
  "model_id_prefix": "alice-202601061235",
  "model_source": "/private/hf/hub/models--.../snapshots/<sha>",
  "num_replicas": 1,
  "gpus_per_replica": 1,
  "total_gpus": 1,
  "state": "RUNNING",
  "endpoint": {
    "openai_base_url": "http://<host>:8000/v1",
    "model": "alice-202601061235-qwen-0.5b"
  },
  "error_summary": null,
  "created_at": "2026-01-06T12:30:00Z",
  "updated_at": "2026-01-06T12:31:02Z"
}
```
## 2. Management API (MVP API server)
### 2.1 Create / Upsert model
`POST /api/v2/serve/models`
#### Request
- Header: `Content-Type: application/yaml`
- Body: ServingSpec (YAML)
#### Response (202)
```json
{
  "model_key": "svc-alice-20260106-123000-abcd",
  "state": "QUEUED"
}
```
Semantics:
- Create a new model (if the suffix does not exist)
- Or update an existing model (if the same user already has the same suffix): replicas/GPU and other configuration are updated, and the model enters `QUEUED` awaiting the next reconciler apply
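A client-side sketch of the create/upsert call, using only the Python standard library (`build_create_request` is a hypothetical helper; it builds the request without sending it, so the host and token here are placeholders):

```python
import urllib.request

def build_create_request(host: str, token: str, spec_yaml: str) -> urllib.request.Request:
    """Build (but do not send) the POST /api/v2/serve/models request."""
    return urllib.request.Request(
        url=f"http://{host}:8080/api/v2/serve/models",
        data=spec_yaml.encode("utf-8"),
        headers={
            "Content-Type": "application/yaml",
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )

spec = "model_id: qwen-0.5b\nmodel_source: /private/hf/hub/m\nnum_replicas: 1\n"
req = build_create_request("localhost", "user-token", spec)
assert req.get_method() == "POST"
# A real client would then call urllib.request.urlopen(req) and expect a 202 body
# like {"model_key": "...", "state": "QUEUED"}.
```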
### 2.2 List models (current user)
`GET /api/v2/serve/models`
#### Response (200)
```json
{
  "items": [ ... ServingModel ... ],
  "openai_base_url": "http://<host>:8000/v1"
}
```
### 2.3 Get model detail
`GET /api/v2/serve/models/{model_key}`
#### Response (200)
```json
{
  "model": { ... ServingModel ... },
  "resolved_spec_yaml": "model_id: ...\nmodel_source: ...\n",
  "events": [
    { "event_type": "DEPLOY_REQUESTED", "created_at": "...", "payload": {...} }
  ],
  "serve_status": {
    "app_name": "argus_llm_app",
    "app_status": "RUNNING"
  }
}
```
### 2.4 Scale replicas (PATCH)
`PATCH /api/v2/serve/models/{model_key}`
#### Request (JSON)
```json
{ "num_replicas": 2 }
```
#### Response (200)
```json
{ "model_key": "...", "state": "QUEUED" }
```
> v3.8 only supports modifying `num_replicas` (and, optionally, `engine_kwargs`). Changing `gpus_per_replica` may trigger a redeploy.
### 2.5 Delete / Undeploy model
`DELETE /api/v2/serve/models/{model_key}`
#### Response (200)
```json
{ "model_key": "...", "state": "DELETING" }
```
Semantics: removes the model from the declarative configuration; on the next tick, the reconciler triggers `serve.run(...)` to update the app configuration and eventually makes the model invisible.
### 2.6 Admin: Serve cluster status (optional)
`GET /api/v2/serve/status`
#### Response (200)
Returns a summary of `serve.status()` (cluster level + app level).
> Accessible only with an admin token (reusing the v3.x admin gate).
## 3. Inference API (Ray Serve OpenAI ingress)
> v3.8 performs no authentication: no `Authorization` header is required.
### 3.1 List models
`GET http://<host>:8000/v1/models`
Returns the list of available models (with prefixed names such as `alice-202601061235-qwen-0.5b`).
### 3.2 Chat completions
`POST http://<host>:8000/v1/chat/completions`
```json
{
  "model": "alice-202601061235-qwen-0.5b",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": false
}
```
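A matching client sketch using only the standard library (`build_chat_request` is a hypothetical helper; it constructs the request without sending it):

```python
import json
import urllib.request

def build_chat_request(host: str, model: str, content: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request.
    No Authorization header: the v3.8 ingress is unauthenticated."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"http://{host}:8000/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("localhost", "alice-202601061235-qwen-0.5b", "Hello")
assert req.full_url.endswith("/v1/chat/completions")
```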
### 3.3 Completions / Embeddings
Provided according to the Ray Serve LLM OpenAI ingress's supported scope (v3.8 acceptance covers at least chat).
## 4. Error code conventions (MVP API server)
- `400 invalid yaml/spec`: YAML parse failure, missing fields, invalid values
- `403 forbidden`: path violation (`model_source` accesses another user's directory)
- `409 conflict`: `model_id_suffix` conflict (when the same user creates a duplicate and overwriting is not allowed; not returned if upsert semantics are chosen)
- `422 unprocessable`: invalid resource parameters (replica/gpu <= 0)
- `500 internal`: reconciler/serve call failure (details recorded in `serve_events` and written to `error_summary`)