# MVP v3.8 API Reference (Serving)

> Note: this section documents the **Model Serving** API (Ray Serve LLM / vLLM) added in v3.8.
> Authentication: the Serving management API reuses the existing MVP API authentication (`Authorization: Bearer <user_token>`).
> Inference: the public OpenAI endpoint performs **no authentication** (v3.8 convention).

## 0. Basics

### 0.1 Base URLs

- MVP API server: `http://<host>:8080`
- Ray Serve OpenAI ingress (fixed port 8000): `http://<host>:8000/v1`

### 0.2 Authentication

All `/api/v2/serve/*` endpoints require:

```
Authorization: Bearer <user_token>
```

The `user_token` is issued by an administrator via `/api/v2/users/<user_id>/tokens` (the existing mechanism).

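As an illustration of the header requirement above, here is a minimal client-side sketch that builds (but does not send) an authenticated request. The helper name `build_serve_request`, the `localhost` host, and the example token are assumptions made for this example, not part of the API.

```python
import urllib.request

# Assumption: <host> is replaced with localhost for illustration.
MVP_API_BASE = "http://localhost:8080"


def build_serve_request(path: str, user_token: str, method: str = "GET") -> urllib.request.Request:
    """Build (without sending) a request to a /api/v2/serve/* path with a Bearer token."""
    if not path.startswith("/api/v2/serve/"):
        raise ValueError("expected a /api/v2/serve/* path")
    return urllib.request.Request(
        MVP_API_BASE + path,
        method=method,
        headers={"Authorization": f"Bearer {user_token}"},
    )


# "example-token" is a placeholder, not a real token.
req = build_serve_request("/api/v2/serve/models", "example-token")
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without a server.
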
### 0.3 Naming rule: `model_id = <user_id>-<YYYYMMDDHHMM>-<suffix>`

- On submission, the user supplies `model_id`, which is treated as the suffix (for example, `qwen-0.5b`).
- The platform generates the prefix:
  - `prefix = "<user_id>-<YYYYMMDDHHMM>"`
- The OpenAI model name that the platform actually exposes is:
  - `model_id = "<prefix>-<suffix>"`
  - Example: `alice-202601061235-qwen-0.5b`

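The naming rule can be sketched in a few lines. The function name `build_model_id` is hypothetical, and the source does not say which time zone the timestamp uses, so the caller passes the time explicitly.

```python
from datetime import datetime


def build_model_id(user_id: str, suffix: str, now: datetime) -> str:
    """Compose "<user_id>-<YYYYMMDDHHMM>-<suffix>" as described by the naming rule."""
    prefix = f"{user_id}-{now.strftime('%Y%m%d%H%M')}"
    return f"{prefix}-{suffix}"


# Reproduces the example name from the naming rule.
example_id = build_model_id("alice", "qwen-0.5b", datetime(2026, 1, 6, 12, 35))
```
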
## 1. Data structures

### 1.1 ServingSpec (YAML)

YAML is the recommended request body format (consistent with TaskSpec). Example:

```yaml
model_id: qwen-0.5b                      # required: suffix (the platform prepends "<user_id>-<YYYYMMDDHHMM>-")
model_source: $HOME/common/hf/.../<sha>  # required: local path or repo id; the platform expands $HOME macros and validates the path
num_replicas: 1                          # optional, default 1
gpus_per_replica: 1                      # optional, default 1
# engine_kwargs:                         # optional: passed through to vLLM (allowlist/denylist is implementation-defined)
#   max_model_len: 8192
#   gpu_memory_utilization: 0.9
```

Notes:

- `accelerator_type` is not exposed in ServingSpec. The platform configuration (`serving.llm.accelerator_type` in `dev.yaml`) injects it into `LLMConfig.accelerator_type` of Ray Serve LLM (dev/h1: `H20`).

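The required fields and defaults of ServingSpec could be applied roughly as follows. This is a sketch that operates on an already-parsed mapping (YAML parsing itself is left out); `normalize_spec` is a hypothetical helper, not the platform's implementation.

```python
REQUIRED_FIELDS = ("model_id", "model_source")
DEFAULTS = {"num_replicas": 1, "gpus_per_replica": 1}


def normalize_spec(spec: dict) -> dict:
    """Check required fields and fill in the documented defaults."""
    missing = [name for name in REQUIRED_FIELDS if not spec.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    # Values supplied by the user override the defaults.
    return {**DEFAULTS, **spec}


normalized = normalize_spec({"model_id": "qwen-0.5b", "model_source": "$HOME/common/hf/model"})
```
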
#### Macro substitution

- `$HOME` → `/private/users/<user_id>`
- `$HOME/common/hf` → `/private/hf`
- `$HOME/common/datasets` → `/private/datasets` (serving does not strictly depend on it, but the semantics are kept consistent)

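One detail the table above implies is ordering: the more specific `$HOME/common/...` macros need to be tried before the bare `$HOME` rule. A minimal sketch, assuming macros only appear at the start of the path (`expand_macros` is a hypothetical name):

```python
def expand_macros(model_source: str, user_id: str) -> str:
    """Expand a leading $HOME macro, trying the most specific rule first."""
    rules = [
        ("$HOME/common/hf", "/private/hf"),
        ("$HOME/common/datasets", "/private/datasets"),
        ("$HOME", f"/private/users/{user_id}"),
    ]
    for macro, target in rules:
        # Match only on a path-component boundary.
        if model_source == macro or model_source.startswith(macro + "/"):
            return target + model_source[len(macro):]
    return model_source
```
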
#### Path validation (v3.8 convention)

`model_source` may be:

- `/private/hf/...` (common)
- `/private/users/<user_id>/...` (user)

Rejected:

- Other users' directories
- Paths outside `/private`
- Empty paths, or suspicious paths containing `..`

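The allow and reject rules above can be sketched as a single predicate. This covers local paths only (how repo ids are handled is not specified here), and `is_allowed_model_source` is a hypothetical name, not the platform's function.

```python
from pathlib import PurePosixPath


def is_allowed_model_source(path: str, user_id: str) -> bool:
    """Return True only for paths under /private/hf or the caller's own user directory."""
    # Reject empty paths and anything containing "..".
    if not path or ".." in path:
        return False
    candidate = PurePosixPath(path)
    allowed_roots = (
        PurePosixPath("/private/hf"),
        PurePosixPath("/private/users") / user_id,
    )
    # The path must sit strictly below one of the allowed roots.
    return any(root in candidate.parents for root in allowed_roots)
```
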
### 1.2 ServingModel (response body, JSON)

```json
{
  "model_key": "svc-alice-20260106-123000-abcd",
  "user_id": "alice",
  "model_id": "alice-202601061235-qwen-0.5b",
  "model_id_suffix": "qwen-0.5b",
  "model_id_prefix": "alice-202601061235",
  "model_source": "/private/hf/hub/models--.../snapshots/<sha>",
  "num_replicas": 1,
  "gpus_per_replica": 1,
  "total_gpus": 1,
  "state": "RUNNING",
  "endpoint": {
    "openai_base_url": "http://<host>:8000/v1",
    "model": "alice-202601061235-qwen-0.5b"
  },
  "error_summary": null,
  "created_at": "2026-01-06T12:30:00Z",
  "updated_at": "2026-01-06T12:31:02Z"
}
```

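A client might sanity-check a ServingModel document along these lines. The relation `total_gpus = num_replicas * gpus_per_replica` is an assumption inferred from the example values, and `check_serving_model` is a hypothetical helper.

```python
def check_serving_model(doc: dict) -> list[str]:
    """Return a list of inconsistencies found in a ServingModel document."""
    problems = []
    if doc["model_id"] != f'{doc["model_id_prefix"]}-{doc["model_id_suffix"]}':
        problems.append("model_id is not <prefix>-<suffix>")
    if doc["endpoint"]["model"] != doc["model_id"]:
        problems.append("endpoint.model differs from model_id")
    # Assumption: total_gpus is the product of replicas and GPUs per replica.
    if doc["total_gpus"] != doc["num_replicas"] * doc["gpus_per_replica"]:
        problems.append("total_gpus != num_replicas * gpus_per_replica")
    return problems


sample = {
    "model_id": "alice-202601061235-qwen-0.5b",
    "model_id_prefix": "alice-202601061235",
    "model_id_suffix": "qwen-0.5b",
    "num_replicas": 1,
    "gpus_per_replica": 1,
    "total_gpus": 1,
    "endpoint": {"model": "alice-202601061235-qwen-0.5b"},
}
```
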
## 2. Management API (MVP API server)

### 2.1 Create / Upsert model

`POST /api/v2/serve/models`

#### Request

- Header: `Content-Type: application/yaml`
- Body: ServingSpec (YAML)

#### Response (202)

```json
{
  "model_key": "svc-alice-20260106-123000-abcd",
  "state": "QUEUED"
}
```

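Semantics:

- Creates a new model if the suffix does not exist.
- Otherwise (the same user already has a model with that suffix), updates the existing model's configuration (replicas, GPUs, and so on); the model enters `QUEUED` and waits for the reconciler to apply the change.
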
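The create-or-update behaviour can be pictured with an in-memory store keyed by user and suffix. This is only a sketch of the semantics; the real server presumably persists records and assigns a `model_key`, neither of which is modelled here. `upsert_model` is a hypothetical name.

```python
def upsert_model(store: dict, user_id: str, spec: dict) -> tuple[dict, bool]:
    """Create or update the record for (user_id, suffix); return (record, created)."""
    key = (user_id, spec["model_id"])  # spec["model_id"] carries the suffix
    created = key not in store
    record = store.setdefault(key, {})
    record.update(spec)
    # Both paths leave the model queued for the reconciler.
    record["state"] = "QUEUED"
    return record, created


store: dict = {}
first, first_created = upsert_model(store, "alice", {"model_id": "qwen-0.5b", "num_replicas": 1})
second, second_created = upsert_model(store, "alice", {"model_id": "qwen-0.5b", "num_replicas": 2})
```
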
### 2.2 List models (current user)

`GET /api/v2/serve/models`

#### Response (200)

```json
{
  "items": [ ... ServingModel ... ],
  "openai_base_url": "http://<host>:8000/v1"
}
```

### 2.3 Get model detail

`GET /api/v2/serve/models/{model_key}`

#### Response (200)

```json
{
  "model": { ... ServingModel ... },
  "resolved_spec_yaml": "model_id: ...\nmodel_source: ...\n",
  "events": [
    { "event_type": "DEPLOY_REQUESTED", "created_at": "...", "payload": {...} }
  ],
  "serve_status": {
    "app_name": "argus_llm_app",
    "app_status": "RUNNING"
  }
}
```

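Because create and scale requests return `QUEUED`, a client will typically poll the detail endpoint until the model reaches the state it wants. A minimal sketch, assuming a non-null `error_summary` signals failure; `wait_for_state` and its parameters are hypothetical, and `fetch_detail` stands in for a call to `GET /api/v2/serve/models/{model_key}`.

```python
import time
from typing import Callable


def wait_for_state(
    fetch_detail: Callable[[], dict],
    target: str = "RUNNING",
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll the detail response until model.state equals target, or fail."""
    deadline = time.monotonic() + timeout_s
    while True:
        model = fetch_detail()["model"]
        if model["state"] == target:
            return model
        if model.get("error_summary"):
            raise RuntimeError(f"deployment failed: {model['error_summary']}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"model did not reach {target} within {timeout_s}s")
        sleep(interval_s)
```

The `sleep` parameter is injectable so the loop can be exercised without real delays.
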
### 2.4 Scale replicas (PATCH)

`PATCH /api/v2/serve/models/{model_key}`

#### Request (JSON)

```json
{ "num_replicas": 2 }
```

#### Response (200)

```json
{ "model_key": "...", "state": "QUEUED" }
```

> v3.8 only supports modifying `num_replicas` (and, optionally, `engine_kwargs`); modifying `gpus_per_replica` may trigger a redeployment.

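A scale request might be assembled on the client as follows. The `Content-Type: application/json` header, the `localhost` host, and the helper name `build_scale_request` are assumptions for this sketch; the request is built but not sent.

```python
import json
import urllib.request

# Assumption: <host> is replaced with localhost for illustration.
MVP_API_BASE = "http://localhost:8080"


def build_scale_request(model_key: str, num_replicas: int, user_token: str) -> urllib.request.Request:
    """Build (without sending) a PATCH request that changes num_replicas."""
    # Reject non-positive values early instead of waiting for a server-side error.
    if isinstance(num_replicas, bool) or not isinstance(num_replicas, int) or num_replicas <= 0:
        raise ValueError("num_replicas must be a positive integer")
    return urllib.request.Request(
        f"{MVP_API_BASE}/api/v2/serve/models/{model_key}",
        data=json.dumps({"num_replicas": num_replicas}).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {user_token}",
            "Content-Type": "application/json",
        },
    )


scale_req = build_scale_request("svc-alice-20260106-123000-abcd", 2, "example-token")
```
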
### 2.5 Delete / Undeploy model

`DELETE /api/v2/serve/models/{model_key}`

#### Response (200)

```json
{ "model_key": "...", "state": "DELETING" }
```

Semantics: removes the model from the declarative configuration. On its next tick, the reconciler calls `serve.run(...)` to update the app configuration, after which the model eventually stops being visible.

### 2.6 Admin: Serve cluster status (optional)

`GET /api/v2/serve/status`

#### Response (200)

Returns a summary of `serve.status()` (cluster level and app level).

> Accessible only with an admin token (reuses the v3.x admin gate).

## 3. Inference API (Ray Serve OpenAI ingress)

> v3.8 performs no authentication: no `Authorization` header is needed.

### 3.1 List models

`GET http://<host>:8000/v1/models`

Returns the list of available models, using prefixed names such as `alice-202601061235-qwen-0.5b`.

### 3.2 Chat completions

`POST http://<host>:8000/v1/chat/completions`

```json
{
  "model": "alice-202601061235-qwen-0.5b",
  "messages": [{"role":"user","content":"Hello"}],
  "stream": false
}
```

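The request above could be built from Python with the standard library alone. This is a sketch: the `localhost` host and the helper name `build_chat_request` are assumptions, and the request is constructed but not sent.

```python
import json
import urllib.request

# Assumption: <host> is replaced with localhost for illustration.
OPENAI_BASE_URL = "http://localhost:8000/v1"


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (without sending) a non-streaming chat completion request."""
    body = {
        "model": model,  # the full prefixed model_id
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    # No Authorization header: the v3.8 ingress is unauthenticated.
    return urllib.request.Request(
        f"{OPENAI_BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


chat_req = build_chat_request("alice-202601061235-qwen-0.5b", "Hello")
```
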
### 3.3 Completions / Embeddings

Provided to the extent that the Ray Serve LLM OpenAI ingress supports them (v3.8 acceptance covers at least chat).

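## 4. Error code conventions (MVP API server)

- `400 invalid yaml/spec`: YAML parsing failed, a field is missing, or a value is invalid
- `403 forbidden`: path access violation (`model_source` points into another user's directory)
- `409 conflict`: `model_id_suffix` conflict (the same user creates a duplicate and overwriting is not allowed; not returned if upsert is chosen)
- `422 unprocessable`: invalid resource parameters (replica/gpu <= 0)
- `500 internal`: reconciler/serve call failed (details are recorded in `serve_events` and written to `error_summary`)
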
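A client could turn these status codes into a single exception type, for example as below. `ServeApiError` and `raise_for_status` are hypothetical names for this sketch; the short meanings are copied from the list above.

```python
ERROR_MEANINGS = {
    400: "invalid yaml/spec",
    403: "forbidden",
    409: "conflict",
    422: "unprocessable",
    500: "internal",
}


class ServeApiError(Exception):
    """Raised for error responses from the MVP API server."""

    def __init__(self, status: int, detail: str = "") -> None:
        self.status = status
        self.meaning = ERROR_MEANINGS.get(status, "unknown")
        message = f"{status} {self.meaning}"
        if detail:
            message += f": {detail}"
        super().__init__(message)


def raise_for_status(status: int, detail: str = "") -> None:
    """Raise ServeApiError for any 4xx/5xx status; do nothing otherwise."""
    if status >= 400:
        raise ServeApiError(status, detail)
```
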