v3.0 development verification is complete; it meets the WebUI and minimal data-management requirements. The task templates are still fairly basic and can be extended before the release is opened up for general use.
This commit is contained in:
parent 7b89d60b3b
commit 6d3fefc7a6
BIN  specs/mvp/image/Snipaste_2025-12-30_16-06-37.png  (new file, 76 KiB, binary not shown)
BIN  specs/mvp/image/Snipaste_2025-12-30_17-02-20.png  (new file, 52 KiB, binary not shown)
BIN  specs/mvp/image/roadmap_v3.0.png  (new file, 65 KiB, binary not shown)
15  specs/mvp/v3.0/README.md  (new file)
@@ -0,0 +1,15 @@

# MVP v3.0 (Design) — WebUI + user data upload/download (SFTPGo) → first releasable version

This directory builds on:

- `specs/mvp/mvp_roadmap_v2.md` (overall roadmap)
- `specs/mvp/image/roadmap_v3.0.png` (v3.0 iteration diagram)
- the already-delivered v2.5 (User Mgmt + Stateless Ray Node Pool)

The goal is to extend v2.5 with a closed **user-data loop** (upload → visible to training → artifact download) plus a minimal usable **WebUI**, producing a "releasable" v3.0.

Documents:

- `specs/mvp/v3.0/v3.0_design.md`: overall architecture and key mechanisms (WebUI, SFTPGo, data/permission model, task flow).
- `specs/mvp/v3.0/v3.0_api.md`: v3.0 API extension design (UI, data, SFTPGo administration, permission constraints).
- `specs/mvp/v3.0/v3.0_acceptance.md`: deployment/upgrade/acceptance procedure with verifiable criteria (including fault injection and a regression checklist).
- `specs/mvp/v3.0/v3.0_dev_plan.md`: TDD-driven engineering plan (milestone breakdown, test layering, E2E acceptance).
- `specs/mvp/v3.0/v3.0_progress.md`: implementation progress log (one entry appended after each milestone).
55  specs/mvp/v3.0/v3.0_acceptance.md  (new file)
@@ -0,0 +1,55 @@

# MVP v3.0 — Deployment and Acceptance Procedure (draft)

## 0) Environment prerequisites

- Ray cluster: the v2.5 head + stateless workers (auto join)
- Shared storage: `/private` mounted inside containers (dev/prod aligned)
- API server: host code mounted into the head container and started inside it
- New: SFTPGo service (containerized deployment recommended)

## 1) Deployment steps (high level)

1) Deploy/upgrade the Ray node image (reuse v2.5's `argus/argus-ray-node:v2.5` or newer)
2) Start the Ray cluster (compose, or containers created by the platform)
3) Start/configure SFTPGo (mounting `/private`)
4) Start the API server (inside the head container)
5) Start the WebUI (served by the API server)

## 2) Acceptance cases (must pass)

### A. Users and credentials

1) admin creates user `alice` and issues an API token
2) The system creates `alice` in SFTPGo in lockstep (home=/private/users/alice)
3) `alice` logs in to the WebUI with the token (or calls `/api/v2/me` successfully)

### B. Upload-data loop (core)

1) `alice` uploads a dataset via SFTP to `/private/users/alice/datasets/...`
2) `alice` submits a task via WebUI/API whose TaskSpec references that path
3) A Ray worker reads the data; the task reaches RUNNING and eventually SUCCEEDED

### C. Download-artifacts loop

1) After training completes, artifacts land in `/private/users/alice/jobs/<submission_id>/...`
2) `alice` downloads checkpoints/logs via SFTP successfully
3) (New) `alice` moves weights worth keeping from `jobs/<submission_id>/...` to `models/` and confirms they persist after the move

### C2. Jobs trash and automatic cleanup (move to trash after 3 days, purge 7 days later)

1) Set `jobs_trash_after_days`/`jobs_purge_after_days` to small values (e.g. minutes, for verification)
2) A training run finishes and enters a terminal state
3) After the API server's built-in janitor scan cycle, confirm the corresponding `jobs/<submission_id>` is moved to `trash/jobs/<submission_id>`
4) Within the trash window, move a file from `trash/jobs/<submission_id>` to `models/` and confirm the move succeeds
5) After `jobs_purge_after_days` elapses, confirm `trash/jobs/<submission_id>` is permanently deleted
6) Confirm files already moved to `models/` are not deleted
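The C2 timing rules can be exercised with a small helper. The sketch below is illustrative (the function name is hypothetical, not part of the spec): it mirrors the two-phase rule in which the trash deadline is measured from the job's terminal state and the purge deadline from the move into the trash.

```python
from datetime import datetime, timedelta

def retention_deadlines(ended_at: datetime,
                        trash_after_days: int = 3,
                        purge_after_days: int = 7) -> tuple[datetime, datetime]:
    """Two-phase retention sketch: move to trash `trash_after_days` after the
    job ends; purge `purge_after_days` after the move (10-day window total)."""
    trash_at = ended_at + timedelta(days=trash_after_days)
    purge_at = trash_at + timedelta(days=purge_after_days)
    return trash_at, purge_at

ended = datetime(2025, 12, 30, 12, 0)
trash_at, purge_at = retention_deadlines(ended)
# trash_at -> 2026-01-02 12:00, purge_at -> 2026-01-09 12:00
```

Lowering `trash_after_days`/`purge_after_days` to fractional-day (minute-level) values is what step 1 of C2 relies on for fast verification.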
### D. Security isolation (must pass)

1) `bob` cannot query `alice`'s task via the API (404)
2) `bob` cannot submit a TaskSpec referencing `/private/users/alice/...` (400/403)
3) `bob` cannot reach `/private/users/alice/...` over SFTP (chroot in effect)

## 3) Fault injection (should pass)

1) kill the worker watchdog or raylet → the worker recovers and rejoins the cluster automatically
2) restart the head container → head rewrites `head.json`; workers reconnect automatically
3) restart SFTPGo → the Ray cluster is unaffected; users can reconnect to upload/download

## 4) Regression checklist (same as v2.5)

- Task queue and retry (INSUFFICIENT_RESOURCES → PENDING_RESOURCES → retry)
- All three workloads (PPO/GRPO/SFT) run to completion
- The head runs no training (drivers are forced onto workers)
109  specs/mvp/v3.0/v3.0_api.md  (new file)
@@ -0,0 +1,109 @@

# MVP v3.0 — API Extension Design (based on v2.5)

The v3.0 principle: **reuse the v2.5 API as much as possible**, adding only the minimal interfaces needed for the "data loop" and "WebUI support".

## 1) Authentication and permissions

Carried over from v2.5:

- Header: `Authorization: Bearer <token>`
- admin token: from `MVP_INTERNAL_TOKEN`
- regular user tokens: issued by admin and persisted in SQLite

Permission rules:

- Non-admin users can access only their own tasks and their own data space (`/private/users/<user_id>/...`).
- Cross-user access returns 404 (existence is not leaked).

## 2) User and SFTPGo coordination (admin endpoints)

### 2.1 Create user (reused from v2.5)

`POST /api/v2/users`

- v3.0 behavior: on success, **optionally** create the matching SFTPGo user
- v3.0 enables the coordination by default: create the SFTPGo user and generate a one-time password (password auth)
- v3.0 keeps only this approach (option A): no external auth/SSO integration (deferred to a later version)
- `data.sftpgo.admin_api_base` is recommended to look like `http://argus-sftpgo:8080/api/v2` (including the `/api/v2` prefix)

### 2.2 Issue token (reused from v2.5)

`POST /api/v2/users/{user_id}/tokens`

### 2.3 Disable user (reused from v2.5)

`POST /api/v2/users/{user_id}:disable`

- v3.0 behavior: also disable the SFTPGo user (optional)

### 2.4 SFTP credential management (new; admin-operated or self-service)

(Whether v3.0 needs "user self-service" or "admin-operated" is for you to confirm.)

#### Reset SFTP password (admin)

`POST /api/v2/users/{user_id}/sftp:reset_password`

- Returns: a one-time password (returned exactly once; the server stores no plaintext)

> v3.0 ships password-only; SSH public keys are an optional later enhancement (out of v3.0 scope).
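A one-time password of this kind can be generated with Python's `secrets` module. The helper below is a hypothetical sketch, not the actual server code:

```python
import secrets

def generate_one_time_password(nbytes: int = 16) -> str:
    # URL-safe random string. Per the rule above, the server returns it
    # exactly once and keeps no plaintext copy (storage is delegated to
    # SFTPGo's own credential handling).
    return secrets.token_urlsafe(nbytes)

pw = generate_one_time_password()
```

`secrets` (rather than `random`) is the important design choice here: it draws from the OS CSPRNG, which is what credential material requires.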
## 3) User self-service information (new)

### 3.1 Get current user info

`GET /api/v2/me`

- Example response:

```json
{
  "user_id": "alice",
  "is_admin": false,
  "paths": {
    "home": "/private/users/alice",
    "datasets": "/private/users/alice/datasets",
    "models": "/private/users/alice/models",
    "code": "/private/users/alice/code",
    "jobs": "/private/users/alice/jobs",
    "trash_jobs": "/private/users/alice/trash/jobs"
  },
  "retention": {
    "jobs_trash_after_days": 3,
    "jobs_purge_after_days": 7
  },
  "sftp": {
    "host": "h1.example.internal",
    "port": 2022,
    "username": "alice"
  }
}
```

### 3.2 Jobs retention hints (new)

To support WebUI display and user expectation management, `/api/v2/me` (or a dedicated endpoint) can return:

- `jobs_trash_after_days`: default 3
- `jobs_purge_after_days`: default 7
- `jobs_root`: `/private/users/<me>/jobs`
- `trash_jobs_root`: `/private/users/<me>/trash/jobs`
- `recommendations`: remind users to move artifacts worth keeping to `models/` or `datasets/`

## 4) Data browsing/download (optional; v3.0 keeps it minimal)

Note: SFTP remains the main channel for uploads and downloads.
If the WebUI wants "quick browse/view", implement read-only endpoints (avoiding large uploads, resumable transfers, and similar complexity).

### 4.1 List a directory

`GET /api/v2/files?path=/private/users/alice`

- Permission: path must be under `/private/common/` or `/private/users/<me>/`
- Returns: a file list (name/type/size/mtime)

### 4.2 Download a file (small files)

`GET /api/v2/files:download?path=/private/users/alice/jobs/.../logs/...`

- Returns: a streamed download
- Large files should still go through SFTP
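The permission rule in 4.1 reduces to a prefix check on the normalized path. The sketch below separates that check from the listing itself; function names and the exact response shape are illustrative assumptions, not the real handlers:

```python
import os

def path_allowed(path: str, user_id: str) -> bool:
    """Browse/download rule (sketch): the normalized path must sit under
    /private/common/ or /private/users/<user_id>/. Normalizing first is
    what rejects traversal like /private/users/alice/../bob."""
    norm = os.path.normpath(path)
    for root in ("/private/common", f"/private/users/{user_id}"):
        if norm == root or norm.startswith(root + os.sep):
            return True
    return False

def list_dir(path: str, user_id: str) -> list[dict]:
    """Read-only listing returning name/type/size/mtime per entry."""
    if not path_allowed(path, user_id):
        raise PermissionError(path)
    out = []
    with os.scandir(os.path.normpath(path)) as it:
        for e in it:
            st = e.stat(follow_symlinks=False)
            out.append({"name": e.name,
                        "type": "dir" if e.is_dir(follow_symlinks=False) else "file",
                        "size": st.st_size,
                        "mtime": int(st.st_mtime)})
    return out
```

Doing the check on `os.path.normpath(path)` before any filesystem access keeps the handler from even stat-ing paths outside the allowed trees.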
## 5) TaskSpec path validation upgrade (key for v3.0)

v2.5: only `/private/common/...` is allowed.
v3.0: both `/private/common/...` and `/private/users/<me>/...` are allowed.

Applies to (at least) these fields:

- `train_file` / `val_file`
- `code_path`: still `/private/common/...` only (v3.0 does not execute user code)
- local model path fields (if introduced): allow `/private/users/<me>/models/...`

## 6) WebUI routes (new)

Served by the API server:

- `GET /ui`: main page
- `GET /ui/login`: token login page
- static assets: `/ui/static/...`

All WebUI operations call the same-origin API (no extra CORS).
358  specs/mvp/v3.0/v3.0_design.md  (new file)
@@ -0,0 +1,358 @@

# MVP v3.0 Detailed Design (based on v2.5)

## 0. Summary (what v3.0 delivers)

v3.0 = v2.5 + **WebUI** + **user data upload/download (SFTPGo)**, forming the first externally releasable version:

- Users can upload data/models/code (at minimum data) via **SFTP**; it lands on GPFS (`/private` inside containers) and is visible to Ray workers.
- Users can submit training tasks via API/WebUI; tasks read the data they uploaded.
- Users can download training artifacts (checkpoints/logs etc.); the minimal loop runs end to end.

## 1. Scope and principles

### 1.1 Inherited from v2.5 (no regressions)

- **Stateless Ray Node Pool**: head writes `head.json`; worker watchdogs auto-join and self-heal.
- **User Management**: token auth; task visibility isolation (cross-user access returns 404, leaking nothing).
- **Job artifact isolation**: Ray job directories land in `/private/users/<user_id>/jobs/<ray_submission_id>/...`.
- **API server short-term run mode**: code lives on the host, is mounted into the head container, and starts inside it (status quo).

### 1.2 New v3.0 goals

1) **Data Management (SFTPGo)**
   - Provide a user upload/download entry point (SFTP first).
   - Data lands on GPFS (NFS/GPFS in dev, GPFS in production); training jobs read it directly inside worker containers.
2) **WebUI**
   - Users can visually create tasks, view the queue/status/logs, and see the "data path conventions" and their own SFTP info.
   - The goal is "usable, not fancy": support the core workflow.
3) **Permission loop**
   - Users may only use data under their own directory (`/private/users/<user_id>/...`) or the common directory (`/private/common/...`).
   - Prevent submitted tasks from reading other users' file paths.

### 1.3 Explicitly out of scope for v3.0 (deferred to v3.5)

- No "custom reward functions / custom verl code / multiple coexisting verl versions" (roadmap v3.5).
- No complex serving or unified training+inference (roadmap v3.5).
- No IB network/topology optimization (roadmap v3.5).
- No system-level observability platform (roadmap v4.0).

## 2. Architecture overview

See `roadmap_v3.0.png`; the v3.0 control plane and data plane:

### 2.1 Control Plane

- **API Server (FastAPI)**
  - Reuses the v2.5 task queue/scheduling/retry plus user management
  - New: data management (SFTPGo integration) plus the WebUI
- **WebUI**
  - Logs in through the API with a token
  - Entry point for tasks/logs/data (does not run training itself)
- **Ray Head (stateful node)**
  - Still inside the head container (or a dedicated node)
  - The job server/dashboard provides job submit/status/logs

### 2.2 Data Plane

- **GPFS (mounted at `/private` in containers)**
  - Holds the two root trees, common and users
- **Ray Worker Nodes (stateless)**
  - Connect to the head automatically and run training
  - Read data under `/private/users/<user>/...`

### 2.3 New component: SFTPGo (Data Management)

- Runs as an independent service (containerized preferred); the storage backend is **filesystem** (the GPFS mount path).
- Each user's home directory points to `/private/users/<user_id>` (or a subdirectory).

## 3. Storage and directory conventions (unified for v3.0)

### 3.1 Directory layout

`/private` inside containers is the uniform root (dev/prod aligned):

- `/private/common/`: shared resources
  - `hf/`: HF cache
  - `datasets/`: shared datasets (optional)
  - `code/`: shared code (e.g. a shared verl repo snapshot)
  - `db/`: SQLite (queue, users, tokens)
  - `logs/`: API/supervisor/watchdog logs
- `/private/users/<user_id>/`: user space (the v3.0 focus)
  - `datasets/`: user-uploaded datasets (recommended)
  - `models/`: user-saved/uploaded local models (allowed; also the long-term home for job artifacts moved out of `jobs/`)
  - `code/`: user-uploaded code (**not executable** in v3.0; storage/download only)
  - `jobs/`: training artifacts (delivered in v2.5)
  - `tmp/`: scratch files (optional)

### 3.2 Jobs Retention (two phases: trash after 3 days, purge 7 days later)

v3.0 introduces a **two-phase retention policy for the jobs directory**:

- Phase 1 (soft delete): **3 days** after a job ends, its directory is **moved from `jobs/` into the user's trash directory**;
- Phase 2 (hard delete): **7 days** after entering the trash, it is **permanently deleted** from there.

Directory conventions (suggested):

- jobs root: `/private/users/<user_id>/jobs/<ray_submission_id>/...`
- trash: `/private/users/<user_id>/trash/jobs/<ray_submission_id>/...`

Timing rules:

- The clock starts when the job reaches a terminal state (SUCCEEDED/FAILED/CANCELED);
- the "3 days" govern the move from `jobs/` into `trash/jobs/`;
- the "7 days" govern permanent deletion from `trash/jobs/` (so at most a 10-day window in total).

How users keep important artifacts (no keep flag needed):

- Within the 3-day window, **move/copy** files worth keeping from `jobs/<submission_id>/...` to `models/` (e.g. weights) or `datasets/` (e.g. evaluation outputs);
- even after a directory lands in the trash, users can still move files out of `trash/jobs/<submission_id>/...` into `models/` / `datasets/` during the 7-day window;
- the janitor manages only `jobs/` and `trash/jobs/`; it never touches `models/` or `datasets/`.

We call the cleanup program the **janitor**:

- Definition: a background cleanup executor that scans "ended and expired" job directories on a fixed cycle and removes them
- v3.0 target: implement exactly the "trash after 3 days + delete 7 days later" product rule (no keep/extend-retention flag)

Implementation suggestions (per your preference):

- Run the **janitor as a built-in background thread of the API server**:
  - Pros: natural access to SQLite (task state, end time, user_id, ray_submission_id), and it can write cleanup results back to the events table for auditing
  - Simpler deployment: no extra cronjob or standalone service
- Perform the delete/move **directly on the GPFS/NFS filesystem** (the API server runs in the head container with `/private` mounted):
  - Phase 1: `os.rename` (atomic within one filesystem) moves `jobs/<sid>` to `trash/jobs/<sid>`;
    - if the move crosses filesystems (which should not happen), fall back to copy+delete;
    - validate the strict path prefix before moving (must be under `.../users/<u>/jobs/`).
  - Phase 2: recursively delete `trash/jobs/<sid>` (e.g. `shutil.rmtree`), again with a prefix check (must be under `.../users/<u>/trash/jobs/`).
- Why not rely on the SFTPGo API: SFTPGo is only the user-facing access layer (SFTP/Web); the directories physically live on the same filesystem, so direct filesystem access is simpler and does not depend on SFTPGo being online.
- If you strongly prefer "delete via the SFTPGo API":
  - it can be an optional/supplementary implementation (e.g. for unified auditing or future quota/policy hooks), but it should not be the only mechanism (an SFTPGo outage must not block cleanup).
### 3.3 Users moving/organizing files inside SFTPGo (confirmed)

Users may "move/rename/organize" within SFTPGo (e.g. move weights from `jobs/` to `models/`):

- Prerequisite: the SFTPGo user's permissions allow `rename/mkdir/remove` etc. within their home directory (writable by default in v3.0).
- Behavior: users can move files under `jobs/` into `models/` or `datasets/` to keep weights/evaluation artifacts long-term.
- Relation to retention: once a file is moved out of `jobs/`, the jobs cleanup logic no longer deletes it.

### 3.4 Path permission rules (validated on the API side)

The v2.5 rule was "only `/private/common/...`".
v3.0 upgrades it to:

- Allowed:
  - `/private/common/...`
  - `/private/users/<current_user_id>/...`
- Forbidden:
  - any other absolute path (e.g. `/private/users/other/...`, `/etc/...`)

Apply the rule to (at least) these TaskSpec fields:

- `train_file` / `val_file`
- `code_path`: still `/private/common/...` only (v3.0 does not execute user code)
- local model path fields: allow `/private/users/<me>/models/...` (confirmed: allowed in v3.0)
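A minimal sketch of this field-level validation, assuming the field names from the spec (a real implementation should also normalize `..` segments, as the browse-path check does, before comparing prefixes):

```python
def check_taskspec_paths(spec: dict, user_id: str) -> list[str]:
    """Field-level v3.0 path rules (sketch): train_file/val_file may live
    under /private/common/ or the submitting user's own space; code_path
    stays common-only. Returns a list of violation messages."""
    common = "/private/common/"
    mine = f"/private/users/{user_id}/"
    errors = []
    for field in ("train_file", "val_file"):
        p = spec.get(field)
        if p and not (p.startswith(common) or p.startswith(mine)):
            errors.append(f"{field}: {p} not allowed")
    cp = spec.get("code_path")
    if cp and not cp.startswith(common):
        errors.append(f"code_path: {cp} must be under {common}")
    return errors
```

An empty return list means the spec passes; the API layer would map a non-empty list to the 400/403 responses the acceptance cases expect.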
## 4. SFTPGo design (Data Management)

### 4.1 Runtime form

Run SFTPGo in a container (decoupled from Ray/API), mounting the same `/private`:

- the `sftpgo` container mounts `../../shared:/private`
- exposed ports:
  - SFTP port (2022 suggested)
  - WebAdmin/API port (8081 suggested; internal/admin access only)

#### 4.1.1 Image source (off-the-shelf Docker image)

Ready-made SFTPGo Docker images exist (no need to build our own):

- v3.0 should prefer the official/upstream `sftpgo` image as the runtime base
- We do not need to customize SFTPGo code in v3.0; we only need to:
  - mount GPFS/NFS correctly (`/private` inside the container)
  - configure the admin account (used by the API server to create/disable users and reset passwords)
  - configure each user's home/chroot

> Note: the exact image name/tag may differ per environment (official/registry policies change). When deploying, run `docker search sftpgo` on `argus@h1` or pin a version in your internal registry; the v3.0 design only requires "use an off-the-shelf image" and does not depend on a specific tag.

#### 4.1.2 docker-compose service draft (illustrative)

A sketch follows (the final image name/tag and ports depend on your environment):

```yaml
services:
  sftpgo:
    image: sftpgo/sftpgo:latest  # example: off-the-shelf image
    container_name: argus-sftpgo
    ports:
      - "2022:2022"   # SFTP
      - "8081:8080"   # WebAdmin/API (internal/admin only)
    volumes:
      - ../../shared:/private
      - ../../shared/common/sftpgo:/var/lib/sftpgo  # persist SFTPGo metadata (optional/recommended)
    environment:
      # admin credentials (illustrative; exact variable names per the image docs)
      SFTPGO_ADMIN_USERNAME: "admin"
      SFTPGO_ADMIN_PASSWORD: "${SFTPGO_ADMIN_PASSWORD}"
```

Integration points with v3.0:

- The API server uses `data.sftpgo.admin_api_base` plus admin credentials to create users in lockstep
- User home/chroot uniformly points to `/private/users/<user_id>`

### 4.2 User isolation

Each user's SFTPGo home dir is bound to:

- `/private/users/<user_id>` (chroot); users can read/write only their own directory.

### 4.3 User creation and credential management (two options; start with A)

**Option A (recommended for v3.0): the API server creates SFTPGo users in lockstep**

- After a successful v2.5 `POST /api/v2/users`:
  - the API server calls the SFTPGo admin API to create the same-named user
  - sets home dir = `/private/users/<user_id>`
  - sets permissions (writable by default; read-only configurable)
- Authentication:
  - v3.0 minimum: username+password (confirmed: password first; the API generates a one-time password and the user must change it on first login)
  - or: SSH public key (the WebUI allows uploading a public key; the API writes it into SFTPGo)

**Option B (stronger but complex): SFTPGo external authentication**

- SFTPGo delegates authentication to the API server (token/SSO); SFTP also uses internal tokens.
- High complexity; skip in v3.0, revisit in v3.5 or later.
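Building the lockstep create-user call with `urllib` might look like the sketch below. The `/users` endpoint and payload fields follow the general shape of SFTPGo's admin REST API, but the exact field names and token-acquisition flow must be verified against the deployed SFTPGo version; the function only constructs the request (so it is testable without a network):

```python
import json
import urllib.request

def build_create_user_request(admin_api_base: str, token: str,
                              user_id: str, password: str) -> urllib.request.Request:
    """Build (not send) the SFTPGo create-user call for option A.
    Payload fields (username/password/home_dir/status/permissions) are
    assumptions to check against the SFTPGo admin API docs."""
    body = {
        "username": user_id,
        "password": password,
        "home_dir": f"/private/users/{user_id}",  # chroot per section 4.2
        "status": 1,
        "permissions": {"/": ["*"]},  # writable by default in v3.0
    }
    return urllib.request.Request(
        url=f"{admin_api_base}/users",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
```

Separating request construction from `urlopen` is what makes the planned `test_sftpgo_client_builds_correct_requests()`-style unit tests possible without mocking the network stack.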
### 4.4 User upload/download experience

Users upload via SFTP:

- `datasets/...` (training data)
- `models/...` (local models, optional)

Download:

- `jobs/<ray_submission_id>/...` (checkpoints/logs)

The WebUI/docs explain "how to write these paths into a TaskSpec".

## 5. WebUI design (minimal usable)

### 5.1 Target pages

The v3.0 WebUI uses "**multiple sub-pages + a sidebar nav**" rather than cramming everything into one page:

- Rationale: information density stays manageable, it extends well later (v3.5+), and no page becomes a "giant form/giant list".
- The implementation stays lightweight: server-side rendering (or static HTML + a little JS), no heavy frontend tooling.

Suggested information architecture (IA):

1) **Login page** (`/ui/login`)
   - The user pastes a token (issued by an admin); the browser stores it (localStorage/sessionStorage)
   - Provides "log out / clear token"
2) **Task list page** (`/ui/tasks`)
   - Default list: the most recent N tasks (by created_at, descending)
   - Filters: workload, state (QUEUED/RUNNING/SUCCEEDED/FAILED/CANCELED), time range
   - Quick actions: open details, cancel task
3) **New task page** (`/ui/tasks/new`)
   - Two modes (either works):
     - **Direct YAML submission**: upload/paste a TaskSpec YAML (cheapest to build)
     - **Form-generated YAML**: pick a workload, fill core fields (train/val/model/nnodes/gpus), preview the generated YAML, then submit
   - Redirect to the task detail page after submission
4) **Task detail page** (`/ui/tasks/{task_id}`)
   - Top: task_id, workload, state, created_at, updated_at, error_summary
   - Attempt card: latest attempt_no, ray_submission_id, ray_status, start/end
   - Actions: cancel (if not terminal), refresh status, copy path/ID
   - Links to the log page and an artifacts hint (SFTP paths)
5) **Task log page** (`/ui/tasks/{task_id}/logs`)
   - Default tail=2000; options 200/1000/5000
   - An "auto-refresh (every 3 to 5 s)" toggle (simple polling)
6) **Data page** (`/ui/data`)
   - Shows SFTP connection info (host/port/username)
   - Shows the user directory conventions:
     - home: `/private/users/<user_id>`
     - datasets: `/private/users/<user_id>/datasets`
     - models: `/private/users/<user_id>/models`
     - jobs: `/private/users/<user_id>/jobs`
     - trash/jobs: `/private/users/<user_id>/trash/jobs`
   - States the retention policy explicitly: jobs move to the trash 3 days after they end and are deleted 7 days later; move important files to `models/` or `datasets/`
7) **(Admin only) user management page** (`/ui/admin/users`; optional but valuable)
   - Create users, disable users, issue tokens, reset SFTP passwords (option A)

### 5.2 Page organization and navigation (suggested)

Sidebar navigation (regular user):

- Tasks (list)
- New Task (create)
- Data (SFTP/directory notes)

Admins additionally get:

- Admin / Users

### 5.3 Rough wireframe

A rough sketch follows (not the final UI; it only conveys information structure and layout):

```
┌──────────────────────────────────────────────────────────────────────┐
│ Argus MVP v3.0 [user: alice] │
├───────────────┬──────────────────────────────────────────────────────┤
│ Side Nav │ /ui/tasks │
│ │ │
│ • Tasks │ [Filter] workload=all state=all [Search task_id] │
│ • New Task │ │
│ • Data │ Task List │
│ • Admin(*) │ ┌────────────────────────────────────────────────┐ │
│ │ │ task_id workload state ... │ │
│ │ │ mvp2-alice-ppo-... ppo RUNNING ... │ │
│ │ │ mvp2-alice-sft-... sft SUCCEEDED... │ │
│ │ └────────────────────────────────────────────────┘ │
│ │ [View] [Cancel] │
└───────────────┴──────────────────────────────────────────────────────┘
```

Task detail page (sketch):

```
┌──────────────────────────────────────────────────────────────────────┐
│ /ui/tasks/{task_id} │
├──────────────────────────────────────────────────────────────────────┤
│ task_id: mvp2-alice-ppo-... state: RUNNING workload: ppo │
│ created_at: ... updated_at: ... │
│ error_summary: (empty) │
│ │
│ latest_attempt: a01 ray_submission_id: ...--a01 ray_status: RUNNING │
│ [Open Logs] [Cancel Task] [Refresh] │
│ │
│ Artifacts (SFTP paths): │
│ jobs/: /private/users/alice/jobs/<ray_submission_id>/ │
│ trash/: /private/users/alice/trash/jobs/<ray_submission_id>/ │
│ tip: move important files to /private/users/alice/models/ │
└──────────────────────────────────────────────────────────────────────┘
```

### 5.4 Technical trade-offs (suggestion: no Node build)

To keep deployment simple, implement the v3.0 WebUI as "server-side rendering + a little JS/HTMX" or "plain static HTML + fetch":

- The API server serves the static assets (FastAPI StaticFiles)
- Pages call the same-origin API, avoiding CORS and a complex frontend build chain

## 6. API extension design (overview)

v3.0 keeps `/api/v2/...` unchanged and adds incrementally:

- SFTPGo admin integration endpoints:
  - create/disable users in lockstep with SFTPGo
  - reset SFTP password / update SSH key
- User data endpoints (optional, minimal):
  - `/api/v2/me`: returns user_id and SFTP info (host/port/home)
  - `/api/v2/files`: browse/download only (uploads still go through SFTP)

Details in `specs/mvp/v3.0/v3.0_api.md`.

## 7. Configuration and deployment (new in v3.0)

Extend `configs/dev.yaml` with a `data` section (illustrative):

```yaml
data:
  shared_root: "/private"            # usually matches ray.shared_root
  user_root: "/private/users"        # root of user space
  allow_common_prefix: "/private/common/"
  allow_user_prefix_template: "/private/users/{user_id}/"

  sftpgo:
    enabled: true
    host: "127.0.0.1"
    sftp_port: 2022
    admin_api_base: "http://127.0.0.1:8081/api/v2"
    admin_user: "admin"
    admin_password_env: "SFTPGO_ADMIN_PASSWORD"  # readable only inside the head container

  retention:
    jobs_trash_after_days: 3
    jobs_purge_after_days: 7
    trash_root_template: "/private/users/{user_id}/trash/jobs"
    janitor_interval_s: 3600  # hourly scan (configurable)
```

## 8. Risks and mitigations

1) **Path escape / unauthorized reads**
   - Path prefixes must be validated at task submission
   - SFTPGo must chroot users to their home
2) **Large-upload stability**
   - Prefer SFTP (resumable transfers, better reliability)
3) **Lifecycle of user tokens vs. SFTP credentials**
   - tokens stay in the v2.5 SQLite
   - SFTP credentials should be independent (password/SSH key), with a reset flow
4) **GPFS/NFS permissions**
   - `/private/users/<user>` must be writable by SFTPGo and readable by workers

## 9. Confirmed decisions (from your feedback)

1) Users may upload and train on custom datasets: allowed (`/private/users/<u>/datasets/...`).
2) Users may upload and train on local model paths: allowed (`/private/users/<u>/models/...`).
3) v3.0 does not execute user-defined code (no user code path is injected into `PYTHONPATH`).
4) SFTPGo authentication: password first in v3.0.
5) WebUI: "simple, minimal necessary features" (token-paste login first).

## 10. Open questions (awaiting your decision)

(Confirmed) Who runs jobs cleanup: v3.0 uses the **API server's built-in janitor background thread**.
232  specs/mvp/v3.0/v3.0_dev_plan.md  (new file)
@@ -0,0 +1,232 @@

# MVP v3.0 Development Plan (TDD-driven)

This is the v3.0 **engineering plan**. It emphasizes "tests first, implementation second" (TDD) and splits each milestone into **independently verifiable** small loops.

Inputs:

- Roadmap: `specs/mvp/mvp_roadmap_v2.md`
- v3.0 design: `specs/mvp/v3.0/v3.0_design.md`
- v3.0 API: `specs/mvp/v3.0/v3.0_api.md`
- v3.0 acceptance: `specs/mvp/v3.0/v3.0_acceptance.md`
- Baseline: v2.5 (task queue + user mgmt + stateless ray pool + single-image node daemon)

Confirmed v3.0 constraints:

- User dataset paths allowed: `/private/users/<me>/datasets/...`
- User local model paths allowed: `/private/users/<me>/models/...`
- **No execution of user-defined code** (no user code injected into PYTHONPATH; `code_path` remains `/private/common/...` only)
- SFTPGo starts with the **password** approach (option A: the API creates/manages SFTPGo users in lockstep)
- jobs retention: **trash (trash/jobs) after 3 days, permanent deletion 7 days later**; no keep/extend flag
- janitor: **built-in background thread of the API server**; moves/deletes operate **directly on the filesystem** (no SFTPGo API dependency)

---

## 0. TDD rules (apply to all features)

### 0.1 Test layers

1) **Unit tests (fast)**
   - Pure Python logic: path policy, SFTPGo client, retention arithmetic, file move/delete policy (temporary directories).
   - No real Ray, no docker, no network.

2) **Component tests (medium)**
   - FastAPI routes (including WebUI routes): `fastapi.testclient.TestClient`
   - mock/stub the SFTPGo client and ray client

3) **End-to-end (slow)**
   - On `argus@h1` via docker compose + scripts:
     - the Ray cluster comes up automatically (head + 2 workers)
     - the SFTPGo service is available
     - upload data → submit training → download artifacts → jobs trash/cleanup

### 0.2 Code and test conventions

- Test directory: `src/mvp/py/tests/`
- New features must add tests first and see them fail before implementation (red)
- A minimal implementation turns them green (green)
- Then refactor (refactor)
- Coverage: keep the current threshold (>= 90%)

---

## 1. Milestones (v3.0 = 5 verifiable loops)

### M1: TaskSpec path policy upgrade (allow user datasets/models; code_path stays common-only)

**Goal**

- Upgrade path validation at API submit from v2.5's "only `/private/common/`" to:
  - `train_file` / `val_file`: allow `/private/common/...` and `/private/users/<me>/...`
  - local model paths: allow `/private/users/<me>/models/...` (no YAML structure change; see the implementation note)
  - `code_path`: still `/private/common/...` only
- Block cross-user paths (`/private/users/other/...`) and anything outside `/private/...`.

**Implementation note (no TaskSpec extension)**

- The `model_id` field stays unchanged:
  - if `model_id` starts with `/private/` → treat it as a local model path
  - otherwise treat it as a HuggingFace repo id (e.g. `Qwen/...`)

**TDD cases (written first)**

- Unit tests:
  - `test_paths_allow_common_and_own_user_prefix()`
  - `test_paths_reject_other_user_prefix()`
  - `test_model_id_local_path_allowed_only_under_users_models()`
  - `test_code_path_still_common_only()`
- API tests:
  - `test_submit_accepts_user_datasets_paths()`
  - `test_submit_rejects_cross_user_paths_404_or_400()` (returns 400/403 by convention)

**Acceptance**

- The class-D security isolation cases in `v3.0_acceptance.md` are covered by the API tests.
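The `model_id` convention above can be sketched as follows (the helper name is hypothetical; the real check lives in the API's submit path):

```python
def classify_model_id(model_id: str, user_id: str) -> str:
    """M1 rule sketch: a model_id beginning with /private/ is a local path
    and must sit under common/models/ or the user's own models/; anything
    else is treated as a HuggingFace repo id."""
    if not model_id.startswith("/private/"):
        return "hf_repo"
    allowed = ("/private/common/models/", f"/private/users/{user_id}/models/")
    if model_id.startswith(allowed):
        return "local_model"
    raise ValueError(f"local model path not allowed: {model_id}")
```

This keeps the TaskSpec YAML unchanged: the same field carries either a repo id or a local path, and the prefix decides which.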
---

### M2: SFTPGo integration (option A: lockstep user creation + password)

**Goal**

- Introduce `data management (SFTPGo)`:
  - creating a user as admin also creates the SFTPGo user (home=/private/users/<user_id>, chroot)
  - password mode: generate a one-time password (reset/create) and return it to the admin (plaintext returned once)
- Provide user self-service info:
  - `GET /api/v2/me` returns SFTP connection info, directory conventions, and retention hints.

**Implementation notes**

- Add `SFTPGoAdminClient` (synchronous calls):
  - via `urllib` or `httpx` (`urllib` preferred to minimize dependencies; do not hard-code a `requests` dependency)
  - supports: create user / disable user / reset password (the minimal set)
- The API server validates the config at startup (when enabled, the admin password env must be present).
- Create the user's directory layout on the filesystem at the same time:
  - `/private/users/<u>/{datasets,models,code,jobs,trash/jobs}` (minimal set)

**TDD cases (written first)**

- Unit tests:
  - `test_sftpgo_client_builds_correct_requests()` (no real network; mock urlopen)
  - `test_user_dirs_created_on_user_create()` (assert directories exist under a tmp dir)
- API tests:
  - `test_create_user_calls_sftpgo_client()` (stub client; assert call arguments)
  - `test_me_returns_sftp_info_and_paths()` (including trash/jobs and the TTL fields)

**Acceptance**

- Covers class A (users/credentials) and the prerequisites of class B (upload loop) in `v3.0_acceptance.md`.
---

### M3: WebUI (minimal usable, multi-page + sidebar)

**Goal**

- The WebUI is served by the API server (same origin, no extra CORS):
  - `/ui/login`: token-paste login (localStorage)
  - `/ui/tasks`: task list + filters (minimal)
  - `/ui/tasks/new`: YAML submission (first) + (optional) form-generated YAML
  - `/ui/tasks/{task_id}`: detail page
  - `/ui/tasks/{task_id}/logs`: log tail + optional auto-refresh
  - `/ui/data`: SFTP info + directory/retention hints
  - (optional) `/ui/admin/users`: admin user management (strongly recommended if time allows)

**Implementation notes**

- No Node build for now:
  - HTML templates via plain string assembly or Jinja2 (if jinja2 is introduced, add the dependency and tests)
  - pages call `/api/v2/...` via fetch and reuse the token header

**TDD cases (written first)**

- Component tests (TestClient):
  - `test_ui_routes_render_200()`
  - `test_ui_contains_sidebar_links()` (simple text assertions on nav links)
  - `test_ui_tasks_detail_shows_ids()` (contains task_id, state, ray_submission_id)

**Acceptance**

- The WebUI completes: login → create task → view task → view logs → see the data page hints.

---

### M4: Jobs Retention janitor (trash after 3 days, purge 7 days later)

**Goal**

- A built-in janitor background thread in the API server:
  - periodically scans terminal tasks in the DB
  - on expiry executes:
    - move: `/private/users/<u>/jobs/<sid>` → `/private/users/<u>/trash/jobs/<sid>`
    - purge: recursively delete `/private/users/<u>/trash/jobs/<sid>`
  - strict path validation throughout; out-of-bounds deletion forbidden
  - cleanup actions recorded in DB events (audit)

**Implementation notes (data and state)**

- Needs a stable time anchor and idempotency:
  - use attempts.end_time as the job end time (latest attempt)
  - add fields to the tasks table (or a new table) recording:
    - `trashed_at` (time of the first successful move)
    - `purged_at` (time of successful deletion)
    - `trash_path` (optional)
  - idempotent: reruns never fail (a missing directory counts as already handled)

**TDD cases (written first)**

- Unit tests (build jobs/trash directories under a tmpdir):
  - `test_janitor_moves_job_to_trash_after_threshold()`
  - `test_janitor_purges_trash_after_threshold()`
  - `test_janitor_never_touches_models_or_datasets()`
  - `test_janitor_path_escape_rejected()` (malicious paths cannot be deleted)
- API/component tests:
  - `test_me_includes_retention_fields()` (jobs_trash_after_days/jobs_purge_after_days)

**Acceptance**

- The C2 case in `v3.0_acceptance.md` is verifiable by "lowering the thresholds to minutes".

---

### M5: End-to-end (h1) — SFTP upload → training → artifact download → trash/cleanup

**Goal**

- Land a one-shot script (or runbook) on `argus@h1` that runs:
  1) `docker compose up -d` brings up Ray (head + 2 workers) + SFTPGo
  2) admin creates user alice (lockstep SFTPGo user + password)
  3) alice uploads via SFTP:
     - a dataset to `/private/users/alice/datasets/...`
     - (optional) a local model to `/private/users/alice/models/...`
  4) alice submits a task via API/WebUI referencing those paths
  5) after the task succeeds:
     - download logs/checkpoints from `jobs/<sid>`
     - move weights to `models/` and verify they are not cleaned up
  6) lower the retention settings and verify jobs → trash → purge

**Deliverables (suggested)**

- New scripts (example names):
  - `scripts/run_all_v30_api.sh`
  - `scripts/run_e2e_v30_cases.sh`
- A `sftpgo` service in `docker-compose.yaml` (or a `docker-compose.v30.yaml` overlay file)

**Acceptance**

- All MUST cases in `v3.0_acceptance.md` pass.

---

## 2. Risks and test focus

1) **Permissions and path escape**
   - The path policy must cover: train/val/model_id (local)/output dirs (jobs/trash)
   - Every delete/move must check path prefixes

2) **Concurrency and races**
   - The janitor handles terminal tasks only, never directories still being written to
   - moves use same-filesystem `os.replace` (atomic)

3) **SFTPGo availability**
   - SFTPGo being offline must not affect training or core API features (apart from the user-creation lockstep)
   - the janitor does not depend on SFTPGo (direct filesystem access)

---

## 3. Deliverables (code/config/scripts/docs)

### 3.1 Code

- Path policy (v3.0)
- SFTPGoAdminClient + lockstep user create/disable/reset-password
- `/api/v2/me` extension (SFTP/directories/retention)
- WebUI routes and static assets
- janitor (trash+purge) background thread + DB records

### 3.2 Config

- `configs/dev.yaml` gains `data.sftpgo` and `data.retention` sections (see the design doc)

### 3.3 scripts / compose

- compose gains `sftpgo` (or a new overlay compose file)
- v3.0 e2e scripts (upload/download/cleanup verification)

### 3.4 Docs

- Update `specs/mvp/v3.0/*` and `src/mvp/README.md` (how to run, path conventions, SFTP usage, retention explained)
154
specs/mvp/v3.0/v3.0_progress.md
Normal file
154
specs/mvp/v3.0/v3.0_progress.md
Normal file
@ -0,0 +1,154 @@
|
||||
# MVP v3.0 进展记录(milestone log)
|
||||
|
||||
本文档用于记录 v3.0 按 `specs/mvp/v3.0/v3.0_dev_plan.md` 实施过程中的里程碑完成情况。
|
||||
约定:每完成一个里程碑,追加一条记录,包含**日期**、**完成内容**、**涉及文件**、**验证方式/结果**、**待办/风险**。
|
||||
|
||||
---
|
||||
|
||||
## M1:Path policy + tests(已完成)
|
||||
|
||||
- 日期:2025-12-30
|
||||
- 范围:按 v3.0 路径策略升级 API submit 的路径校验(不扩展 TaskSpec YAML 结构)。
|
||||
- 完成内容:
|
||||
- `code_path`:仍只允许 `/private/common/...`(v3.0 不执行 user code)。
|
||||
- `train_file`/`val_file`:允许 `/private/common/datasets/...` 或 `/private/users/<me>/datasets/...`。
|
||||
- `model_id`:若以 `/private/` 开头则视为本地路径,仅允许:
|
||||
- `/private/common/models/...` 或
|
||||
- `/private/users/<me>/models/...`
|
||||
否则仍按 HuggingFace repo id(如 `Qwen/...`)处理。
|
||||
- 拒绝跨用户路径(例如 `bob` 提交 `/private/users/alice/datasets/...`)。
|
||||
- 拒绝本地模型路径不在 `models/`(例如指向 `jobs/`)。
|
||||
- 涉及文件:
|
||||
- `src/mvp/py/argus/service/app.py`
|
||||
- `src/mvp/py/tests/test_users.py`
|
||||
- 验证方式与结果:
|
||||
- 本地单测:`.venv/bin/python -m pytest -q`
|
||||
- 结果:全部通过(`54 passed`),覆盖率阈值保持 `>= 90%`。
|
||||
- 待办/风险:
|
||||
- `model_id=/private/...` 的“本地模型路径语义”需要在用户文档/WebUI 中明确提示(避免误用)。
|
||||
- 后续 M2/M3 需要把该路径策略同步到 UI 表单/提示文本(避免用户填错路径)。
|
||||
|
||||
---
|
||||
|
||||
## M2:SFTPGo 集成(方案 A:用户联动创建 + password)(已完成)
|
||||
|
||||
- 日期:2025-12-30
|
||||
- 范围:SFTPGo(Data Management)最小集成 + 用户自助信息 `/api/v2/me` + 用户目录结构落盘。
|
||||
- 完成内容:
|
||||
- 新增 `data` 配置段:
|
||||
- `data.user_root`:用户数据根目录(默认 `/private/users`)
|
||||
- `data.sftpgo`:SFTPGo 可选联动(enabled/host/sftp_port/admin_api_base/admin_user/admin_password_env)
|
||||
- `data.retention`:jobs 过期策略配置(3 天移入 trash,7 天 purge;janitor 在 M4 实现)
|
||||
- 新增 `SFTPGoAdminClient`(`urllib` 实现,不使用 `requests`):
|
||||
- `create_user` / `disable_user` / `reset_password`(最小集合)
|
||||
- API server 增强:
|
||||
- `POST /api/v2/users`:创建 DB user + 同步创建目录结构(`datasets/models/code/jobs/trash/jobs`)
|
||||
- 当 `data.sftpgo.enabled=true` 时,创建用户会联动调用 SFTPGo admin API,并返回一次性密码(明文仅返回一次,服务端不保存)
|
||||
- `POST /api/v2/users/{user_id}:disable`:禁用用户(SFTPGo 禁用 best-effort)
|
||||
- `POST /api/v2/users/{user_id}/sftp:reset_password`:管理员重置一次性密码(SFTPGo enabled 才允许)
|
||||
- `GET /api/v2/me`:返回当前用户的目录约定、retention 提示,以及(可选)SFTP 连接信息
|
||||
- 同步更新 `src/mvp/configs/dev.yaml`:补齐 v3.0 相关 `data.*` 配置(默认关闭 sftpgo)。
|
||||
- 涉及文件:
|
||||
- `src/mvp/py/argus/service/config.py`
|
||||
- `src/mvp/py/argus/service/sftpgo.py`
|
||||
- `src/mvp/py/argus/service/app.py`
|
||||
- `src/mvp/py/tests/test_sftpgo.py`
|
||||
- `src/mvp/py/tests/test_users.py`
|
||||
- `src/mvp/py/tests/test_app.py`
|
||||
- `src/mvp/py/tests/test_service_config.py`
|
||||
- `src/mvp/configs/dev.yaml`
|
||||
- `specs/mvp/v3.0/v3.0_api.md`
|
||||
- 验证方式与结果:
|
||||
- 本地单测:`.venv/bin/python -m pytest -q`
|
||||
- 结果:全部通过(`62 passed`),覆盖率 `90.11%`(阈值 `>= 90%`)。
|
||||
- 待办/风险:
|
||||
- M2 仅做了“API 侧联动 + 单测”,未在真实 SFTPGo 容器上端到端验证(按计划在 M5 完成)。
|
||||
- 目录创建依赖文件系统权限:生产部署时需确保 API/head 容器对 `/private/users` 可写。
|
||||
|
||||
---
|
||||
|
||||
## M3: WebUI (minimal, multi-page + sidebar) — done

- Date: 2025-12-30
- Scope: minimal WebUI served by the API server (same origin, no Node build step) for login, task submission/inspection, log viewing, and data info.
- Completed:
  - New UI routes (HTML plus a little JS):
    - `/ui` (redirects to tasks)
    - `/ui/login`: paste a token; stored in browser localStorage (key = `mvp_token`)
    - `/ui/tasks`: task queue list (calls `/api/v2/queue`)
    - `/ui/tasks/new`: submit a TaskSpec YAML (POST `/api/v2/tasks`)
    - `/ui/tasks/{task_id}`: task detail (GET `/api/v2/tasks/{task_id}`, supports cancel)
    - `/ui/tasks/{task_id}/logs`: log viewer (GET `/api/v2/tasks/{task_id}/logs`, optional auto-refresh)
    - `/ui/data`: shows the paths/SFTP/retention info returned by `/api/v2/me`
  - Shared sidebar navigation: Tasks / New Task / Data / Login.
  - No server-side sessions: every API call from the browser carries `Authorization: Bearer <token>` (injected from localStorage).
- Files touched:
  - `src/mvp/py/argus/service/ui.py`
  - `src/mvp/py/argus/service/app.py`
  - `src/mvp/py/tests/test_ui.py`
- Verification:
  - Local unit tests: `.venv/bin/python -m pytest -q`
  - Result: all green (`65 passed`), coverage `90.53%` (threshold `>= 90%`).
- Open items / risks:
  - The WebUI is a skeleton driven by the API; no rich interactions or large-file downloads — upload/download stays on SFTP by design.
  - Starlette's TestClient emits a deprecation warning for `allow_redirects` (harmless; clean up later).

---

## M4: Jobs retention janitor (trash after 3 days, purge after 7) — done

- Date: 2025-12-30
- Scope: a background thread inside the API server applies the retention policy to job directories of finished attempts (direct filesystem access, no SFTPGo dependency).
- Completed:
  - New `JobsJanitor`:
    - TTL is computed from `attempts.end_time` (the clock starts when the job ends)
    - `>= 3 days && < 7 days`: move the directory from `.../jobs/<ray_submission_id>` to `.../trash/jobs/<ray_submission_id>`
    - `>= 7 days`: ensure the directory has entered trash, then delete it (`shutil.rmtree`)
    - Missing directories and failed moves/deletes are handled best-effort (never blocks the main service)
  - DB enhancement: new query `list_ended_attempts_before()` lets the janitor scan candidate attempts.
  - The API server starts the janitor thread on startup (interval via `data.retention.janitor_interval_s`; `<= 0` disables it).
- Files touched:
  - `src/mvp/py/argus/service/janitor.py`
  - `src/mvp/py/argus/service/db.py`
  - `src/mvp/py/argus/service/app.py`
  - `src/mvp/py/tests/test_janitor.py`
- Verification:
  - Local unit tests: `.venv/bin/python -m pytest -q`
  - Result: all green (`75 passed`), coverage `90.72%` (threshold `>= 90%`).
- Open items / risks:
  - M4 delivers logic + unit tests only; permissions on the real `/private/users/...` and behavior on `argus@h1` are verified end-to-end in M5.

---

## M5: End-to-end (h1) — SFTPGo compose + v3.0 E2E scripts (done: scripts/config delivered)

- Date: 2025-12-30
- Scope: the compose service, config, and one-shot scripts needed for the h1 end-to-end run (the actual run and acceptance happen on `argus@h1`).
- Completed:
  - SFTPGo wired into `docker compose`:
    - New `argus-sftpgo` service (SFTP on 2022; admin API/UI on 8080, mapped to host 8081 to avoid clashing with the MVP API on 8080)
    - Mounts `../../shared:/private` like the other services and persists metadata under `../../shared/common/sftpgo`
  - `SFTPGoAdminClient` finalized against the upstream OpenAPI:
    - `GET /api/v2/token` (BasicAuth) to obtain an admin token
    - `POST /api/v2/users` to create users (with `permissions: {"/":["*"]}`)
    - `PUT /api/v2/users/{username}` to disable a user or reset a password
  - New v3.0 dev config `configs/dev_v30.yaml` (enables `data.sftpgo` with `admin_api_base=http://argus-sftpgo:8080/api/v2`)
  - New v3.0 one-shot scripts:
    - `scripts/run_all_v30_api.sh`: brings up Ray + SFTPGo, starts the API, creates a user, and submits PPO/GRPO/SFT jobs that reference user dataset paths
    - `scripts/run_e2e_v30_cases.sh`: minimal E2E runner (HP-1)
  - `scripts/60_start_api.sh` now forwards `SFTPGO_ADMIN_PASSWORD` into the API process inside the head container.
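The token handshake listed above can be sketched with `urllib` only, mirroring the client's no-`requests` constraint (the helper name and shape here are illustrative, not the actual `SFTPGoAdminClient` code):

```python
import base64
import urllib.request


def admin_token_request(admin_api_base: str, admin_user: str,
                        admin_password: str) -> urllib.request.Request:
    """Build the BasicAuth request for GET {admin_api_base}/token.

    The caller would urlopen() this and read {"access_token": ...} from the
    JSON body, then send that token as a Bearer header on user CRUD calls.
    """
    creds = base64.b64encode(f"{admin_user}:{admin_password}".encode()).decode()
    req = urllib.request.Request(f"{admin_api_base.rstrip('/')}/token")
    req.add_header("Authorization", f"Basic {creds}")
    return req
```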
- Files touched:
  - `src/mvp/docker-compose.yaml`
  - `src/mvp/configs/dev_v30.yaml`
  - `src/mvp/scripts/run_all_v30_api.sh`
  - `src/mvp/scripts/run_e2e_v30_cases.sh`
  - `src/mvp/scripts/60_start_api.sh`
  - `src/mvp/py/argus/service/sftpgo.py`
  - `src/mvp/py/tests/test_sftpgo.py`
  - `src/mvp/README.md`
  - `specs/mvp/v3.0/v3.0_api.md`
- Verification:
  - Local unit tests: `.venv/bin/python -m pytest -q`
  - Result: all green (`75 passed`), coverage `90.35%` (threshold `>= 90%`).
- Open items / risks:
  - `scripts/run_all_v30_api.sh` still has to be run on `argus@h1` for the real SFTP upload/download and retention acceptance (per `v3.0_acceptance.md`).
166 specs/mvp/v3.0/v3.0_summary.md Normal file
@@ -0,0 +1,166 @@
# MVP v3.0 Iteration Summary (Ray + SFTPGo + API + WebUI)

This document summarizes what the v3.0 iteration finally shipped — features, architecture, how to run it, acceptance points, and known limitations — for review, handover, and further iteration.

More detailed documents:
- `specs/mvp/v3.0/v3.0_design.md`
- `specs/mvp/v3.0/v3.0_api.md`
- `specs/mvp/v3.0/v3.0_dev_plan.md`
- `specs/mvp/v3.0/v3.0_acceptance.md`
- `specs/mvp/v3.0/v3.0_progress.md`

---

## 1. Goals and scope

v3.0 is the first "releasable" minimal loop. It mainly adds:
- **WebUI**: a minimal human interface (login, task submission and inspection, data entry point, admin entry point).
- **User management**: an internal-token user system (admin plus regular users) supporting user creation and token issuance.
- **Data management via SFTPGo**: users upload and download their own data over SFTP/WebClient; read-only shared data/cache directories (common) are also exposed for reuse.
- **The training loop stays intact**: jobs are still submitted to the cluster as Ray Jobs (all three workloads — PPO/GRPO/SFT — verified).

Explicitly out of scope (kept minimal this iteration):
- No user-supplied training code (`code_path` in the TaskSpec is pinned to the verl snapshot strategy under common).
- No advanced queueing optimization, multi-cluster, or multi-tenant isolation policies (isolation today is mostly at the per-user jobs directory level).

---

## 2. System architecture (final form)

Core components:
- **Ray cluster (containers)**
  - `argus-ray-head`: head node (no GPU, runs no training); hosts the Ray Dashboard and Job Server.
  - `argus-ray-worker-0/1`: worker nodes (GPU) that run the training jobs.
  - Workers join the cluster as "stateless + watchdog auto-connect to head".
- **API server (inside the head container)**
  - Reads the YAML config (dev/prod), keeps the task queue in sqlite, and periodically schedules tasks onto Ray.
  - Also serves the WebUI (`/ui`).
- **SFTPGo (container)**
  - Provides SFTP (port `2022`) and the Web Client/Admin (port `8081`, mapped from container port 8080).
  - A user's home is `/private/users/<user>`, writable by default.
  - Additionally exposes read-only `/common/*` entry points (see section 4).
- **Shared storage (NFS/GPFS etc., mounted at `/private` inside the containers)**
  - `/private/common`: shared caches (hf, datasets, models, db, logs, ...).
  - `/private/users/<user>`: per-user directories (jobs/datasets/models/code/trash, ...).

---

## 3. Tasks and scheduling (Task / Ray Job)

### 3.1 Task (platform concept)
- Users submit a TaskSpec (YAML) to the API; the platform assigns a readable `task_id` that embeds the username.
- The `task_id` drives the internal state machine and retry logic; each underlying Ray Job submission produces an attempt and a `ray_submission_id`.

### 3.2 Ray Job (Ray concept)
- The training driver runs as a Ray Job on a cluster worker (so the head never hosts training).
- The head is kept out of scheduling via `--num-cpus=0` / custom-resource strategies.

### 3.3 Handling VERL's resource pre-check
- VERL fail-fasts when creating its resource pool (e.g. "not enough available GPUs" aborts immediately).
- v3.0 keeps the v2.x strategy: the server classifies the failure and retries/backs off per policy (see the scheduler implementation and the v2.5/v3.0 docs).

---

## 4. Data management (SFTPGo) and the read-only common directories

### 4.1 User directories (read-write)
- Users reach their own home over SFTP/WebClient: `/private/users/<user>`
- Layout (at least): `datasets/ models/ code/ jobs/ trash/ common/`

### 4.2 Read-only common (scheme A: virtual folders)
This iteration uses SFTPGo virtual folders plus per-path permission overrides, so users can read the shared directories but not write them.

Exposed externally as:
- `/common/datasets` (read-only)
  - **mapped_path points at the real directory `/private/datasets`** (this avoids the WebClient "permission denied / escaped root" errors caused by the many symlinks inside `/private/common/datasets`)
- `/common/hf` (read-only)
  - mapped_path points at `/private/hf`

Note:
- `/private/common/datasets` contains symlinks (e.g. `gsm8k -> /private/datasets/gsm8k`). If a virtual folder maps onto that symlink root, SFTPGo treats following a symlink as escaping the folder root and returns permission denied on click-through, so the virtual folders map straight onto the real directory roots instead.
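As a sketch, the SFTPGo user object this scheme implies looks roughly like the following (field names follow SFTPGo's REST user schema — `home_dir`, per-path `permissions`, `virtual_folders` — but the exact payload our client sends lives in `sftpgo.py`):

```python
def sftpgo_user_payload(user_id: str, password: str,
                        user_root: str = "/private/users") -> dict:
    """Build a SFTPGo user with a writable home and read-only /common mounts.

    Illustrative only: mirrors the scheme described above, not the actual
    SFTPGoAdminClient request body.
    """
    home = f"{user_root}/{user_id}"
    return {
        "username": user_id,
        "password": password,
        "status": 1,  # enabled
        "home_dir": home,
        "permissions": {
            "/": ["*"],                                # full access in the home
            "/common/datasets": ["list", "download"],  # read-only overrides
            "/common/hf": ["list", "download"],
        },
        "virtual_folders": [
            # mapped_path points at the *real* directories, not symlink roots.
            {"name": "common-datasets", "mapped_path": "/private/datasets",
             "virtual_path": "/common/datasets"},
            {"name": "common-hf", "mapped_path": "/private/hf",
             "virtual_path": "/common/hf"},
        ],
    }
```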
---

## 5. WebUI (minimal)

Entry points:
- `/ui/login`: paste a token (stored in browser `localStorage`)
- `/ui/tasks`: task lists (Running/Pending/Completed; Completed is paginated)
- `/ui/tasks/new`: submit a task (one-click templates for PPO/GRPO/SFT)
- `/ui/data`: shows the current username; resets the SFTPGo password with copy-to-clipboard; links to the SFTPGo WebClient; notes on clients such as FileZilla
- `/ui/admin`: admin area (create users, issue tokens, list users)
- The nav bar links to the Ray Dashboard (`:8265` on the current IP)

On admin page authorization:
- The admin page itself is reachable, but its data requests must carry an admin token; otherwise 401/403 errors are shown inline (so content appears only after an admin token is provided).

---

## 6. API (new/strengthened in v3.0)

Core endpoints (excerpt):
- Auth:
  - Bearer token: `MVP_INTERNAL_TOKEN` (admin) or a user token issued by an admin
- User management (admin):
  - `POST /api/v2/users` — create a user (and initialize the user directories)
  - `GET /api/v2/users` — list users (includes latest token, created/updated timestamps, ...)
  - `POST /api/v2/users/{user_id}/tokens` — issue a user token
- Tasks:
  - `POST /api/v2/tasks` — submit a TaskSpec (YAML)
  - `GET /api/v2/tasks` — list tasks (supports states/limit/offset, used for Completed pagination)
  - `GET /api/v2/tasks/{task_id}`, `POST /api/v2/tasks/{task_id}:cancel`, `GET /api/v2/tasks/{task_id}/logs`
  - `GET /api/v2/queue` — running/pending overview
- Data/SFTP:
  - `GET /api/v2/me` — returns user path info and SFTP connection info; best-effort reconciles the SFTPGo user config
  - `POST /api/v2/me/sftp:reset_password` — self-service SFTPGo password reset (plaintext returned once)

Security trade-off (intranet/dev first):
- To let the admin WebUI display and copy tokens, the database persists `token_plain` (the plaintext token).
- This is normally discouraged in production; a future change could show the token only once, or offer reset/re-issue without echoing plaintext.
---

## 7. Persistence and cleanup

- Task queue: sqlite (WAL mode)
- SFTPGo: its own sqlite DB (persisted via the mounted metadata directory)
- Jobs directory retention (server-side janitor):
  - 3 days after a job ends, its directory moves to the trash directory
  - trash entries are purged 7 days after the job ended (per `jobs_purge_after_days`)

---

## 8. How to run (scripts)

Dev/acceptance scripts:
- `src/mvp/scripts/run_all_v30_api.sh`: end-to-end bring-up (Ray + SFTPGo + API); submits PPO/GRPO/SFT via the API, waits for completion, and checks the results
- Other scripts start/stop the API, prepare data and models, probe service readiness, etc. (see the scripts directory and README)

Typical end-to-end (example parameters):
- `MVP_INTERNAL_TOKEN=my-dev-token`
- `SFTPGO_ADMIN_PASSWORD=my-dev-sftpgo-admin`
- `RESET_DB/RESET_SFTPGO` are supported for resetting test environments

---

## 9. Verification results (passing)

End-to-end verification completed on `argus@h1`:
- Ray cluster up (head + 2 workers)
- API server + WebUI up
- SFTPGo (admin + regular user) up
- PPO/GRPO/SFT submitted back-to-back through the API, all finishing SUCCEEDED
- Users can log into the SFTPGo WebClient/SFTP, reach their own directory, and read the `/common/datasets` and `/common/hf` content

Local unit tests also pass:
- pytest all green
- coverage threshold >= 90%

---

## 10. Known limitations and follow-ups

- The WebUI is still minimal; interactions and permission messaging are engineering-grade rather than product-grade (richer error surfacing, search/filtering, aggregated task detail can come later).
- Plaintext token persistence only fits intranet/dev scenarios; production should show tokens once or support revocation/rotation.
- SFTPGo still carries some legacy virtual-folder mappings (e.g. a stale `/common/models` may remain); a one-shot cleanup/migration can go into a future upgrade script.

@@ -16,3 +16,11 @@
- CLI submission flow: `scripts/run_all_cli.sh`
- API submission flow: `scripts/run_all_api.sh`
- v2.5 (stateless workers + per-user jobs isolation) E2E: `scripts/run_all_v25_api.sh`
- v3.0 (WebUI + SFTPGo + user datasets/models + jobs retention) E2E: `scripts/run_all_v30_api.sh`

v3.0 entry points (dev/h1):
- WebUI: `http://127.0.0.1:8080/ui`
- Ray Dashboard: `http://127.0.0.1:8265`
- SFTPGo:
  - SFTP: `127.0.0.1:2022`
  - Admin API/UI: `http://127.0.0.1:8081` (container port 8080, mapped to host 8081 to avoid the API server)

@@ -32,3 +32,24 @@ service:
     retry_interval_s: 60
     max_running_tasks: 1

 # v3.0: user data management (filesystem + SFTPGo)
 data:
   # All user writable data is placed under this root:
   #   /private/users/<user_id>/{datasets,models,code,jobs,trash/jobs}
   user_root: "/private/users"

   # SFTPGo is optional in dev; when enabled, admin endpoints will call the SFTPGo admin API.
   # The admin password is provided via the env var named by `data.sftpgo.admin_password_env`.
   sftpgo:
     enabled: false
     host: ""              # shown to users via GET /api/v2/me
     sftp_port: 2022
     admin_api_base: ""    # e.g. http://argus-sftpgo:8080
     admin_user: "admin"
     admin_password_env: "SFTPGO_ADMIN_PASSWORD"

   # Jobs retention policy (v3.0 janitor): move to trash after 3d, purge after 7d.
   retention:
     jobs_trash_after_days: 3
     jobs_purge_after_days: 7
     janitor_interval_s: 3600

51 src/mvp/configs/dev_v30.yaml Normal file
@@ -0,0 +1,51 @@
ray:
  # Ray Job server address (as seen from inside the head container)
  address: "http://127.0.0.1:8265"

  # Shared root path (uniformly /private inside containers, matching prod)
  shared_root: "/private"

  # Force drivers onto workers (the head runs no training)
  entrypoint_num_cpus: 1
  entrypoint_resources:
    worker_node: 1

  # runtime env shared by all jobs
  runtime_env:
    env_vars:
      HF_ENDPOINT: "https://hf-mirror.com"
      PYTHONUNBUFFERED: "1"

  # v3.0 does not yet support user code execution
  user_code_path: "/private/user/code"

service:
  api:
    host: "0.0.0.0"
    port: 8080
  auth:
    token_env: "MVP_INTERNAL_TOKEN"
  sqlite:
    db_path: "/private/common/db/mvp.sqlite3"
  scheduler:
    tick_s: 5
    retry_interval_s: 60
    max_running_tasks: 1

data:
  user_root: "/private/users"
  sftpgo:
    enabled: true
    # Returned to users by GET /api/v2/me. For h1 E2E, usually connect to the host IP.
    host: "127.0.0.1"
    sftp_port: 2022
    # Admin API base should include /api/v2 (SFTPGo OpenAPI server base).
    # From the head container, reach SFTPGo by service name on the compose network.
    admin_api_base: "http://argus-sftpgo:8080/api/v2"
    admin_user: "admin"
    admin_password_env: "SFTPGO_ADMIN_PASSWORD"
  retention:
    jobs_trash_after_days: 3
    jobs_purge_after_days: 7
    janitor_interval_s: 3600

@@ -38,6 +38,36 @@ services:
       HF_ENDPOINT: "https://hf-mirror.com"
       PYTHONUNBUFFERED: "1"

  # v3.0: Data management service (SFTPGo).
  # - SFTP: 2022
  # - Admin API/UI: 8080 (mapped to host 8081 to avoid collision with MVP API server on 8080)
  #
  # NOTE: This is for dev / h1 E2E. In prod you may use an internal mirror image/tag and different ports.
  sftpgo:
    image: drakkan/sftpgo:latest
    container_name: argus-sftpgo
    user: "0:0"
    ports:
      - "2022:2022"
      - "8081:8080"
    volumes:
      - ../../shared:/private
      - ../../shared/common/sftpgo:/var/lib/sftpgo
    networks:
      - argus-ray-net
    environment:
      # Create a default admin on first start (used by API server to manage users).
      # Override on host as needed:
      #   export SFTPGO_ADMIN_PASSWORD=...
      SFTPGO_DATA_PROVIDER__CREATE_DEFAULT_ADMIN: "true"
      # Persist the sqlite DB under the mounted metadata dir; otherwise it defaults to a relative path.
      SFTPGO_DATA_PROVIDER__NAME: "/var/lib/sftpgo/sftpgo.db"
      SFTPGO_DEFAULT_ADMIN_USERNAME: "admin"
      SFTPGO_DEFAULT_ADMIN_PASSWORD: "${SFTPGO_ADMIN_PASSWORD:-my-dev-sftpgo-admin}"
      # Explicitly pin default ports via env-var schema (double-underscore for nesting).
      SFTPGO_HTTPD__BINDINGS__0__PORT: "8080"
      SFTPGO_SFTPD__BINDINGS__0__PORT: "2022"

  ray_worker_0:
    image: argus/argus-ray-node:v2.5
    container_name: argus-ray-worker-0

@@ -1,6 +1,7 @@
from __future__ import annotations

import os
import secrets
import threading
from typing import Any

@@ -12,7 +13,10 @@ from argus.ray.models import JobSpec, RayConfig

from .config import V2Config
from .db import Db
from .janitor import JobsJanitor
from .scheduler import Scheduler
from .sftpgo import SFTPGoAdminClient, SFTPGoError
from .ui import register_ui_routes


def _utc_now_iso() -> str:
@@ -38,11 +42,48 @@ def create_app(config_path: str) -> FastAPI:
    db.init()

    scheduler = Scheduler(db=db, ray_cfg=ray_cfg, v2_cfg=v2_cfg)
    janitor = JobsJanitor(
        db=db,
        user_root=v2_cfg.data.user_root,
        trash_after_days=v2_cfg.data.retention.jobs_trash_after_days,
        purge_after_days=v2_cfg.data.retention.jobs_purge_after_days,
        interval_s=v2_cfg.data.retention.janitor_interval_s,
    )
    stop_flag = threading.Event()
    tool = scheduler.tool

    app = FastAPI(title="mvp-v2", version="2.0")

    def _user_home(user_id: str) -> str:
        base = v2_cfg.data.user_root.rstrip("/")
        return f"{base}/{user_id}"

    def _ensure_user_dirs(user_id: str) -> None:
        home = _user_home(user_id)
        # Note: create a physical /common directory under each user's home so that
        # SFTPGo WebClient shows a "common" entry at the root. The actual shared
        # content is exposed via SFTPGo virtual folders under /common/{datasets,models}.
        for rel in ("datasets", "models", "code", "jobs", "trash/jobs", "common"):
            os.makedirs(f"{home}/{rel}", exist_ok=True)

    def _sftpgo_enabled() -> bool:
        return bool(v2_cfg.data.sftpgo.enabled)

    def _sftpgo_client() -> SFTPGoAdminClient:
        cfg = v2_cfg.data.sftpgo
        if not cfg.admin_api_base:
            raise HTTPException(status_code=500, detail="sftpgo enabled but data.sftpgo.admin_api_base is empty")
        pw = os.environ.get(cfg.admin_password_env, "")
        if not pw:
            raise HTTPException(status_code=500, detail=f"missing env: {cfg.admin_password_env}")
        shared_root = ray_cfg.shared_root.rstrip("/")
        return SFTPGoAdminClient(
            admin_api_base=str(cfg.admin_api_base),
            admin_user=str(cfg.admin_user),
            admin_password=pw,
            common_root=f"{shared_root}/common",
        )

    def _auth(req: Request) -> dict[str, Any]:
        token_env = v2_cfg.auth.token_env
        admin_token = os.environ.get(token_env, "")
@@ -74,6 +115,9 @@ def create_app(config_path: str) -> FastAPI:
    def _startup() -> None:
        t = threading.Thread(target=scheduler.run_forever, args=(stop_flag,), daemon=True)
        t.start()
        if int(janitor.interval_s) > 0:
            tj = threading.Thread(target=janitor.run_forever, args=(stop_flag,), daemon=True)
            tj.start()

    @app.on_event("shutdown")
    def _shutdown() -> None:
@@ -93,7 +137,42 @@ def create_app(config_path: str) -> FastAPI:
            row = db.create_user(user_id=user_id, display_name=str(display_name) if display_name is not None else None)
        except Exception as e:
            raise HTTPException(status_code=409, detail=f"user create failed: {e!r}")
-       return {"user_id": row.get("user_id", user_id), "state": row.get("state", "ACTIVE")}

        # v3.0: create user dir structure (datasets/models/code/jobs/trash).
        try:
            _ensure_user_dirs(user_id)
        except Exception as e:
            raise HTTPException(status_code=500, detail=f"failed to create user dirs: {e!r}")

        out: dict[str, Any] = {"user_id": row.get("user_id", user_id), "state": row.get("state", "ACTIVE")}

        # v3.0: scheme A (password) SFTPGo integration.
        if _sftpgo_enabled():
            pw = secrets.token_urlsafe(12)
            try:
                _sftpgo_client().create_user(username=user_id, password=pw, home_dir=_user_home(user_id))
            except SFTPGoError as e:
                # Make create user idempotent for retries:
                # If the user already exists in SFTPGo, reset password and enable it.
                if "http error: 409" in str(e):
                    try:
                        _sftpgo_client().reset_password(username=user_id, new_password=pw, home_dir=_user_home(user_id))
                        _sftpgo_client().enable_user(username=user_id, home_dir=_user_home(user_id))
                    except Exception as e2:
                        raise HTTPException(status_code=502, detail=f"sftpgo upsert user failed: {e2!r}")
                else:
                    raise HTTPException(status_code=502, detail=f"sftpgo create user failed: {e!r}")
            except HTTPException:
                raise
            except Exception as e:
                raise HTTPException(status_code=502, detail=f"sftpgo create user failed: {e!r}")
            out["sftp"] = {
                "username": user_id,
                "password": pw,  # one-time return to admin; do not persist plaintext
                "host": v2_cfg.data.sftpgo.host,
                "port": int(v2_cfg.data.sftpgo.sftp_port),
            }
        return out

    @app.post("/api/v2/users/{user_id}/tokens")
    async def issue_token(user_id: str, req: Request) -> dict[str, Any]:
@@ -104,6 +183,12 @@ def create_app(config_path: str) -> FastAPI:
        token = db.issue_token(user_id=user_id)
        return {"user_id": user_id, "token": token}

    @app.get("/api/v2/users")
    async def list_users(req: Request, limit: int = 200) -> dict[str, Any]:
        _require_admin(req)
        lim = max(1, min(int(limit), 1000))
        return {"users": db.list_users(limit=lim), "limit": lim}

    @app.post("/api/v2/users/{user_id}:disable")
    async def disable_user(user_id: str, req: Request) -> dict[str, Any]:
        _require_admin(req)
@@ -111,8 +196,122 @@ def create_app(config_path: str) -> FastAPI:
        if not u:
            raise HTTPException(status_code=404, detail="user not found")
        db.disable_user(user_id=user_id)
        if _sftpgo_enabled():
            try:
                _sftpgo_client().disable_user(username=user_id, home_dir=_user_home(user_id))
            except Exception:
                # Best-effort: DB state is source of truth for API auth; SFTPGo sync can lag.
                pass
        return {"user_id": user_id, "state": "DISABLED"}

    @app.post("/api/v2/users/{user_id}/sftp:reset_password")
    async def reset_sftp_password(user_id: str, req: Request) -> dict[str, Any]:
        _require_admin(req)
        u = db.get_user(user_id)
        if not u:
            raise HTTPException(status_code=404, detail="user not found")
        if not _sftpgo_enabled():
            raise HTTPException(status_code=400, detail="sftpgo is not enabled")
        pw = secrets.token_urlsafe(12)
        try:
            _sftpgo_client().reset_password(username=user_id, new_password=pw, home_dir=_user_home(user_id))
        except HTTPException:
            raise
        except Exception as e:
            raise HTTPException(status_code=502, detail=f"sftpgo reset password failed: {e!r}")
        return {"user_id": user_id, "password": pw}

    @app.post("/api/v2/me/sftp:reset_password")
    async def reset_my_sftp_password(req: Request) -> dict[str, Any]:
        """
        v3.0 WebUI: allow a user to rotate their own SFTPGo password and copy it.
        Note: SFTPGo does not allow reading the current password, so this endpoint returns a new one-time password.
        """
        subject = _auth(req)
        user_id = str(subject["user_id"])
        u = db.get_user(user_id)
        if not u:
            raise HTTPException(status_code=404, detail="user not found")
        if not _sftpgo_enabled():
            raise HTTPException(status_code=400, detail="sftpgo is not enabled")
        pw = secrets.token_urlsafe(12)
        try:
            _sftpgo_client().reset_password(username=user_id, new_password=pw, home_dir=_user_home(user_id))
            _sftpgo_client().enable_user(username=user_id, home_dir=_user_home(user_id))
        except Exception as e:
            raise HTTPException(status_code=502, detail=f"sftpgo reset password failed: {e!r}")
        return {"user_id": user_id, "password": pw}

    @app.get("/api/v2/me")
    async def me(req: Request) -> dict[str, Any]:
        subject = _auth(req)
        user_id = str(subject["user_id"])
        try:
            _ensure_user_dirs(user_id)
        except Exception:
            # Best-effort: user may still be able to use API even if FS init fails.
            pass
        # Best-effort: reconcile SFTPGo user to include /common read-only mounts.
        # This makes the virtual folders visible without requiring a password reset.
        if _sftpgo_enabled() and not subject.get("is_admin"):
            try:
                _sftpgo_client().enable_user(username=user_id, home_dir=_user_home(user_id))
            except Exception:
                pass
        home = _user_home(user_id)
        out: dict[str, Any] = {
            "user_id": user_id,
            "is_admin": bool(subject.get("is_admin")),
            "paths": {
                "home": home,
                "datasets": f"{home}/datasets",
                "models": f"{home}/models",
                "code": f"{home}/code",
                "jobs": f"{home}/jobs",
                "trash_jobs": f"{home}/trash/jobs",
            },
            "retention": {
                "jobs_trash_after_days": int(v2_cfg.data.retention.jobs_trash_after_days),
                "jobs_purge_after_days": int(v2_cfg.data.retention.jobs_purge_after_days),
            },
        }
        if _sftpgo_enabled():
            out["sftp"] = {
                "host": v2_cfg.data.sftpgo.host,
                "port": int(v2_cfg.data.sftpgo.sftp_port),
                "username": user_id,
            }
        return out

    @app.get("/api/v2/tasks")
    async def list_tasks(req: Request, limit: int = 200, offset: int = 0, states: str | None = None) -> dict[str, Any]:
        subject = _auth(req)
        lim = max(1, min(int(limit), 1000))
        off = max(0, int(offset))
        state_list: list[str] | None = None
        if states:
            raw = [s.strip() for s in str(states).split(",") if s.strip()]
            # Keep it strict to avoid typos silently returning empty results.
            allowed = {
                "QUEUED",
                "PENDING_RESOURCES",
                "SUBMITTING",
                "SUBMITTED",
                "RUNNING",
                "SUCCEEDED",
                "FAILED",
                "CANCELED",
            }
            for s in raw:
                if s not in allowed:
                    raise HTTPException(status_code=400, detail=f"invalid state: {s}")
            state_list = raw or None
        if subject.get("is_admin"):
            tasks = db.list_tasks(user_id=None, states=state_list, limit=lim, offset=off)
        else:
            tasks = db.list_tasks(user_id=str(subject["user_id"]), states=state_list, limit=lim, offset=off)
        return {"tasks": tasks, "limit": lim, "offset": off, "has_more": bool(len(tasks) == lim)}

    @app.post("/api/v2/tasks")
    async def submit_task(req: Request) -> dict[str, Any]:
        subject = _auth(req)
@@ -126,13 +325,40 @@ def create_app(config_path: str) -> FastAPI:
        except Exception as e:
            raise HTTPException(status_code=400, detail=f"invalid jobspec: {e!r}")

-       # v2.5 constraint: training inputs must come from /private/common (dev/prod aligned).
-       common_prefix = ray_cfg.shared_root.rstrip("/") + "/common/"
-       for k, v in (("code_path", spec.code_path), ("train_file", spec.train_file), ("val_file", spec.val_file)):
        # v3.0 path policy:
        # - code_path: only allow /private/common/...
        # - train/val: allow /private/common/datasets/... OR /private/users/<me>/datasets/...
        # - model_id: if it looks like a local path (/private/...), allow only models dirs:
        #   /private/common/models/... OR /private/users/<me>/models/...
        root = ray_cfg.shared_root.rstrip("/")
        common_prefix = f"{root}/common/"
        user_prefix = f"{root}/users/{str(subject['user_id']).strip()}/"

        common_datasets_prefix = f"{common_prefix}datasets/"
        user_datasets_prefix = f"{user_prefix}datasets/"
        common_models_prefix = f"{common_prefix}models/"
        user_models_prefix = f"{user_prefix}models/"

        if not str(spec.code_path).startswith(common_prefix):
            raise HTTPException(status_code=400, detail=f"code_path must start with {common_prefix}")

        for k, v in (("train_file", spec.train_file), ("val_file", spec.val_file)):
            if v is None:
                continue
-           if not str(v).startswith(common_prefix):
-               raise HTTPException(status_code=400, detail=f"{k} must start with {common_prefix}")
            sv = str(v)
            if not (sv.startswith(common_datasets_prefix) or sv.startswith(user_datasets_prefix)):
                raise HTTPException(
                    status_code=400,
                    detail=f"{k} must start with {common_datasets_prefix} or {user_datasets_prefix}",
                )

        model_id = str(spec.model_id)
        if model_id.startswith(f"{root}/"):
            if not (model_id.startswith(common_models_prefix) or model_id.startswith(user_models_prefix)):
                raise HTTPException(
                    status_code=400,
                    detail=f"model_id local path must start with {common_models_prefix} or {user_models_prefix}",
                )

        task_id = new_task_id(spec.workload, user_id=str(subject["user_id"]))
        db.create_task_v25(
@@ -257,4 +483,7 @@ def create_app(config_path: str) -> FastAPI:
            return db.list_queue()
        return db.list_queue(user_id=str(subject["user_id"]))

    # v3.0: minimal WebUI (no server-side session; token stored in browser localStorage).
    register_ui_routes(app)

    return app

@@ -27,12 +27,37 @@ class V2SchedulerConfig:
    max_running_tasks: int = 1


@dataclass(frozen=True)
class V2RetentionConfig:
    jobs_trash_after_days: int = 3
    jobs_purge_after_days: int = 7
    janitor_interval_s: int = 3600


@dataclass(frozen=True)
class V2SFTPGoConfig:
    enabled: bool = False
    host: str = ""
    sftp_port: int = 2022
    admin_api_base: str = ""
    admin_user: str = "admin"
    admin_password_env: str = "SFTPGO_ADMIN_PASSWORD"


@dataclass(frozen=True)
class V2DataConfig:
    user_root: str
    sftpgo: V2SFTPGoConfig
    retention: V2RetentionConfig


@dataclass(frozen=True)
class V2Config:
    api: V2ApiConfig
    auth: V2AuthConfig
    sqlite: V2SqliteConfig
    scheduler: V2SchedulerConfig
    data: V2DataConfig

    @staticmethod
    def from_root_dict(root: dict[str, Any]) -> "V2Config":
@@ -58,9 +83,19 @@ class V2Config:
        else:
            shared_root = str(root.get("shared_root") or "/private")

        data = root.get("data") or {}
        if not isinstance(data, dict):
            raise ValueError("config.data must be a mapping")
        sftpgo = data.get("sftpgo") or {}
        retention = data.get("retention") or {}
        if not isinstance(sftpgo, dict) or not isinstance(retention, dict):
            raise ValueError("config.data.{sftpgo,retention} must be mappings")

        default_db_path = f"{shared_root}/common/db/mvp.sqlite3"
        db_path = str(sqlite.get("db_path") or default_db_path)

        user_root = str(data.get("user_root") or f"{shared_root}/users")

        return V2Config(
            api=V2ApiConfig(
                host=str(api.get("host") or "0.0.0.0"),
@@ -73,4 +108,20 @@ class V2Config:
                retry_interval_s=int(scheduler.get("retry_interval_s") or 60),
                max_running_tasks=int(scheduler.get("max_running_tasks") or 1),
            ),
            data=V2DataConfig(
                user_root=user_root,
                sftpgo=V2SFTPGoConfig(
                    enabled=bool(sftpgo.get("enabled") or False),
                    host=str(sftpgo.get("host") or ""),
                    sftp_port=int(sftpgo.get("sftp_port") or 2022),
                    admin_api_base=str(sftpgo.get("admin_api_base") or ""),
                    admin_user=str(sftpgo.get("admin_user") or "admin"),
                    admin_password_env=str(sftpgo.get("admin_password_env") or "SFTPGO_ADMIN_PASSWORD"),
                ),
                retention=V2RetentionConfig(
                    jobs_trash_after_days=int(retention.get("jobs_trash_after_days") or 3),
                    jobs_purge_after_days=int(retention.get("jobs_purge_after_days") or 7),
                    janitor_interval_s=int(retention.get("janitor_interval_s") or 3600),
                ),
            ),
        )

@@ -40,7 +40,8 @@ class Db:
                user_id TEXT PRIMARY KEY,
                display_name TEXT,
                state TEXT NOT NULL,
-               created_at TEXT NOT NULL
+               created_at TEXT NOT NULL,
+               updated_at TEXT NOT NULL
            )
            """
        )
@@ -49,6 +50,7 @@ class Db:
            CREATE TABLE IF NOT EXISTS api_tokens (
                token_hash TEXT PRIMARY KEY,
                user_id TEXT NOT NULL,
                token_plain TEXT NOT NULL,
                created_at TEXT NOT NULL,
                last_used_at TEXT,
                FOREIGN KEY (user_id) REFERENCES users(user_id) ON DELETE CASCADE
@@ -78,6 +80,15 @@ class Db:
            conn.execute("ALTER TABLE tasks ADD COLUMN user_id TEXT")
        except sqlite3.OperationalError:
            pass
        # Best-effort: add missing columns (forward compatibility).
        try:
            conn.execute("ALTER TABLE users ADD COLUMN updated_at TEXT")
        except sqlite3.OperationalError:
            pass
        try:
            conn.execute("ALTER TABLE api_tokens ADD COLUMN token_plain TEXT")
        except sqlite3.OperationalError:
            pass
        conn.execute(
            """
            CREATE TABLE IF NOT EXISTS attempts (
@@ -168,10 +179,10 @@ class Db:
        with self.tx() as conn:
            conn.execute(
                """
-               INSERT INTO users (user_id, display_name, state, created_at)
-               VALUES (?, ?, 'ACTIVE', ?)
+               INSERT INTO users (user_id, display_name, state, created_at, updated_at)
+               VALUES (?, ?, 'ACTIVE', ?, ?)
                """,
-               (user_id, display_name, now),
+               (user_id, display_name, now, now),
            )
            conn.execute(
                "INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'USER_CREATED', ?)",
@@ -183,7 +194,7 @@ class Db:
    def disable_user(self, *, user_id: str) -> None:
        now = _utc_now_iso()
        with self.tx() as conn:
-           conn.execute("UPDATE users SET state = 'DISABLED' WHERE user_id = ?", (user_id,))
+           conn.execute("UPDATE users SET state = 'DISABLED', updated_at = ? WHERE user_id = ?", (now, user_id))
            conn.execute(
                "INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'USER_DISABLED', ?)",
                (now, user_id),
@@ -195,15 +206,18 @@ class Db:
        return dict(row) if row else None

    def issue_token(self, *, user_id: str) -> str:
-       # Returns plaintext token once; stores hash only.
+       # Returns plaintext token once.
+       # Note: For v3.0 WebUI admin convenience we persist the plaintext token so admins
+       # can re-copy it later. This is acceptable only for internal/dev deployments.
        now = _utc_now_iso()
        token = f"mvp_u_{user_id}_{secrets.token_urlsafe(18)}"
        token_hash = self._hash_token(token)
        with self.tx() as conn:
            conn.execute(
-               "INSERT INTO api_tokens (token_hash, user_id, created_at) VALUES (?, ?, ?)",
-               (token_hash, user_id, now),
+               "INSERT INTO api_tokens (token_hash, user_id, token_plain, created_at) VALUES (?, ?, ?, ?)",
+               (token_hash, user_id, token, now),
            )
            conn.execute("UPDATE users SET updated_at = ? WHERE user_id = ?", (now, user_id))
            conn.execute(
                "INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'TOKEN_ISSUED', ?)",
                (now, user_id),
@@ -226,8 +240,48 @@ class Db:
                return None
            now = _utc_now_iso()
            conn.execute("UPDATE api_tokens SET last_used_at = ? WHERE token_hash = ?", (now, token_hash))
            conn.execute("UPDATE users SET updated_at = ? WHERE user_id = ?", (now, str(row["user_id"])))
|
||||
return {"user_id": row["user_id"], "state": row["state"]}
|
||||
|
||||
def list_users(self, limit: int = 200) -> list[dict[str, Any]]:
|
||||
with self._connect() as conn:
|
||||
rows = conn.execute(
|
||||
"""
|
||||
SELECT
|
||||
u.user_id,
|
||||
u.display_name,
|
||||
u.state,
|
||||
u.created_at,
|
||||
u.updated_at,
|
||||
(
|
||||
SELECT t.token_plain
|
||||
FROM api_tokens t
|
||||
WHERE t.user_id = u.user_id
|
||||
ORDER BY t.created_at DESC
|
||||
LIMIT 1
|
||||
) AS token,
|
||||
(
|
||||
SELECT t.created_at
|
||||
FROM api_tokens t
|
||||
WHERE t.user_id = u.user_id
|
||||
ORDER BY t.created_at DESC
|
||||
LIMIT 1
|
||||
) AS token_created_at,
|
||||
(
|
||||
SELECT t.last_used_at
|
||||
FROM api_tokens t
|
||||
WHERE t.user_id = u.user_id
|
||||
ORDER BY t.created_at DESC
|
||||
LIMIT 1
|
||||
) AS token_last_used_at
|
||||
FROM users u
|
||||
ORDER BY u.created_at DESC
|
||||
LIMIT ?
|
||||
""",
|
||||
(int(limit),),
|
||||
).fetchall()
|
||||
return [dict(r) for r in rows]
|
||||
|
||||
def get_task(self, task_id: str) -> dict[str, Any] | None:
|
||||
with self._connect() as conn:
|
||||
row = conn.execute("SELECT * FROM tasks WHERE task_id = ?", (task_id,)).fetchone()
|
||||
@ -269,6 +323,40 @@ class Db:
|
||||
running = conn.execute(running_sql, tuple(params)).fetchall()
|
||||
return {"pending": [dict(r) for r in pending], "running": [dict(r) for r in running]}
|
||||
|
||||
def list_tasks(
|
||||
self,
|
||||
*,
|
||||
user_id: str | None = None,
|
||||
states: list[str] | None = None,
|
||||
limit: int = 200,
|
||||
offset: int = 0,
|
||||
) -> list[dict[str, Any]]:
|
||||
"""
|
||||
Returns recent tasks (including terminal tasks).
|
||||
"""
|
||||
with self._connect() as conn:
|
||||
params: list[Any] = []
|
||||
where_clauses: list[str] = []
|
||||
if user_id is not None:
|
||||
where_clauses.append("user_id = ?")
|
||||
params.append(user_id)
|
||||
if states:
|
||||
placeholders = ",".join(["?"] * len(states))
|
||||
where_clauses.append(f"state IN ({placeholders})")
|
||||
params.extend(states)
|
||||
where_sql = f" WHERE {' AND '.join(where_clauses)}" if where_clauses else ""
|
||||
sql = (
|
||||
"SELECT task_id, user_id, workload, state, nnodes, n_gpus_per_node, latest_attempt_no, created_at, updated_at, error_summary "
|
||||
"FROM tasks"
|
||||
f"{where_sql} "
|
||||
"ORDER BY created_at DESC "
|
||||
"LIMIT ? OFFSET ?"
|
||||
)
|
||||
params.append(int(limit))
|
||||
params.append(max(0, int(offset)))
|
||||
rows = conn.execute(sql, tuple(params)).fetchall()
|
||||
return [dict(r) for r in rows]
|
||||
|
||||
def count_running(self) -> int:
|
||||
with self._connect() as conn:
|
||||
row = conn.execute(
|
||||
@ -383,3 +471,25 @@ class Db:
|
||||
"INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (?, ?, 'ATTEMPT_UPDATE', ?)",
|
||||
(task_id, now, None),
|
||||
)
|
||||
|
||||
def list_ended_attempts_before(self, *, end_time_le: str, limit: int = 2000) -> list[dict[str, Any]]:
|
||||
"""
|
||||
Used by the jobs janitor:
|
||||
- Only considers tasks with a non-null user_id (v2.5+).
|
||||
- Returns attempts that have end_time <= end_time_le.
|
||||
"""
|
||||
with self._connect() as conn:
|
||||
rows = conn.execute(
|
||||
"""
|
||||
SELECT t.task_id, t.user_id, a.ray_submission_id, a.end_time
|
||||
FROM attempts a
|
||||
JOIN tasks t ON t.task_id = a.task_id
|
||||
WHERE t.user_id IS NOT NULL
|
||||
AND a.end_time IS NOT NULL
|
||||
AND a.end_time <= ?
|
||||
ORDER BY a.end_time ASC
|
||||
LIMIT ?
|
||||
""",
|
||||
(str(end_time_le), int(limit)),
|
||||
).fetchall()
|
||||
return [dict(r) for r in rows]
|
||||
|
||||
src/mvp/py/argus/service/janitor.py (new file, 105 lines)
@@ -0,0 +1,105 @@
from __future__ import annotations

import os
import shutil
import time
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

from .db import Db


def _parse_iso_z(ts: str) -> datetime:
    # Stored as "YYYY-MM-DDTHH:MM:SSZ" (naive UTC). Parse into aware UTC.
    return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)


def _iso_z(dt: datetime) -> str:
    return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")


@dataclass
class JobsJanitor:
    db: Db
    user_root: str
    trash_after_days: int = 3
    purge_after_days: int = 7
    interval_s: int = 3600

    def __post_init__(self) -> None:
        if int(self.trash_after_days) < 0 or int(self.purge_after_days) < 0:
            raise ValueError("retention days must be non-negative")
        if int(self.purge_after_days) and int(self.purge_after_days) < int(self.trash_after_days):
            raise ValueError("purge_after_days must be >= trash_after_days")

    def _job_dir(self, *, user_id: str, ray_submission_id: str) -> str:
        base = self.user_root.rstrip("/")
        return f"{base}/{user_id}/jobs/{ray_submission_id}"

    def _trash_dir(self, *, user_id: str, ray_submission_id: str) -> str:
        base = self.user_root.rstrip("/")
        return f"{base}/{user_id}/trash/jobs/{ray_submission_id}"

    def tick_once(self, *, now: datetime | None = None, limit: int = 2000) -> None:
        if int(self.trash_after_days) <= 0 and int(self.purge_after_days) <= 0:
            return

        now_dt = now or datetime.now(timezone.utc)
        move_cutoff = now_dt - timedelta(days=int(self.trash_after_days))
        move_cutoff_iso = _iso_z(move_cutoff)

        rows = self.db.list_ended_attempts_before(end_time_le=move_cutoff_iso, limit=int(limit))
        for r in rows:
            user_id = str(r.get("user_id") or "").strip()
            sid = str(r.get("ray_submission_id") or "").strip()
            end_time = str(r.get("end_time") or "").strip()
            if not user_id or not sid or not end_time:
                continue

            try:
                ended_at = _parse_iso_z(end_time)
            except Exception:
                continue

            age_days = (now_dt - ended_at).total_seconds() / 86400.0
            src = self._job_dir(user_id=user_id, ray_submission_id=sid)
            dst = self._trash_dir(user_id=user_id, ray_submission_id=sid)
            dst_parent = os.path.dirname(dst)

            # Between trash and purge: ensure in trash.
            if age_days >= float(self.trash_after_days) and age_days < float(self.purge_after_days or 10**9):
                if os.path.exists(src) and not os.path.exists(dst):
                    os.makedirs(dst_parent, exist_ok=True)
                    try:
                        shutil.move(src, dst)
                    except Exception:
                        pass
                continue

            # Purge: move to trash (if still in jobs) then delete from trash.
            if int(self.purge_after_days) > 0 and age_days >= float(self.purge_after_days):
                if os.path.exists(src) and not os.path.exists(dst):
                    os.makedirs(dst_parent, exist_ok=True)
                    try:
                        shutil.move(src, dst)
                    except Exception:
                        pass
                if os.path.exists(dst):
                    try:
                        shutil.rmtree(dst)
                    except Exception:
                        pass

    def run_forever(self, stop_flag: object) -> None:
        try:
            is_set = stop_flag.is_set  # type: ignore[attr-defined]
        except Exception:
            raise ValueError("stop_flag must be a threading.Event-like object with is_set()")

        while not is_set():
            try:
                self.tick_once()
            except Exception:
                pass
            time.sleep(max(1, int(self.interval_s)))
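The janitor's age logic above boils down to three windows over an attempt's end time: keep (fresher than `trash_after_days`), trash (between the two thresholds), purge (older than `purge_after_days`). A small standalone restatement of that decision, not part of the commit — the function name `retention_action` is mine, for illustration:

```python
from datetime import datetime, timedelta, timezone

def retention_action(
    ended_at: datetime,
    now: datetime,
    trash_after_days: int = 3,
    purge_after_days: int = 7,
) -> str:
    # Mirrors the window checks in JobsJanitor.tick_once.
    age_days = (now - ended_at).total_seconds() / 86400.0
    if purge_after_days > 0 and age_days >= float(purge_after_days):
        return "purge"  # move to trash if needed, then rmtree
    if age_days >= float(trash_after_days):
        return "trash"  # move <user>/jobs/<sid> -> <user>/trash/jobs/<sid>
    return "keep"

now = datetime(2025, 12, 30, tzinfo=timezone.utc)
assert retention_action(now - timedelta(days=1), now) == "keep"
assert retention_action(now - timedelta(days=4), now) == "trash"
assert retention_action(now - timedelta(days=8), now) == "purge"
```

Note the defaults match `V2RetentionConfig` (3-day trash, 7-day purge), and `tick_once` additionally treats a non-positive `purge_after_days` as "never purge" via the `10**9` sentinel.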
src/mvp/py/argus/service/sftpgo.py (new file, 221 lines)
@@ -0,0 +1,221 @@
from __future__ import annotations

import base64
import json
from dataclasses import dataclass
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


class SFTPGoError(RuntimeError):
    pass


@dataclass(frozen=True)
class SFTPGoAdminClient:
    """
    Minimal SFTPGo admin API client (v3.0).

    SFTPGo OpenAPI documents the admin token flow:
    - GET /api/v2/token with HTTP BasicAuth -> returns {"access_token": "..."}.
    - Use Authorization: Bearer <token> for admin endpoints like POST /api/v2/users.

    See upstream OpenAPI for details:
    https://raw.githubusercontent.com/drakkan/sftpgo/main/openapi/openapi.yaml
    """

    admin_api_base: str
    admin_user: str
    admin_password: str
    common_root: str = "/private/common"

    def _url(self, path: str) -> str:
        base = self.admin_api_base.rstrip("/")
        p = path if path.startswith("/") else f"/{path}"
        return f"{base}{p}"

    def _basic_auth_header(self) -> str:
        raw = f"{self.admin_user}:{self.admin_password}".encode("utf-8")
        return "Basic " + base64.b64encode(raw).decode("ascii")

    def _get_json(self, url: str, headers: dict[str, str]) -> dict:
        req = Request(url, headers=headers, method="GET")
        try:
            with urlopen(req, timeout=10) as resp:
                return json.loads(resp.read().decode("utf-8"))
        except HTTPError as e:
            raise SFTPGoError(f"sftpgo http error: {e.code} {e.reason}") from e
        except URLError as e:
            raise SFTPGoError(f"sftpgo connection error: {e!r}") from e

    def _post_json(self, url: str, payload: dict, headers: dict[str, str]) -> None:
        data = json.dumps(payload).encode("utf-8")
        req = Request(url, data=data, headers=headers, method="POST")
        try:
            with urlopen(req, timeout=10) as resp:
                _ = resp.read()
        except HTTPError as e:
            raise SFTPGoError(f"sftpgo http error: {e.code} {e.reason}") from e
        except URLError as e:
            raise SFTPGoError(f"sftpgo connection error: {e!r}") from e

    def _put_json(self, url: str, payload: dict, headers: dict[str, str]) -> None:
        data = json.dumps(payload).encode("utf-8")
        req = Request(url, data=data, headers=headers, method="PUT")
        try:
            with urlopen(req, timeout=10) as resp:
                _ = resp.read()
        except HTTPError as e:
            raise SFTPGoError(f"sftpgo http error: {e.code} {e.reason}") from e
        except URLError as e:
            raise SFTPGoError(f"sftpgo connection error: {e!r}") from e

    def _admin_token(self) -> str:
        url = self._url("/token")
        obj = self._get_json(url, headers={"Authorization": self._basic_auth_header()})
        tok = str(obj.get("access_token") or "").strip()
        if not tok:
            raise SFTPGoError("sftpgo token response missing access_token")
        return tok

    def _auth_headers(self, tok: str) -> dict[str, str]:
        return {"Authorization": f"Bearer {tok}", "Content-Type": "application/json"}

    def _ensure_folder(self, *, tok: str, name: str, mapped_path: str) -> None:
        url = self._url("/folders")
        try:
            self._post_json(url, {"name": name, "mapped_path": mapped_path}, headers=self._auth_headers(tok))
        except SFTPGoError as e:
            # Idempotent + self-healing:
            # If it already exists, update mapped_path to the desired value.
            if "409" in str(e):
                self._put_json(
                    self._url(f"/folders/{name}"),
                    {"name": name, "mapped_path": mapped_path},
                    headers=self._auth_headers(tok),
                )
                return
            raise

    def _ensure_common_folders(self, *, tok: str) -> None:
        # Important: do NOT map datasets to /private/common/datasets because that path is
        # a symlink farm (e.g. gsm8k -> /private/datasets/gsm8k) and SFTPGo WebClient
        # will treat symlink traversal as escaping the virtual folder root, resulting
        # in "permission denied". Map directly to the real datasets root.
        common = self.common_root.rstrip("/")
        if common.endswith("/common"):
            shared_root = common[: -len("/common")] or "/private"
        else:
            shared_root = "/private"

        self._ensure_folder(tok=tok, name="common_datasets", mapped_path=f"{shared_root}/datasets")
        # Expose HF cache read-only so users can inspect downloaded models/datasets.
        self._ensure_folder(tok=tok, name="common_hf", mapped_path=f"{shared_root}/hf")

    def _get_user(self, *, tok: str, username: str) -> dict:
        url = self._url(f"/users/{username}")
        return self._get_json(url, headers={"Authorization": f"Bearer {tok}"})

    def _put_user(self, *, tok: str, username: str, payload: dict) -> None:
        url = self._url(f"/users/{username}")
        self._put_json(url, payload, headers=self._auth_headers(tok))

    def _apply_common_readonly(self, user_payload: dict) -> dict:
        # Path-based permissions: make /common/* read-only while keeping home writeable.
        perms = dict(user_payload.get("permissions") or {"/": ["*"]})
        # Ensure /common is visible as a directory and can be traversed.
        perms["/common"] = ["list"]
        perms["/common/datasets"] = ["list", "download"]
        perms["/common/hf"] = ["list", "download"]
        user_payload["permissions"] = perms

        desired_vf = [
            {"name": "common_datasets", "virtual_path": "/common/datasets"},
            {"name": "common_hf", "virtual_path": "/common/hf"},
        ]
        existing = user_payload.get("virtual_folders") or []
        if not isinstance(existing, list):
            existing = []
        seen = {(vf.get("name"), vf.get("virtual_path")) for vf in existing if isinstance(vf, dict)}
        merged = list(existing)
        for vf in desired_vf:
            key = (vf["name"], vf["virtual_path"])
            if key not in seen:
                merged.append(vf)
        user_payload["virtual_folders"] = merged
        return user_payload

    def create_user(self, *, username: str, password: str, home_dir: str) -> None:
        tok = self._admin_token()
        self._ensure_common_folders(tok=tok)
        url = self._url("/users")
        payload: dict = {
            "status": 1,
            "username": username,
            "password": password,
            "home_dir": home_dir,
            "permissions": {
                "/": ["*"],
                "/common": ["list"],
                "/common/datasets": ["list", "download"],
                "/common/hf": ["list", "download"],
            },
            "virtual_folders": [
                {"name": "common_datasets", "virtual_path": "/common/datasets"},
                {"name": "common_hf", "virtual_path": "/common/hf"},
            ],
        }
        self._post_json(
            url,
            payload,
            headers=self._auth_headers(tok),
        )

    def disable_user(self, *, username: str, home_dir: str) -> None:
        tok = self._admin_token()
        self._ensure_common_folders(tok=tok)
        cur = self._get_user(tok=tok, username=username)
        payload: dict = {
            "username": username,
            "status": 0,
            "home_dir": home_dir,
            "uid": cur.get("uid", 0),
            "gid": cur.get("gid", 0),
            "permissions": cur.get("permissions") or {"/": ["*"]},
            "virtual_folders": cur.get("virtual_folders") or [],
        }
        self._apply_common_readonly(payload)
        self._put_user(tok=tok, username=username, payload=payload)

    def enable_user(self, *, username: str, home_dir: str) -> None:
        tok = self._admin_token()
        self._ensure_common_folders(tok=tok)
        cur = self._get_user(tok=tok, username=username)
        payload: dict = {
            "username": username,
            "status": 1,
            "home_dir": home_dir,
            "uid": cur.get("uid", 0),
            "gid": cur.get("gid", 0),
            "permissions": cur.get("permissions") or {"/": ["*"]},
            "virtual_folders": cur.get("virtual_folders") or [],
        }
        self._apply_common_readonly(payload)
        self._put_user(tok=tok, username=username, payload=payload)

    def reset_password(self, *, username: str, new_password: str, home_dir: str) -> None:
        tok = self._admin_token()
        self._ensure_common_folders(tok=tok)
        cur = self._get_user(tok=tok, username=username)
        payload: dict = {
            "username": username,
            "status": 1,
            "home_dir": home_dir,
            "uid": cur.get("uid", 0),
            "gid": cur.get("gid", 0),
            "permissions": cur.get("permissions") or {"/": ["*"]},
            "virtual_folders": cur.get("virtual_folders") or [],
            "password": new_password,
        }
        self._apply_common_readonly(payload)
        self._put_user(tok=tok, username=username, payload=payload)
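The core invariant this client enforces is in `_apply_common_readonly`: whatever state the SFTPGo user is in, `/common/*` ends up list/download-only and the two shared virtual folders are merged in without duplicates. A standalone restatement of that merge (no network, not part of the commit — the function name is mine, for illustration):

```python
def apply_common_readonly(user_payload: dict) -> dict:
    # Same logic as SFTPGoAdminClient._apply_common_readonly.
    perms = dict(user_payload.get("permissions") or {"/": ["*"]})
    perms["/common"] = ["list"]
    perms["/common/datasets"] = ["list", "download"]
    perms["/common/hf"] = ["list", "download"]
    user_payload["permissions"] = perms

    desired = [
        {"name": "common_datasets", "virtual_path": "/common/datasets"},
        {"name": "common_hf", "virtual_path": "/common/hf"},
    ]
    existing = user_payload.get("virtual_folders") or []
    if not isinstance(existing, list):
        existing = []
    seen = {(vf.get("name"), vf.get("virtual_path")) for vf in existing if isinstance(vf, dict)}
    merged = list(existing) + [vf for vf in desired if (vf["name"], vf["virtual_path"]) not in seen]
    user_payload["virtual_folders"] = merged
    return user_payload

# Re-applying to a user that already has one shared folder adds only the missing one.
p = apply_common_readonly(
    {"permissions": {"/": ["*"]}, "virtual_folders": [{"name": "common_hf", "virtual_path": "/common/hf"}]}
)
assert p["permissions"]["/common/datasets"] == ["list", "download"]
assert len(p["virtual_folders"]) == 2
```

Because disable/enable/reset_password all round-trip the current user record through this helper, a payload that drifted (e.g. a manually edited user) is healed on the next admin operation, which is the same idempotent posture as `_ensure_folder`'s 409-then-PUT fallback.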
src/mvp/py/argus/service/ui.py (new file, 629 lines)
@@ -0,0 +1,629 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import html
|
||||
import json
|
||||
|
||||
from fastapi import FastAPI
|
||||
from fastapi.responses import HTMLResponse, RedirectResponse
|
||||
|
||||
|
||||
_BASE_CSS = """
|
||||
:root { --bg:#0b1020; --panel:#111a33; --muted:#95a3c6; --fg:#e8eeff; --accent:#7aa2ff; --danger:#ff6b6b; --ok:#3ddc97; }
|
||||
* { box-sizing: border-box; }
|
||||
body { margin:0; font-family: ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Helvetica, Arial; background:var(--bg); color:var(--fg); }
|
||||
a { color:var(--accent); text-decoration:none; }
|
||||
.layout { display:flex; min-height:100vh; }
|
||||
.nav { width: 240px; padding:16px; background: linear-gradient(180deg, #0e1630, #0b1020); border-right: 1px solid rgba(255,255,255,0.06); }
|
||||
.brand { font-weight: 700; letter-spacing: .2px; margin-bottom: 12px; }
|
||||
.nav a { display:block; padding:10px 10px; border-radius:10px; color: var(--fg); opacity: .9; }
|
||||
.nav a.active { background: rgba(122,162,255,0.14); border: 1px solid rgba(122,162,255,0.22); }
|
||||
.nav a:hover { background: rgba(255,255,255,0.06); }
|
||||
.main { flex:1; padding: 20px 24px; }
|
||||
.card { background: rgba(255,255,255,0.04); border: 1px solid rgba(255,255,255,0.08); border-radius: 14px; padding: 16px; }
|
||||
.row { display:flex; gap: 12px; align-items:center; flex-wrap: wrap; }
|
||||
.muted { color: var(--muted); }
|
||||
.btn { border: 1px solid rgba(255,255,255,0.16); background: rgba(255,255,255,0.06); color: var(--fg); padding: 10px 12px; border-radius: 10px; cursor: pointer; }
|
||||
.btn:hover { background: rgba(255,255,255,0.10); }
|
||||
.btn.danger { border-color: rgba(255,107,107,0.35); background: rgba(255,107,107,0.10); }
|
||||
.pill { display:inline-block; padding: 2px 10px; border-radius: 999px; border: 1px solid rgba(255,255,255,0.16); font-size: 12px; }
|
||||
.pill.ok { border-color: rgba(61,220,151,0.35); background: rgba(61,220,151,0.12); }
|
||||
.pill.bad { border-color: rgba(255,107,107,0.35); background: rgba(255,107,107,0.12); }
|
||||
textarea, input { width: 100%; color: var(--fg); background: rgba(255,255,255,0.06); border: 1px solid rgba(255,255,255,0.12); border-radius: 12px; padding: 10px 12px; outline: none; }
|
||||
button:disabled { opacity: .45; cursor: not-allowed; }
|
||||
pre { white-space: pre-wrap; word-break: break-word; }
|
||||
table { width:100%; border-collapse: collapse; }
|
||||
th, td { padding: 10px 8px; border-bottom: 1px solid rgba(255,255,255,0.08); text-align:left; }
|
||||
""".strip()
|
||||
|
||||
|
||||
_BASE_JS = """
|
||||
function mvpTokenGet() {
|
||||
return (localStorage.getItem("mvp_token") || "").trim();
|
||||
}
|
||||
function mvpTokenSet(v) {
|
||||
localStorage.setItem("mvp_token", (v || "").trim());
|
||||
}
|
||||
function mvpSftpPasswordGet() {
|
||||
return (localStorage.getItem("mvp_sftp_password") || "").trim();
|
||||
}
|
||||
function mvpSftpPasswordSet(v) {
|
||||
localStorage.setItem("mvp_sftp_password", (v || "").trim());
|
||||
}
|
||||
async function apiFetch(path, opts) {
|
||||
opts = opts || {};
|
||||
opts.headers = opts.headers || {};
|
||||
const tok = mvpTokenGet();
|
||||
if (tok) opts.headers["Authorization"] = "Bearer " + tok;
|
||||
return fetch(path, opts);
|
||||
}
|
||||
async function apiJson(path, opts) {
|
||||
const resp = await apiFetch(path, opts);
|
||||
const text = await resp.text();
|
||||
if (!resp.ok) {
|
||||
const err = new Error("HTTP " + resp.status);
|
||||
err.status = resp.status;
|
||||
err.body = text;
|
||||
throw err;
|
||||
}
|
||||
return JSON.parse(text);
|
||||
}
|
||||
function fmtJson(obj) {
|
||||
try { return JSON.stringify(obj, null, 2); } catch (e) { return String(obj); }
|
||||
}
|
||||
function curOriginWithPort(port) {
|
||||
const proto = window.location.protocol;
|
||||
const host = window.location.hostname;
|
||||
return proto + "//" + host + ":" + port;
|
||||
}
|
||||
async function copyText(v) {
|
||||
if (!v) return false;
|
||||
try {
|
||||
await navigator.clipboard.writeText(v);
|
||||
return true;
|
||||
} catch (e) {
|
||||
// Fallback for non-secure contexts (http) or older browsers.
|
||||
try {
|
||||
const ta = document.createElement("textarea");
|
||||
ta.value = v;
|
||||
ta.style.position = "fixed";
|
||||
ta.style.opacity = "0";
|
||||
document.body.appendChild(ta);
|
||||
ta.focus();
|
||||
ta.select();
|
||||
const ok = document.execCommand("copy");
|
||||
document.body.removeChild(ta);
|
||||
return ok;
|
||||
} catch (e2) {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
}
|
||||
document.addEventListener("DOMContentLoaded", () => {
|
||||
const el = document.getElementById("nav-ray-dashboard");
|
||||
if (el) el.href = curOriginWithPort(8265);
|
||||
});
|
||||
""".strip()
|
||||
|
||||
|
||||
def _nav(active: str) -> str:
|
||||
links = [
|
||||
("login", "/ui/login", "Login"),
|
||||
("tasks", "/ui/tasks", "Tasks"),
|
||||
("new", "/ui/tasks/new", "New Task"),
|
||||
("data", "/ui/data", "Data"),
|
||||
("admin", "/ui/admin", "Admin"),
|
||||
("ray", "#", "Ray Dashboard"),
|
||||
]
|
||||
items = []
|
||||
for key, href, label in links:
|
||||
cls = "active" if key == active else ""
|
||||
extra = ""
|
||||
if key == "ray":
|
||||
extra = ' id="nav-ray-dashboard" target="_blank" rel="noopener"'
|
||||
items.append(f'<a class="{cls}" href="{href}"{extra}>{html.escape(label)}</a>')
|
||||
return "\n".join(items)
|
||||
|
||||
|
||||
def _page(title: str, active: str, body: str, script: str = "") -> str:
|
||||
return f"""<!doctype html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<title>{html.escape(title)}</title>
|
||||
<style>{_BASE_CSS}</style>
|
||||
</head>
|
||||
<body>
|
||||
<div class="layout">
|
||||
<nav class="nav">
|
||||
<div class="brand">Argus MVP</div>
|
||||
{_nav(active)}
|
||||
<div style="margin-top:12px" class="muted">
|
||||
Token stored in browser localStorage as <code>mvp_token</code>.
|
||||
</div>
|
||||
</nav>
|
||||
<main class="main">
|
||||
{body}
|
||||
</main>
|
||||
</div>
|
||||
<script>{_BASE_JS}</script>
|
||||
<script>{script}</script>
|
||||
</body>
|
||||
</html>"""
|
||||
|
||||
|
||||
def register_ui_routes(app: FastAPI) -> None:
|
||||
@app.get("/ui")
|
||||
async def ui_root() -> RedirectResponse:
|
||||
return RedirectResponse(url="/ui/tasks")
|
||||
|
||||
@app.get("/ui/login")
|
||||
async def ui_login() -> HTMLResponse:
|
||||
body = """
|
||||
<h1>Login</h1>
|
||||
<div class="card">
|
||||
<div class="muted">Paste your API token (without the <code>Bearer</code> prefix).</div>
|
||||
<div style="height:10px"></div>
|
||||
<input id="tok" placeholder="token..." />
|
||||
<div style="height:10px"></div>
|
||||
<div class="row">
|
||||
<button class="btn" id="save">Save</button>
|
||||
<button class="btn" id="clear">Clear</button>
|
||||
<a href="/ui/tasks" class="btn" style="display:inline-block">Go to Tasks</a>
|
||||
</div>
|
||||
<div style="height:10px"></div>
|
||||
<div id="msg" class="muted"></div>
|
||||
</div>
|
||||
""".strip()
|
||||
script = """
|
||||
const tokEl = document.getElementById("tok");
|
||||
const msg = document.getElementById("msg");
|
||||
tokEl.value = mvpTokenGet();
|
||||
document.getElementById("save").onclick = () => { mvpTokenSet(tokEl.value); msg.textContent = "Saved."; };
|
||||
document.getElementById("clear").onclick = () => { mvpTokenSet(""); tokEl.value = ""; msg.textContent = "Cleared."; };
|
||||
""".strip()
|
||||
return HTMLResponse(content=_page("Login", "login", body, script))
|
||||
|
||||
@app.get("/ui/tasks")
|
||||
async def ui_tasks() -> HTMLResponse:
|
||||
body = """
|
||||
<h1>Tasks</h1>
|
||||
<div class="card">
|
||||
<div class="row">
|
||||
<button class="btn" id="refresh">Refresh</button>
|
||||
<a class="btn" href="/ui/tasks/new" style="display:inline-block">New Task</a>
|
||||
</div>
|
||||
<div style="height:10px"></div>
|
||||
<div id="out" class="muted">Loading...</div>
|
||||
</div>
|
||||
""".strip()
|
||||
script = """
|
||||
const out = document.getElementById("out");
|
||||
async function refresh() {
|
||||
out.textContent = "Loading...";
|
||||
try {
|
||||
const q = await apiJson("/api/v2/queue");
|
||||
const completedLimit = 25;
|
||||
const completedOffset = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0;
|
||||
const done = await apiJson("/api/v2/tasks?limit=" + completedLimit + "&offset=" + completedOffset + "&states=SUCCEEDED,FAILED,CANCELED");
|
||||
|
||||
function pill(state) {
|
||||
const s = String(state || "");
|
||||
if (s === "SUCCEEDED") return `<span class="pill ok">${s}</span>`;
|
||||
if (s === "FAILED") return `<span class="pill bad">${s}</span>`;
|
||||
if (s === "CANCELED") return `<span class="pill">${s}</span>`;
|
||||
if (s === "RUNNING") return `<span class="pill ok">${s}</span>`;
|
||||
if (s === "QUEUED" || s === "PENDING_RESOURCES" || s === "SUBMITTING" || s === "SUBMITTED") return `<span class="pill">${s}</span>`;
|
||||
return `<span class="pill">${s}</span>`;
|
||||
}
|
||||
function row(t) {
|
||||
const id = t.task_id;
|
||||
return `<tr>
|
||||
<td><a href="/ui/tasks/${id}">${id}</a></td>
|
||||
<td>${t.workload}</td>
|
||||
<td>${pill(t.state)}</td>
|
||||
<td>${t.nnodes} x ${t.n_gpus_per_node} GPU</td>
|
||||
<td>${t.updated_at || ""}</td>
|
||||
</tr>`;
|
||||
}
|
||||
|
||||
const running = (q.running || []).map(row).join("");
|
||||
const pending = (q.pending || []).map(row).join("");
|
||||
const doneRows = (done.tasks || []).map(row).join("");
|
||||
const pageNo = Math.floor(completedOffset / completedLimit) + 1;
|
||||
const prevDisabled = completedOffset <= 0;
|
||||
const nextDisabled = !done.has_more;
|
||||
out.innerHTML = `
|
||||
<div class="muted">Tip: configure token in <a href="/ui/login">Login</a>.</div>
|
||||
<div style="height:10px"></div>
|
||||
<h3>Running</h3>
|
||||
<table><thead><tr><th>Task</th><th>Workload</th><th>State</th><th>Resources</th><th>Updated</th></tr></thead><tbody>${running || "<tr><td colspan=5 class=muted>(none)</td></tr>"}</tbody></table>
|
||||
<div style="height:12px"></div>
|
||||
<h3>Pending</h3>
|
||||
<table><thead><tr><th>Task</th><th>Workload</th><th>State</th><th>Resources</th><th>Updated</th></tr></thead><tbody>${pending || "<tr><td colspan=5 class=muted>(none)</td></tr>"}</tbody></table>
|
||||
<div style="height:12px"></div>
|
||||
<h3>Completed</h3>
|
||||
<div class="row" style="justify-content: space-between; margin-bottom: 8px;">
|
||||
<div class="muted">Page ${pageNo}</div>
|
||||
<div class="row">
|
||||
<button class="btn" id="done-prev" ${prevDisabled ? "disabled" : ""}>Prev</button>
|
||||
<button class="btn" id="done-next" ${nextDisabled ? "disabled" : ""}>Next</button>
|
||||
</div>
|
||||
</div>
|
||||
<table><thead><tr><th>Task</th><th>Workload</th><th>State</th><th>Resources</th><th>Updated</th></tr></thead><tbody>${doneRows || "<tr><td colspan=5 class=muted>(none)</td></tr>"}</tbody></table>
|
||||
`;
|
||||
|
||||
const prevBtn = document.getElementById("done-prev");
|
||||
const nextBtn = document.getElementById("done-next");
|
||||
if (prevBtn) prevBtn.onclick = () => {
|
||||
const cur = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0;
|
||||
const next = Math.max(0, cur - completedLimit);
|
||||
localStorage.setItem("mvp_completed_offset", String(next));
|
||||
refresh();
|
||||
};
|
||||
if (nextBtn) nextBtn.onclick = () => {
|
||||
const cur = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0;
|
||||
const next = cur + completedLimit;
|
||||
localStorage.setItem("mvp_completed_offset", String(next));
|
||||
refresh();
|
||||
};
|
||||
} catch (e) {
|
||||
out.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
|
||||
}
|
||||
}
|
||||
document.getElementById("refresh").onclick = refresh;
|
||||
refresh();
|
||||
""".strip()
|
||||
return HTMLResponse(content=_page("Tasks", "tasks", body, script))
|
||||

@app.get("/ui/tasks/new")
async def ui_new_task() -> HTMLResponse:
    ppo = """# PPO TaskSpec (YAML)
workload: ppo
nnodes: 2
n_gpus_per_node: 4
code_path: /private/common/code/verl/verl_repo
train_file: /private/common/datasets/gsm8k/train.parquet
val_file: /private/common/datasets/gsm8k/test.parquet
model_id: Qwen/Qwen2.5-0.5B-Instruct
""".strip()
    grpo = """# GRPO TaskSpec (YAML)
workload: grpo
nnodes: 2
n_gpus_per_node: 4
code_path: /private/common/code/verl/verl_repo
train_file: /private/common/datasets/gsm8k/train.parquet
val_file: /private/common/datasets/gsm8k/test.parquet
model_id: Qwen/Qwen2.5-0.5B-Instruct
""".strip()
    sft = """# SFT TaskSpec (YAML)
workload: sft
nnodes: 1
n_gpus_per_node: 1
code_path: /private/common/code/verl/verl_repo
train_file: /private/common/datasets/gsm8k_sft/train.parquet
val_file: /private/common/datasets/gsm8k_sft/test.parquet
model_id: Qwen/Qwen2.5-0.5B-Instruct
""".strip()
    body = f"""
<h1>New Task</h1>
<div class="card">
<div class="muted">Paste TaskSpec YAML and submit it to the API server. Note: <code>code_path</code> is required (v3.0 does not execute user code; use the common snapshot).</div>
<div style="height:10px"></div>
<div class="row">
<button class="btn" id="tpl-ppo">PPO example</button>
<button class="btn" id="tpl-grpo">GRPO example</button>
<button class="btn" id="tpl-sft">SFT example</button>
</div>
<div style="height:10px"></div>
<textarea id="yaml" rows="16">{html.escape(ppo)}</textarea>
<div style="height:10px"></div>
<div class="row">
<button class="btn" id="submit">Submit</button>
<a class="btn" href="/ui/tasks" style="display:inline-block">Back</a>
</div>
<div style="height:10px"></div>
<pre id="msg" class="muted"></pre>
</div>
""".strip()
    tpl_ppo = json.dumps(ppo)
    tpl_grpo = json.dumps(grpo)
    tpl_sft = json.dumps(sft)
    script = (
        """
const msg = document.getElementById("msg");
const yamlEl = document.getElementById("yaml");
const TPL_PPO = __TPL_PPO__;
const TPL_GRPO = __TPL_GRPO__;
const TPL_SFT = __TPL_SFT__;
document.getElementById("tpl-ppo").onclick = () => { yamlEl.value = TPL_PPO; msg.textContent = ""; };
document.getElementById("tpl-grpo").onclick = () => { yamlEl.value = TPL_GRPO; msg.textContent = ""; };
document.getElementById("tpl-sft").onclick = () => { yamlEl.value = TPL_SFT; msg.textContent = ""; };
document.getElementById("submit").onclick = async () => {
msg.textContent = "Submitting...";
const body = yamlEl.value;
const resp = await apiFetch("/api/v2/tasks", { method: "POST", headers: {"Content-Type":"text/plain"}, body });
const text = await resp.text();
if (!resp.ok) { msg.textContent = "Error: " + resp.status + "\\n" + text; return; }
const obj = JSON.parse(text);
msg.textContent = "OK: " + fmtJson(obj);
if (obj.task_id) window.location.href = "/ui/tasks/" + obj.task_id;
};
""".strip()
        .replace("__TPL_PPO__", tpl_ppo)
        .replace("__TPL_GRPO__", tpl_grpo)
        .replace("__TPL_SFT__", tpl_sft)
    )
    return HTMLResponse(content=_page("New Task", "new", body, script))

@app.get("/ui/tasks/{task_id}")
async def ui_task_detail(task_id: str) -> HTMLResponse:
    safe_id = html.escape(task_id)
    body = f"""
<h1>Task: <code>{safe_id}</code></h1>
<div class="card">
<div class="row">
<a class="btn" href="/ui/tasks/{safe_id}/logs" style="display:inline-block">Logs</a>
<button class="btn" id="refresh">Refresh</button>
<button class="btn danger" id="cancel">Cancel</button>
<a class="btn" href="/ui/tasks" style="display:inline-block">Back</a>
</div>
<div style="height:10px"></div>
<pre id="out" class="muted">Loading...</pre>
</div>
""".strip()
    script = f"""
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const out = document.getElementById("out");
async function refresh() {{
out.textContent = "Loading...";
const resp = await apiFetch("/api/v2/tasks/{task_id}");
const text = await resp.text();
if (!resp.ok) {{ out.textContent = "Error: " + resp.status + "\\n" + text; return; }}
out.textContent = fmtJson(JSON.parse(text));
}}
document.getElementById("refresh").onclick = refresh;
document.getElementById("cancel").onclick = async () => {{
if (!confirm("Cancel this task?")) return;
const resp = await apiFetch("/api/v2/tasks/{task_id}:cancel", {{ method: "POST" }});
const text = await resp.text();
out.textContent = (resp.ok ? "Canceled.\\n" : "Error: " + resp.status + "\\n") + text;
setTimeout(refresh, 800);
}};
refresh();
""".strip()
    return HTMLResponse(content=_page(f"Task {task_id}", "tasks", body, script))

@app.get("/ui/tasks/{task_id}/logs")
async def ui_task_logs(task_id: str) -> HTMLResponse:
    safe_id = html.escape(task_id)
    body = f"""
<h1>Logs: <code>{safe_id}</code></h1>
<div class="card">
<div class="row">
<button class="btn" id="refresh">Refresh</button>
<label class="muted">Auto refresh <input type="checkbox" id="auto" /></label>
<a class="btn" href="/ui/tasks/{safe_id}" style="display:inline-block">Back</a>
</div>
<div style="height:10px"></div>
<pre id="out" class="muted">Loading...</pre>
</div>
""".strip()
    script = f"""
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const out = document.getElementById("out");
let timer = null;
async function refresh() {{
const resp = await apiFetch("/api/v2/tasks/{task_id}/logs?tail=4000");
const text = await resp.text();
out.textContent = resp.ok ? text : ("Error: " + resp.status + "\\n" + text);
}}
document.getElementById("refresh").onclick = refresh;
document.getElementById("auto").onchange = (e) => {{
if (e.target.checked) {{
timer = setInterval(refresh, 2000);
}} else {{
if (timer) clearInterval(timer);
timer = null;
}}
}};
refresh();
""".strip()
    return HTMLResponse(content=_page(f"Logs {task_id}", "tasks", body, script))

@app.get("/ui/data")
async def ui_data() -> HTMLResponse:
    body = """
<h1>Data</h1>
<div class="card">
<div class="muted">User files live under your home directory. Keep long-term artifacts in <code>models/</code> or <code>datasets/</code>.</div>
<div style="height:14px"></div>

<div class="row">
<div style="flex:1; min-width:260px">
<div class="muted">Username</div>
<div style="height:6px"></div>
<div class="row" style="gap:8px">
<input id="u" readonly />
<button class="btn" id="copy-u">Copy</button>
</div>
</div>
<div style="flex:1; min-width:260px">
<div class="muted">SFTPGo password</div>
<div style="height:6px"></div>
<div class="row" style="gap:8px">
<input id="p" placeholder="Click Reset to generate..." />
<button class="btn" id="copy-p">Copy</button>
<button class="btn" id="reset-p">Reset</button>
</div>
</div>
</div>

<div style="height:12px"></div>
<div class="row">
<a class="btn" id="sftp-web" target="_blank" rel="noopener" href="#">Open SFTPGo Web Client (:8081)</a>
</div>

<div style="height:12px"></div>
<div class="muted">
You can also use an SFTP client (e.g. FileZilla) with the same username/password.
Host: <code id="sftp-host"></code>, Port: <code id="sftp-port"></code>.
</div>

<div style="height:14px"></div>
<pre id="out" class="muted">Loading...</pre>
</div>
""".strip()
    script = """
const out = document.getElementById("out");
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const u = document.getElementById("u");
const p = document.getElementById("p");
const sftpWeb = document.getElementById("sftp-web");
const sftpHost = document.getElementById("sftp-host");
const sftpPort = document.getElementById("sftp-port");
document.getElementById("copy-u").onclick = async () => { await copyText(u.value || ""); };
document.getElementById("copy-p").onclick = async () => { await copyText(p.value || ""); };

async function refresh() {
const resp = await apiFetch("/api/v2/me");
const text = await resp.text();
if (!resp.ok) { out.textContent = "Error: " + resp.status + "\\n" + text; return; }
const obj = JSON.parse(text);
u.value = (obj.user_id || "");
const cached = mvpSftpPasswordGet();
if (cached) p.value = cached;
const host = curOriginWithPort(8081);
sftpWeb.href = host + "/web/client";
sftpHost.textContent = (obj.sftp && obj.sftp.host) ? obj.sftp.host : window.location.hostname;
sftpPort.textContent = (obj.sftp && obj.sftp.port) ? String(obj.sftp.port) : "2022";
out.textContent = fmtJson(obj);
}
document.getElementById("reset-p").onclick = async () => {
p.value = "";
const resp = await apiFetch("/api/v2/me/sftp:reset_password", { method: "POST" });
const text = await resp.text();
if (!resp.ok) { out.textContent = "Error: " + resp.status + "\\n" + text; return; }
const obj = JSON.parse(text);
p.value = obj.password || "";
mvpSftpPasswordSet(p.value);
out.textContent = "SFTPGo password rotated.\\n\\n" + fmtJson(obj);
};
refresh();
""".strip()
    return HTMLResponse(content=_page("Data", "data", body, script))

@app.get("/ui/admin")
async def ui_admin() -> HTMLResponse:
    body = """
<h1>Admin</h1>
<div class="card">
<div class="muted">
This page requires the <code>admin</code> token (set it in <a href="/ui/login">Login</a>).
</div>
<div style="height:14px"></div>

<h3>Create user</h3>
<div class="row">
<input id="new-user-id" placeholder="user_id (e.g. alice)" style="max-width:320px" />
<input id="new-display-name" placeholder="display_name (optional)" style="max-width:320px" />
<button class="btn" id="create-user">Create</button>
</div>
<div style="height:10px"></div>
<pre id="create-msg" class="muted"></pre>

<div style="height:14px"></div>
<div class="row">
<button class="btn" id="refresh">Refresh</button>
</div>
<div style="height:10px"></div>
<div id="out" class="muted">Loading...</div>
</div>
""".strip()
    script = """
const out = document.getElementById("out");
const createMsg = document.getElementById("create-msg");
const userIdEl = document.getElementById("new-user-id");
const displayNameEl = document.getElementById("new-display-name");

function esc(s) {
s = String(s || "");
return s.replaceAll("&","&amp;").replaceAll("<","&lt;").replaceAll(">","&gt;");
}

async function refresh() {
out.textContent = "Loading...";
try {
const obj = await apiJson("/api/v2/users?limit=200");
const users = (obj.users || []);
function row(u) {
const uid = u.user_id;
const tok = u.token || "";
const tokShort = tok ? (tok.length > 18 ? (tok.slice(0, 18) + "…") : tok) : "";
const created = u.created_at || "";
const updated = u.updated_at || "";
const tCreated = u.token_created_at || "";
const tUsed = u.token_last_used_at || "";
return `<tr>
<td><code>${esc(uid)}</code></td>
<td class="muted">${esc(created)}</td>
<td class="muted">${esc(updated)}</td>
<td>
<div class="row" style="gap:8px">
<code title="${esc(tok)}">${esc(tokShort)}</code>
<button class="btn" data-copy="${esc(tok)}">Copy</button>
<button class="btn" data-issue="${esc(uid)}">Issue token</button>
</div>
<div class="muted" style="margin-top:6px">token_created_at: ${esc(tCreated)}; last_used_at: ${esc(tUsed)}</div>
</td>
</tr>`;
}
out.innerHTML = `
<table>
<thead><tr><th>User</th><th>Created</th><th>Updated</th><th>Token</th></tr></thead>
<tbody>${users.map(row).join("") || "<tr><td colspan=4 class=muted>(none)</td></tr>"}</tbody>
</table>
`;
for (const btn of out.querySelectorAll("button[data-copy]")) {
btn.onclick = async () => { await copyText(btn.getAttribute("data-copy") || ""); };
}
for (const btn of out.querySelectorAll("button[data-issue]")) {
btn.onclick = async () => {
const uid = btn.getAttribute("data-issue");
if (!uid) return;
try {
const r = await apiJson("/api/v2/users/" + encodeURIComponent(uid) + "/tokens", { method: "POST" });
createMsg.textContent = "Issued token:\\n" + fmtJson(r);
await refresh();
} catch (e) {
createMsg.textContent = "Error issuing token: " + (e.status || "") + "\\n" + (e.body || String(e));
}
};
}
} catch (e) {
out.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
}
}

document.getElementById("refresh").onclick = refresh;
document.getElementById("create-user").onclick = async () => {
createMsg.textContent = "Creating...";
const user_id = (userIdEl.value || "").trim();
const display_name = (displayNameEl.value || "").trim();
if (!user_id) { createMsg.textContent = "user_id is required"; return; }
const payload = { user_id: user_id };
if (display_name) payload.display_name = display_name;
try {
const r = await apiJson("/api/v2/users", { method: "POST", headers: {"Content-Type":"application/json"}, body: JSON.stringify(payload) });
createMsg.textContent = "Created:\\n" + fmtJson(r);
userIdEl.value = "";
displayNameEl.value = "";
await refresh();
} catch (e) {
createMsg.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
}
};

refresh();
""".strip()
    return HTMLResponse(content=_page("Admin", "admin", body, script))
@@ -16,6 +16,11 @@ def _write_config(tmp_path: Path) -> Path:
            "entrypoint_resources": {"worker_node": 1},
            "runtime_env": {"env_vars": {}},
        },
        "data": {
            # Avoid touching real /private in tests. Keep ray.shared_root as /private
            # so existing path validation tests remain unchanged.
            "user_root": str(tmp_path / "users"),
        },
        "service": {
            "api": {"host": "127.0.0.1", "port": 0},
            "auth": {"token_env": "MVP_INTERNAL_TOKEN"},
@@ -95,6 +100,17 @@ def test_task_submit_get_cancel_logs_queue(tmp_path: Path, monkeypatch):
        assert r3.status_code == 200
        assert "pending" in r3.json()

        r3b = c.get("/api/v2/tasks?limit=10", headers=headers)
        assert r3b.status_code == 200
        assert any(t.get("task_id") == "tid1" for t in r3b.json().get("tasks", []))

        r3c = c.get("/api/v2/tasks?limit=10&offset=0&states=QUEUED", headers=headers)
        assert r3c.status_code == 200
        assert all(t.get("state") == "QUEUED" for t in r3c.json().get("tasks", []))

        r3d = c.get("/api/v2/tasks?states=NOPE", headers=headers)
        assert r3d.status_code == 400

        r4 = c.post("/api/v2/tasks/tid1:cancel", headers=headers)
        assert r4.status_code == 200
        assert r4.json()["state"] == "CANCELED"
@@ -118,6 +134,14 @@ def test_task_submit_get_cancel_logs_queue(tmp_path: Path, monkeypatch):
        db.create_attempt(task_id="tid2", attempt_no=1, ray_submission_id="sid2")
        db.set_task_state(task_id="tid2", state="RUNNING", latest_attempt_no=1)

        r6 = c.get("/api/v2/tasks?limit=1&offset=0&states=RUNNING", headers=headers)
        assert r6.status_code == 200
        assert any(t.get("task_id") == "tid2" for t in r6.json().get("tasks", []))

        r7 = c.get("/api/v2/tasks?limit=1&offset=1&states=RUNNING", headers=headers)
        assert r7.status_code == 200
        assert "has_more" in r7.json()

        r5 = c.get("/api/v2/tasks/tid2/logs?tail=1", headers=headers)
        assert r5.status_code == 200
        assert r5.text.strip() == "c"
@@ -163,3 +187,102 @@ def test_submit_rejects_invalid_jobspec(tmp_path: Path, monkeypatch):
    with TestClient(app) as c:
        r = c.post("/api/v2/tasks", headers={"authorization": "Bearer token1"}, data="workload: nope\n")
        assert r.status_code == 400

def test_me_sftp_reset_password_disabled_returns_400(tmp_path: Path, monkeypatch):
    from argus.service import app as app_mod

    cfg_path = _write_config(tmp_path)
    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "token1")

    class _Scheduler:
        def __init__(self, **kwargs):
            self.tool = object()

        def run_forever(self, stop_flag):
            return None

    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
    app = app_mod.create_app(str(cfg_path))

    # seed user + token
    from argus.service.config import V2Config
    from argus.service.db import Db

    root = yaml.safe_load(cfg_path.read_text(encoding="utf-8"))
    v2_cfg = V2Config.from_root_dict(root)
    db = Db(v2_cfg.sqlite.db_path)
    db.init()
    db.create_user(user_id="u1", display_name=None)
    token = db.issue_token(user_id="u1")

    with TestClient(app) as c:
        r = c.post("/api/v2/me/sftp:reset_password", headers={"authorization": f"Bearer {token}"})
        assert r.status_code == 400

def test_me_sftp_reset_password_enabled_returns_password(tmp_path: Path, monkeypatch):
    from argus.service import app as app_mod

    cfg = yaml.safe_load(_write_config(tmp_path).read_text(encoding="utf-8"))
    cfg["data"]["sftpgo"] = {
        "enabled": True,
        "host": "127.0.0.1",
        "sftp_port": 2022,
        "admin_api_base": "http://127.0.0.1:8081",
        "admin_user": "admin",
        "admin_password_env": "SFTPGO_ADMIN_PASSWORD",
    }
    cfg_path = tmp_path / "cfg_sftp.yaml"
    cfg_path.write_text(yaml.safe_dump(cfg), encoding="utf-8")

    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "token1")
    monkeypatch.setenv("SFTPGO_ADMIN_PASSWORD", "pw1")

    class _FakeSFTPGo:
        def __init__(self, **kwargs):
            self.reset = []
            self.enabled = []

        def reset_password(self, username: str, new_password: str, home_dir: str):
            assert username
            assert new_password
            assert home_dir
            self.reset.append((username, home_dir))

        def enable_user(self, username: str, home_dir: str):
            self.enabled.append((username, home_dir))

    fake_client = _FakeSFTPGo()

    class _FakeSFTPGoFactory:
        def __call__(self, **kwargs):
            return fake_client

    class _Scheduler:
        def __init__(self, **kwargs):
            self.tool = object()

        def run_forever(self, stop_flag):
            return None

    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
    monkeypatch.setattr(app_mod, "SFTPGoAdminClient", _FakeSFTPGoFactory())
    app = app_mod.create_app(str(cfg_path))

    # seed user in DB
    from argus.service.db import Db
    from argus.service.config import V2Config

    v2_cfg = V2Config.from_root_dict(cfg)
    db = Db(v2_cfg.sqlite.db_path)
    db.init()
    db.create_user(user_id="u1", display_name=None)
    token = db.issue_token(user_id="u1")

    with TestClient(app) as c:
        r = c.post("/api/v2/me/sftp:reset_password", headers={"authorization": f"Bearer {token}"})
        assert r.status_code == 200
        j = r.json()
        assert j["user_id"] == "u1"
        assert isinstance(j["password"], str) and len(j["password"]) >= 8

225 src/mvp/py/tests/test_janitor.py Normal file
@@ -0,0 +1,225 @@
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from pathlib import Path

from argus.service.db import Db
from argus.service.janitor import JobsJanitor


def _iso_z(dt: datetime) -> str:
    return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")


def _mk_job_dir(user_root: Path, user_id: str, sid: str) -> Path:
    p = user_root / user_id / "jobs" / sid
    p.mkdir(parents=True, exist_ok=True)
    (p / "marker.txt").write_text("x", encoding="utf-8")
    return p


def _mk_trash_dir(user_root: Path, user_id: str, sid: str) -> Path:
    p = user_root / user_id / "trash" / "jobs" / sid
    p.mkdir(parents=True, exist_ok=True)
    (p / "marker.txt").write_text("x", encoding="utf-8")
    return p


def test_janitor_moves_jobs_to_trash_after_3_days(tmp_path: Path) -> None:
    db_path = tmp_path / "mvp.sqlite3"
    user_root = tmp_path / "users"
    db = Db(str(db_path))
    db.init()

    task_id = "t1"
    user_id = "alice"
    sid = "sid-a01"
    db.create_task_v25(task_id=task_id, user_id=user_id, workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
    db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)

    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
    ended = now - timedelta(days=4)
    db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="SUCCEEDED")

    src = _mk_job_dir(user_root, user_id, sid)
    jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
    jan.tick_once(now=now)

    assert not src.exists()
    dst = user_root / user_id / "trash" / "jobs" / sid
    assert dst.exists()
    assert (dst / "marker.txt").read_text(encoding="utf-8") == "x"


def test_janitor_purges_from_trash_after_7_days(tmp_path: Path) -> None:
    db_path = tmp_path / "mvp.sqlite3"
    user_root = tmp_path / "users"
    db = Db(str(db_path))
    db.init()

    task_id = "t2"
    user_id = "alice"
    sid = "sid-a01"
    db.create_task_v25(task_id=task_id, user_id=user_id, workload="ppo", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
    db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)

    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
    ended = now - timedelta(days=8)
    db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="FAILED")

    _mk_trash_dir(user_root, user_id, sid)
    jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
    jan.tick_once(now=now)

    dst = user_root / user_id / "trash" / "jobs" / sid
    assert not dst.exists()


def test_janitor_does_not_touch_recent_jobs(tmp_path: Path) -> None:
    db_path = tmp_path / "mvp.sqlite3"
    user_root = tmp_path / "users"
    db = Db(str(db_path))
    db.init()

    task_id = "t3"
    user_id = "alice"
    sid = "sid-a01"
    db.create_task_v25(task_id=task_id, user_id=user_id, workload="grpo", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
    db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)

    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
    ended = now - timedelta(days=1)
    db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="FAILED")

    src = _mk_job_dir(user_root, user_id, sid)
    jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
    jan.tick_once(now=now)

    assert src.exists()
    assert not (user_root / user_id / "trash" / "jobs" / sid).exists()


def test_janitor_skips_tasks_without_user_id(tmp_path: Path) -> None:
    db_path = tmp_path / "mvp.sqlite3"
    user_root = tmp_path / "users"
    db = Db(str(db_path))
    db.init()

    task_id = "legacy"
    sid = "sid-legacy"
    db.create_task(task_id=task_id, workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
    db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)

    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
    ended = now - timedelta(days=10)
    db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="SUCCEEDED")

    # Even if someone created a matching directory under user_root, janitor shouldn't touch it because user_id is NULL.
    src = _mk_job_dir(user_root, "alice", sid)
    jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
    jan.tick_once(now=now)
    assert src.exists()


def test_janitor_validates_retention_days(tmp_path: Path) -> None:
    db = Db(str(tmp_path / "mvp.sqlite3"))
    db.init()
    try:
        JobsJanitor(db=db, user_root="/tmp", trash_after_days=-1, purge_after_days=7, interval_s=1)
        raise AssertionError("expected ValueError")
    except ValueError:
        pass
    try:
        JobsJanitor(db=db, user_root="/tmp", trash_after_days=3, purge_after_days=1, interval_s=1)
        raise AssertionError("expected ValueError")
    except ValueError:
        pass


def test_janitor_noop_when_disabled(tmp_path: Path) -> None:
    db = Db(str(tmp_path / "mvp.sqlite3"))
    db.init()
    user_root = tmp_path / "users"
    jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=0, purge_after_days=0, interval_s=1)
    jan.tick_once(now=datetime(2025, 1, 1, tzinfo=timezone.utc))


def test_janitor_handles_invalid_end_time_and_missing_fields(tmp_path: Path) -> None:
    db_path = tmp_path / "mvp.sqlite3"
    user_root = tmp_path / "users"
    db = Db(str(db_path))
    db.init()

    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
    cutoff_ended = now - timedelta(days=10)

    # Missing end_time (empty string) => should be skipped.
    db.create_task_v25(task_id="t4", user_id="alice", workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
    db.create_attempt(task_id="t4", attempt_no=1, ray_submission_id="sid-empty")
    db.update_attempt(task_id="t4", attempt_no=1, end_time="", ray_status="SUCCEEDED")

    # Invalid ISO but still lexicographically <= cutoff => should be skipped.
    db.create_task_v25(task_id="t5", user_id="alice", workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
    db.create_attempt(task_id="t5", attempt_no=1, ray_submission_id="sid-bad")
    db.update_attempt(task_id="t5", attempt_no=1, end_time="2025-01-01T00:00:00ZZ", ray_status="FAILED")

    _mk_job_dir(user_root, "alice", "sid-empty")
    _mk_job_dir(user_root, "alice", "sid-bad")
    jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
    jan.tick_once(now=cutoff_ended + timedelta(days=10))

    assert (user_root / "alice" / "jobs" / "sid-empty").exists()
    assert (user_root / "alice" / "jobs" / "sid-bad").exists()


def test_janitor_purge_moves_from_jobs_then_deletes(tmp_path: Path, monkeypatch) -> None:
    db_path = tmp_path / "mvp.sqlite3"
    user_root = tmp_path / "users"
    db = Db(str(db_path))
    db.init()

    task_id = "t6"
    user_id = "alice"
    sid = "sid-a01"
    db.create_task_v25(task_id=task_id, user_id=user_id, workload="ppo", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
    db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)

    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
    ended = now - timedelta(days=9)
    db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="SUCCEEDED")

    src = _mk_job_dir(user_root, user_id, sid)
    jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)

    jan.tick_once(now=now)
    assert not src.exists()
    assert not (user_root / user_id / "trash" / "jobs" / sid).exists()


def test_janitor_run_forever_requires_event_like(tmp_path: Path) -> None:
    db = Db(str(tmp_path / "mvp.sqlite3"))
    db.init()
    jan = JobsJanitor(db=db, user_root=str(tmp_path / "users"), trash_after_days=3, purge_after_days=7, interval_s=1)
    try:
        jan.run_forever(object())
        raise AssertionError("expected ValueError")
    except ValueError:
        pass


def test_janitor_run_forever_survives_tick_exceptions(tmp_path: Path, monkeypatch) -> None:
    db = Db(str(tmp_path / "mvp.sqlite3"))
    db.init()
    jan = JobsJanitor(db=db, user_root=str(tmp_path / "users"), trash_after_days=3, purge_after_days=7, interval_s=1)

    class Flag:
        def __init__(self) -> None:
            self.n = 0

        def is_set(self) -> bool:
            self.n += 1
            return self.n >= 2

    monkeypatch.setattr(jan, "tick_once", lambda **_: (_ for _ in ()).throw(RuntimeError("boom")))
    monkeypatch.setattr("argus.service.janitor.time.sleep", lambda *_: None)
    jan.run_forever(Flag())
@@ -38,3 +38,18 @@
        V2Config.from_root_dict({"service": ["nope"]})
    with pytest.raises(ValueError, match="config\\.service\\.\\{api,auth,sqlite,scheduler\\} must be mappings"):
        V2Config.from_root_dict({"service": {"api": [1], "auth": {}, "sqlite": {}, "scheduler": {}}})


def test_v2_config_requires_data_mappings():
    from argus.service.config import V2Config

    base = {
        "ray": {"shared_root": "/private"},
        "service": {"api": {}, "auth": {}, "sqlite": {}, "scheduler": {}},
    }

    with pytest.raises(ValueError, match="config\\.data must be a mapping"):
        V2Config.from_root_dict({**base, "data": ["nope"]})

    with pytest.raises(ValueError, match="config\\.data\\.\\{sftpgo,retention\\} must be mappings"):
        V2Config.from_root_dict({**base, "data": {"sftpgo": ["x"], "retention": {}}})

322 src/mvp/py/tests/test_sftpgo.py Normal file
@@ -0,0 +1,322 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import yaml
|
||||
from fastapi.testclient import TestClient
|
||||
|
||||
|
||||
def _write_config(tmp_path: Path, *, enabled: bool) -> Path:
|
||||
cfg = {
|
||||
"ray": {
|
||||
"address": "http://127.0.0.1:8265",
|
||||
"shared_root": "/private",
|
||||
"entrypoint_resources": {"worker_node": 1},
|
||||
"runtime_env": {"env_vars": {}},
|
||||
},
|
||||
"data": {
|
||||
"user_root": str(tmp_path / "users"),
|
||||
"sftpgo": {
|
||||
"enabled": bool(enabled),
|
||||
"admin_api_base": "http://127.0.0.1:8081/api/v2",
|
||||
"admin_user": "admin",
|
||||
"admin_password_env": "SFTPGO_ADMIN_PASSWORD",
|
||||
"host": "h1.internal",
|
||||
"sftp_port": 2022,
|
||||
},
|
||||
},
|
||||
"service": {
|
||||
"api": {"host": "127.0.0.1", "port": 0},
|
||||
"auth": {"token_env": "MVP_INTERNAL_TOKEN"},
|
||||
"sqlite": {"db_path": str(tmp_path / "mvp.sqlite3")},
|
||||
"scheduler": {"tick_s": 1, "retry_interval_s": 1, "max_running_tasks": 1},
|
||||
},
|
||||
}
|
||||
p = tmp_path / "cfg.yaml"
|
||||
p.write_text(yaml.safe_dump(cfg), encoding="utf-8")
|
||||
return p
|
||||
|
||||
|
||||
def test_create_user_calls_sftpgo_when_enabled(tmp_path: Path, monkeypatch):
    from argus.service import app as app_mod

    cfg_path = _write_config(tmp_path, enabled=True)
    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
    monkeypatch.setenv("SFTPGO_ADMIN_PASSWORD", "pw1")

    calls = {"create": [], "disable": [], "reset": []}

    class _Client:
        def __init__(self, **kwargs):
            pass

        def create_user(self, *, username: str, password: str, home_dir: str) -> None:
            calls["create"].append((username, password, home_dir))

        def enable_user(self, *, username: str, home_dir: str) -> None:
            # Not used in this test, but required by app for idempotent upsert.
            return None

        def disable_user(self, *, username: str, home_dir: str) -> None:
            calls["disable"].append(username)

        def reset_password(self, *, username: str, new_password: str, home_dir: str) -> None:
            calls["reset"].append((username, new_password))

    monkeypatch.setattr(app_mod, "SFTPGoAdminClient", _Client)

    class _Scheduler:
        def __init__(self, **kwargs):
            self.tool = object()

        def run_forever(self, stop_flag):
            return None

    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
    app = app_mod.create_app(str(cfg_path))

    admin_headers = {"authorization": "Bearer adm1"}
    with TestClient(app) as c:
        r = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"})
        assert r.status_code == 200
        assert calls["create"]
        username, password, home_dir = calls["create"][-1]
        assert username == "alice"
        assert password
        assert home_dir.endswith("/users/alice")

        r2 = c.post("/api/v2/users/alice:disable", headers=admin_headers)
        assert r2.status_code == 200
        assert calls["disable"] == ["alice"]

        r3 = c.post("/api/v2/users/alice/sftp:reset_password", headers=admin_headers)
        assert r3.status_code == 200
        body = r3.json()
        assert body["user_id"] == "alice"
        assert body["password"]
        assert calls["reset"] and calls["reset"][-1][0] == "alice"

def test_create_user_upserts_when_sftpgo_user_already_exists(tmp_path: Path, monkeypatch):
    from argus.service import app as app_mod
    from argus.service.sftpgo import SFTPGoError

    cfg_path = _write_config(tmp_path, enabled=True)
    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
    monkeypatch.setenv("SFTPGO_ADMIN_PASSWORD", "pw1")

    calls = {"create": 0, "reset": [], "enable": []}

    class _Client:
        def __init__(self, **kwargs):
            pass

        def create_user(self, *, username: str, password: str, home_dir: str) -> None:
            calls["create"] += 1
            raise SFTPGoError("sftpgo http error: 409 Conflict")

        def reset_password(self, *, username: str, new_password: str, home_dir: str) -> None:
            calls["reset"].append((username, new_password))

        def enable_user(self, *, username: str, home_dir: str) -> None:
            calls["enable"].append(username)

        def disable_user(self, *, username: str, home_dir: str) -> None:
            return None

    monkeypatch.setattr(app_mod, "SFTPGoAdminClient", _Client)

    class _Scheduler:
        def __init__(self, **kwargs):
            self.tool = object()

        def run_forever(self, stop_flag):
            return None

    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
    app = app_mod.create_app(str(cfg_path))

    admin_headers = {"authorization": "Bearer adm1"}
    with TestClient(app) as c:
        r = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "bob"})
        assert r.status_code == 200
        body = r.json()
        assert body["user_id"] == "bob"
        assert body.get("sftp", {}).get("password")
        assert calls["create"] == 1
        assert calls["reset"] and calls["reset"][-1][0] == "bob"
        assert calls["enable"] == ["bob"]

def test_sftpgo_enabled_requires_admin_password_env(tmp_path: Path, monkeypatch):
    from argus.service import app as app_mod

    cfg_path = _write_config(tmp_path, enabled=True)
    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
    monkeypatch.delenv("SFTPGO_ADMIN_PASSWORD", raising=False)

    class _Scheduler:
        def __init__(self, **kwargs):
            self.tool = object()

        def run_forever(self, stop_flag):
            return None

    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
    app = app_mod.create_app(str(cfg_path))

    admin_headers = {"authorization": "Bearer adm1"}
    with TestClient(app) as c:
        r = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"})
        assert r.status_code == 500
        assert "SFTPGO_ADMIN_PASSWORD" in r.text

def test_sftpgo_admin_client_builds_requests(monkeypatch):
    import json
    from urllib.request import Request

    from argus.service.sftpgo import SFTPGoAdminClient
    import argus.service.sftpgo as mod

    seen: list[Request] = []

    class _Resp:
        def __init__(self, body: bytes = b"ok"):
            self._body = body

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc, tb):
            return False

        def read(self):
            return self._body

    def fake_urlopen(req, timeout=0):
        seen.append(req)
        if req.full_url.endswith("/token"):
            return _Resp(body=b'{"access_token":"t1","expires_at":0}')
        # allow folder creates to be idempotent in tests
        if req.full_url.endswith("/folders") and req.get_method() == "POST":
            return _Resp(body=b'{"message":"created"}')
        if req.full_url.endswith("/users/alice") and req.get_method() == "GET":
            return _Resp(
                body=b'{"username":"alice","status":1,"home_dir":"/private/users/alice","uid":0,"gid":0,"permissions":{"/":["*"]},"virtual_folders":[]}'
            )
        return _Resp()

    monkeypatch.setattr(mod, "urlopen", fake_urlopen)

    c = SFTPGoAdminClient(admin_api_base="http://sftpgo.local/api/v2", admin_user="admin", admin_password="pw", common_root="/private/common")
    c.create_user(username="alice", password="p1", home_dir="/private/users/alice")
    c.disable_user(username="alice", home_dir="/private/users/alice")
    c.enable_user(username="alice", home_dir="/private/users/alice")
    c.reset_password(username="alice", new_password="p2", home_dir="/private/users/alice")

    # Each operation fetches a token, then issues a request.
    # For create_user: token + ensure folders (2 POSTs) + create user
    # For disable/enable/reset: token + ensure folders (2 POSTs) + GET user + PUT user
    assert len(seen) == 1 + 2 + 1 + 3 * (1 + 2 + 1 + 1)
    assert seen[0].full_url.endswith("/token")
    assert seen[0].headers.get("Authorization", "").startswith("Basic ")
    # create_user
    assert seen[1].full_url.endswith("/folders")
    assert seen[2].full_url.endswith("/folders")
    assert seen[3].full_url.endswith("/users")
    assert seen[3].headers.get("Authorization", "") == "Bearer t1"
    created = json.loads(seen[3].data.decode("utf-8"))
    assert created["username"] == "alice"
    assert "/common" in created.get("permissions", {})
    assert "/common/datasets" in created.get("permissions", {})
    assert "/common/hf" in created.get("permissions", {})

    # disable_user
    assert seen[4].full_url.endswith("/token")
    assert seen[5].full_url.endswith("/folders")
    assert seen[6].full_url.endswith("/folders")
    assert seen[7].full_url.endswith("/users/alice")
    assert seen[7].get_method() == "GET"
    assert seen[8].full_url.endswith("/users/alice")
    assert seen[8].get_method() == "PUT"

    # enable_user
    assert seen[9].full_url.endswith("/token")
    assert seen[10].full_url.endswith("/folders")
    assert seen[11].full_url.endswith("/folders")
    assert seen[12].full_url.endswith("/users/alice")
    assert seen[12].get_method() == "GET"
    assert seen[13].full_url.endswith("/users/alice")
    assert seen[13].get_method() == "PUT"

    # reset_password
    assert seen[14].full_url.endswith("/token")
    assert seen[15].full_url.endswith("/folders")
    assert seen[16].full_url.endswith("/folders")
    assert seen[17].full_url.endswith("/users/alice")
    assert seen[17].get_method() == "GET"
    assert seen[18].full_url.endswith("/users/alice")
    assert seen[18].get_method() == "PUT"

def test_sftpgo_admin_client_http_error_raises(monkeypatch):
    from urllib.error import HTTPError

    from argus.service.sftpgo import SFTPGoAdminClient, SFTPGoError
    import argus.service.sftpgo as mod

    def fake_urlopen(req, timeout=0):
        if req.full_url.endswith("/token"):
            class _Resp:
                def __enter__(self):
                    return self

                def __exit__(self, exc_type, exc, tb):
                    return False

                def read(self):
                    return b'{"access_token":"t1","expires_at":0}'

            return _Resp()
        raise HTTPError(req.full_url, 500, "boom", hdrs=None, fp=None)

    monkeypatch.setattr(mod, "urlopen", fake_urlopen)

    c = SFTPGoAdminClient(admin_api_base="http://sftpgo.local/api/v2", admin_user="admin", admin_password="pw", common_root="/private/common")
    try:
        c.create_user(username="alice", password="p1", home_dir="/private/users/alice")
        assert False, "expected SFTPGoError"
    except SFTPGoError as e:
        assert "http error" in str(e)

def test_sftpgo_admin_client_url_error_raises(monkeypatch):
    from urllib.error import URLError

    from argus.service.sftpgo import SFTPGoAdminClient, SFTPGoError
    import argus.service.sftpgo as mod

    def fake_urlopen(req, timeout=0):
        if req.full_url.endswith("/token"):
            class _Resp:
                def __enter__(self):
                    return self

                def __exit__(self, exc_type, exc, tb):
                    return False

                def read(self):
                    return b'{"access_token":"t1","expires_at":0}'

            return _Resp()
        raise URLError("no route")

    monkeypatch.setattr(mod, "urlopen", fake_urlopen)

    c = SFTPGoAdminClient(admin_api_base="http://sftpgo.local/api/v2", admin_user="admin", admin_password="pw")
    try:
        c.disable_user(username="alice", home_dir="/private/users/alice")
        assert False, "expected SFTPGoError"
    except SFTPGoError as e:
        assert "connection error" in str(e)
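The tests above stub `urlopen` around a two-step admin auth flow: exchange the admin credentials for a JWT at `/token` using HTTP Basic, then send the JWT as a `Bearer` header on every follow-up call. A minimal sketch of just that request-building logic (not the project's actual `SFTPGoAdminClient`; the helper names here are illustrative):

```python
# Sketch of SFTPGo's two-step admin auth that the tests stub out:
# Basic auth on GET /token yields a JWT, later calls carry it as Bearer.
import base64
from urllib.request import Request


def build_token_request(api_base, user, password):
    # HTTP Basic credentials for the one-off token exchange.
    cred = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(f"{api_base}/token", headers={"Authorization": f"Basic {cred}"})


def build_admin_request(api_base, token, path, data=None, method="GET"):
    # Every subsequent admin call reuses the JWT as a Bearer token.
    return Request(
        f"{api_base}{path}",
        data=data,
        headers={"Authorization": f"Bearer {token}"},
        method=method,
    )


tok_req = build_token_request("http://sftpgo.local/api/v2", "admin", "pw")
usr_req = build_admin_request("http://sftpgo.local/api/v2", "t1", "/users/alice")
```

This mirrors what `test_sftpgo_admin_client_builds_requests` asserts: the first request carries `Basic `, the rest carry `Bearer t1`.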
78
src/mvp/py/tests/test_ui.py
Normal file
@ -0,0 +1,78 @@
from __future__ import annotations

from pathlib import Path

from fastapi.testclient import TestClient

from argus.service.app import create_app


def _write_config(tmp_path: Path) -> Path:
    p = tmp_path / "cfg.yaml"
    p.write_text(
        """
ray:
  address: "http://127.0.0.1:8265"
  shared_root: "/private"
  entrypoint_num_cpus: 1
  entrypoint_resources: { worker_node: 1 }
  runtime_env: { env_vars: { PYTHONUNBUFFERED: "1" } }
service:
  api: { host: "127.0.0.1", port: 8080 }
  auth: { token_env: "MVP_INTERNAL_TOKEN" }
  sqlite: { db_path: "%(db)s" }
data:
  user_root: "%(users)s"
  sftpgo: { enabled: false }
  retention: { jobs_trash_after_days: 3, jobs_purge_after_days: 7, janitor_interval_s: 3600 }
"""
        % {"db": str(tmp_path / "mvp.sqlite3"), "users": str(tmp_path / "users")}
    )
    return p

def test_ui_routes_render_200(tmp_path, monkeypatch):
    cfg = _write_config(tmp_path)
    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "admin-token")
    app = create_app(str(cfg))
    c = TestClient(app)

    for path in (
        "/ui",
        "/ui/login",
        "/ui/tasks",
        "/ui/tasks/new",
        "/ui/data",
        "/ui/admin",
        "/ui/tasks/any-task-id",
        "/ui/tasks/any-task-id/logs",
    ):
        r = c.get(path, allow_redirects=True)
        assert r.status_code == 200
        assert "<html" in r.text.lower()

def test_ui_contains_sidebar_links(tmp_path, monkeypatch):
    cfg = _write_config(tmp_path)
    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "admin-token")
    app = create_app(str(cfg))
    c = TestClient(app)

    r = c.get("/ui/tasks")
    assert r.status_code == 200
    for link in ("/ui/tasks", "/ui/tasks/new", "/ui/data", "/ui/login", "/ui/admin"):
        assert link in r.text
    assert "Ray Dashboard" in r.text

def test_ui_task_detail_shows_ids(tmp_path, monkeypatch):
    cfg = _write_config(tmp_path)
    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "admin-token")
    app = create_app(str(cfg))
    c = TestClient(app)

    task_id = "mvp3-ppo-20250101-000000-aaaa"
    r = c.get(f"/ui/tasks/{task_id}")
    assert r.status_code == 200
    assert task_id in r.text
    assert f"/ui/tasks/{task_id}/logs" in r.text
@ -14,6 +14,11 @@ def _write_config(tmp_path: Path) -> Path:
            "entrypoint_resources": {"worker_node": 1},
            "runtime_env": {"env_vars": {}},
        },
        "data": {
            # Avoid touching real /private in tests. Keep ray.shared_root as /private
            # so existing path validation tests remain unchanged.
            "user_root": str(tmp_path / "users"),
        },
        "service": {
            "api": {"host": "127.0.0.1", "port": 0},
            "auth": {"token_env": "MVP_INTERNAL_TOKEN"},
@ -59,6 +64,9 @@ def test_admin_create_user_issue_token_and_disabled_rejected(tmp_path: Path, mon

    admin_headers = {"authorization": "Bearer adm1"}
    with TestClient(app) as c:
        # list users requires admin
        assert c.get("/api/v2/users", headers={"authorization": "Bearer nope"}).status_code in (401, 403)

        r1 = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice", "display_name": "Alice"})
        assert r1.status_code == 200
        assert r1.json()["user_id"] == "alice"
@ -68,6 +76,11 @@ def test_admin_create_user_issue_token_and_disabled_rejected(tmp_path: Path, mon
        token = r2.json()["token"]
        assert token

        r2b = c.get("/api/v2/users", headers=admin_headers)
        assert r2b.status_code == 200
        users = r2b.json()["users"]
        assert any(u.get("user_id") == "alice" for u in users)

        # non-admin token can access regular endpoints
        r3 = c.get("/api/v2/queue", headers={"authorization": f"Bearer {token}"})
        assert r3.status_code == 200
@ -177,3 +190,165 @@ def test_submit_rejects_non_common_inputs(tmp_path: Path, monkeypatch):
        )
        assert r.status_code == 400
        assert "code_path must start with /private/common/" in r.text

def test_submit_accepts_user_dataset_paths_and_local_model_paths(tmp_path: Path, monkeypatch):
    from argus.service import app as app_mod

    cfg_path = _write_config(tmp_path)
    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")

    class _Scheduler:
        def __init__(self, **kwargs):
            self.tool = object()

        def run_forever(self, stop_flag):
            return None

    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
    app = app_mod.create_app(str(cfg_path))

    admin_headers = {"authorization": "Bearer adm1"}
    with TestClient(app) as c:
        assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
        alice_tok = c.post("/api/v2/users/alice/tokens", headers=admin_headers).json()["token"]
        alice_headers = {"authorization": f"Bearer {alice_tok}"}

        # User dataset paths are allowed.
        r1 = c.post(
            "/api/v2/tasks",
            headers=alice_headers,
            data=(
                "workload: ppo\n"
                "code_path: /private/common/code/verl\n"
                "model_id: Qwen/Qwen2.5-0.5B-Instruct\n"
                "train_file: /private/users/alice/datasets/t\n"
            ),
        )
        assert r1.status_code == 200

        # Local model paths under user models/ are allowed (no TaskSpec schema change).
        r2 = c.post(
            "/api/v2/tasks",
            headers=alice_headers,
            data=(
                "workload: ppo\n"
                "code_path: /private/common/code/verl\n"
                "model_id: /private/users/alice/models/m1\n"
                "train_file: /private/common/datasets/t\n"
            ),
        )
        assert r2.status_code == 200

def test_submit_rejects_cross_user_paths_and_bad_local_model_dirs(tmp_path: Path, monkeypatch):
    from argus.service import app as app_mod

    cfg_path = _write_config(tmp_path)
    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")

    class _Scheduler:
        def __init__(self, **kwargs):
            self.tool = object()

        def run_forever(self, stop_flag):
            return None

    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
    app = app_mod.create_app(str(cfg_path))

    admin_headers = {"authorization": "Bearer adm1"}
    with TestClient(app) as c:
        assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
        assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "bob"}).status_code == 200
        alice_tok = c.post("/api/v2/users/alice/tokens", headers=admin_headers).json()["token"]
        bob_tok = c.post("/api/v2/users/bob/tokens", headers=admin_headers).json()["token"]
        bob_headers = {"authorization": f"Bearer {bob_tok}"}

        # Cross-user dataset path should be rejected.
        r1 = c.post(
            "/api/v2/tasks",
            headers=bob_headers,
            data=(
                "workload: ppo\n"
                "code_path: /private/common/code/verl\n"
                "model_id: Qwen/Qwen2.5-0.5B-Instruct\n"
                "train_file: /private/users/alice/datasets/t\n"
            ),
        )
        assert r1.status_code == 400
        assert "/private/users/bob/datasets/" in r1.text

        # Local model path must be under models/.
        r2 = c.post(
            "/api/v2/tasks",
            headers=bob_headers,
            data=(
                "workload: ppo\n"
                "code_path: /private/common/code/verl\n"
                "model_id: /private/users/bob/jobs/j1/checkpoints\n"
                "train_file: /private/common/datasets/t\n"
            ),
        )
        assert r2.status_code == 400
        assert "model_id local path must start with" in r2.text

def test_me_returns_paths_and_retention(tmp_path: Path, monkeypatch):
    from argus.service import app as app_mod

    cfg_path = _write_config(tmp_path)
    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")

    class _Scheduler:
        def __init__(self, **kwargs):
            self.tool = object()

        def run_forever(self, stop_flag):
            return None

    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
    app = app_mod.create_app(str(cfg_path))

    admin_headers = {"authorization": "Bearer adm1"}
    with TestClient(app) as c:
        assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
        alice_tok = c.post("/api/v2/users/alice/tokens", headers=admin_headers).json()["token"]

        r = c.get("/api/v2/me", headers={"authorization": f"Bearer {alice_tok}"})
        assert r.status_code == 200
        obj = r.json()
        assert obj["user_id"] == "alice"
        assert obj["paths"]["home"].endswith("/users/alice")
        assert obj["paths"]["jobs"].endswith("/users/alice/jobs")
        assert obj["paths"]["trash_jobs"].endswith("/users/alice/trash/jobs")
        assert obj["retention"]["jobs_trash_after_days"] == 3
        assert obj["retention"]["jobs_purge_after_days"] == 7

def test_create_user_creates_user_dirs(tmp_path: Path, monkeypatch):
    from argus.service import app as app_mod

    cfg_path = _write_config(tmp_path)
    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")

    class _Scheduler:
        def __init__(self, **kwargs):
            self.tool = object()

        def run_forever(self, stop_flag):
            return None

    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
    app = app_mod.create_app(str(cfg_path))

    admin_headers = {"authorization": "Bearer adm1"}
    with TestClient(app) as c:
        assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200

    base = tmp_path / "users" / "alice"
    assert (base / "datasets").is_dir()
    assert (base / "models").is_dir()
    assert (base / "code").is_dir()
    assert (base / "jobs").is_dir()
    assert (base / "trash" / "jobs").is_dir()
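The directory test above pins down the per-user skeleton that user creation must materialize under `data.user_root`. A small sketch of that layout (the helper name `ensure_user_dirs` is illustrative, not the service's actual API):

```python
# Sketch of the per-user home skeleton asserted by
# test_create_user_creates_user_dirs: datasets/, models/, code/, jobs/,
# and trash/jobs/ under <user_root>/<user_id>.
import tempfile
from pathlib import Path

USER_SUBDIRS = ("datasets", "models", "code", "jobs", "trash/jobs")


def ensure_user_dirs(user_root: Path, user_id: str) -> Path:
    home = user_root / user_id
    for sub in USER_SUBDIRS:
        # parents=True also creates intermediate dirs (e.g. trash/ for trash/jobs).
        (home / sub).mkdir(parents=True, exist_ok=True)
    return home


home = ensure_user_dirs(Path(tempfile.mkdtemp()) / "users", "alice")
```

Because `exist_ok=True` is used throughout, re-running creation for an existing user is a no-op, matching the idempotent upsert behavior the SFTPGo tests exercise.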
@ -9,4 +9,4 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/lib.sh"

dexec "${HEAD_CONTAINER}" bash -lc "python3 -m pip install -U pip >/dev/null 2>&1 || true"
dexec "${HEAD_CONTAINER}" bash -lc "python3 -m pip install -r /workspace/mvp/py/requirements.txt"
dexec "${HEAD_CONTAINER}" bash -lc "python3 -c 'import fastapi,uvicorn,yaml' >/dev/null 2>&1 && echo 'api_deps_ok: skip' || (python3 -m pip install -r /workspace/mvp/py/requirements.txt || echo 'WARN: api deps install failed; continuing with preinstalled deps')"
@ -22,7 +22,12 @@ if [[ -z "${MVP_INTERNAL_TOKEN:-}" ]]; then
  exit 1
fi

docker exec -d -e MVP_INTERNAL_TOKEN="${MVP_INTERNAL_TOKEN}" "${HEAD_CONTAINER}" bash -lc "nohup python3 /workspace/mvp/py/server.py --config '${CONFIG_IN_CONTAINER}' >>'${LOG_PATH}' 2>&1 & echo \$! >'${PID_PATH}'"
env_args=(-e "MVP_INTERNAL_TOKEN=${MVP_INTERNAL_TOKEN}")
if [[ -n "${SFTPGO_ADMIN_PASSWORD:-}" ]]; then
  env_args+=(-e "SFTPGO_ADMIN_PASSWORD=${SFTPGO_ADMIN_PASSWORD}")
fi

docker exec -d "${env_args[@]}" "${HEAD_CONTAINER}" bash -lc "nohup python3 /workspace/mvp/py/server.py --config '${CONFIG_IN_CONTAINER}' >>'${LOG_PATH}' 2>&1 & echo \$! >'${PID_PATH}'"

echo "[host] started; pid stored in ${PID_PATH} (container path)"
echo "[host] logs: ${LOG_PATH} (container path)"
242
src/mvp/scripts/run_all_v30_api.sh
Executable file
@ -0,0 +1,242 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"

# E2E v3.0:
# - Start Ray (head + stateless workers) + SFTPGo (compose)
# - Start API server with v3.0 config
# - Create user (API triggers SFTPGo user create) and return one-time SFTP password
# - (Optional) verify SFTP login
# - Submit PPO/GRPO/SFT referencing user dataset paths and wait

API_ADDR="${API_ADDR:-http://127.0.0.1:8080}"
ADMIN_TOKEN="${MVP_INTERNAL_TOKEN:-}"
USER_ID="${USER_ID:-alice}"
RESET_DB="${RESET_DB:-1}"
RESET_SFTPGO="${RESET_SFTPGO:-0}"
EXPECTED_RAY_NODES="${EXPECTED_RAY_NODES:-3}"  # head + 2 workers
CLUSTER_NAME="${CLUSTER_NAME:-argus-ray}"

CONFIG_IN_CONTAINER="${CONFIG_IN_CONTAINER:-/workspace/mvp/configs/dev_v30.yaml}"
SFTPGO_ADMIN_PASSWORD="${SFTPGO_ADMIN_PASSWORD:-my-dev-sftpgo-admin}"
export SFTPGO_ADMIN_PASSWORD

if [[ -z "${ADMIN_TOKEN}" ]]; then
  echo "ERROR: MVP_INTERNAL_TOKEN must be set in host env (admin token)" >&2
  exit 1
fi

api_curl_admin() {
  curl -sS -H "Authorization: Bearer ${ADMIN_TOKEN}" "$@"
}

api_wait_ready() {
  local tries="${1:-60}"
  for i in $(seq 1 "${tries}"); do
    if curl -sS -m 2 "${API_ADDR}/docs" >/dev/null 2>&1; then
      echo "[host] api_ready: ${API_ADDR}"
      return 0
    fi
    echo "[host] waiting api... (${i}/${tries})"
    sleep 2
  done
  echo "ERROR: api not ready: ${API_ADDR}" >&2
  return 1
}

sftpgo_wait_ready() {
  local tries="${1:-60}"
  local url="${2:-http://127.0.0.1:8081/api/v2/token}"
  for i in $(seq 1 "${tries}"); do
    # SFTPGo admin endpoints require auth; readiness means "HTTP reachable and can issue token".
    if curl -sS -m 2 -u "admin:${SFTPGO_ADMIN_PASSWORD}" "${url}" >/dev/null 2>&1; then
      echo "[host] sftpgo_ready: ${url} (token ok)"
      return 0
    fi
    echo "[host] waiting sftpgo... (${i}/${tries})"
    sleep 2
  done
  echo "ERROR: sftpgo not ready: ${url}" >&2
  return 1
}

ray_wait_ready() {
  local tries="${1:-60}"
  for i in $(seq 1 "${tries}"); do
    if curl -sS -m 2 "${RAY_DASHBOARD_ADDR}/api/version" >/dev/null 2>&1; then
      echo "[host] ray_dashboard_ready: ${RAY_DASHBOARD_ADDR}"
      return 0
    fi
    echo "[host] waiting ray dashboard... (${i}/${tries})"
    sleep 2
  done
  echo "ERROR: ray dashboard not ready: ${RAY_DASHBOARD_ADDR}" >&2
  return 1
}

ray_wait_nodes() {
  local want="${1:-3}"
  local tries="${2:-60}"
  for i in $(seq 1 "${tries}"); do
    local out n
    out="$(docker exec -i "${HEAD_CONTAINER}" python3 -c "import ray; ray.init(address='auto', ignore_reinit_error=True, log_to_driver=False, logging_level='ERROR'); print(sum(1 for n in ray.nodes() if n.get('Alive')))" 2>/dev/null || true)"
    n="$(printf '%s\n' "${out}" | tail -n 1 | tr -cd '0-9' || true)"
    if [[ "${n}" =~ ^[0-9]+$ ]]; then
      echo "[host] ray_nodes_alive=${n} (want>=${want})"
      if [[ "${n}" -ge "${want}" ]]; then
        return 0
      fi
    else
      echo "[host] waiting ray nodes... (${i}/${tries})"
    fi
    sleep 2
  done
  echo "ERROR: ray nodes not ready (want>=${want})" >&2
  docker exec -i "${HEAD_CONTAINER}" bash -lc "ray status || true" >&2 || true
  return 1
}

submit_taskspec_inline() {
  local token="$1"
  local yaml_body="$2"
  local resp
  resp="$(curl -sS -H "Authorization: Bearer ${token}" -H "Content-Type: application/yaml" --data-binary "${yaml_body}" "${API_ADDR}/api/v2/tasks")"
  echo "[host] submit_resp: ${resp}" >&2
  printf '%s' "${resp}" | python3 -c 'import sys,json; print(json.load(sys.stdin)["task_id"])'
}

wait_task() {
  local token="$1"
  local task_id="$2"
  while true; do
    local body state
    body="$(curl -sS -H "Authorization: Bearer ${token}" "${API_ADDR}/api/v2/tasks/${task_id}")"
    state="$(printf '%s' "${body}" | python3 -c 'import sys,json; print(json.load(sys.stdin)["state"])')"
    echo "[host] task ${task_id}: ${state}"

    if [[ "${state}" == "SUCCEEDED" ]]; then
      return 0
    fi
    if [[ "${state}" == "FAILED" || "${state}" == "CANCELED" ]]; then
      echo "[host] terminal=${state}; tail logs (best-effort):" >&2
      curl -sS -H "Authorization: Bearer ${token}" "${API_ADDR}/api/v2/tasks/${task_id}/logs?tail=200" >&2 || true
      return 1
    fi
    sleep 10
  done
}

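The shell function above is a plain poll-until-terminal loop over GET `/api/v2/tasks/{task_id}`. The same logic can be sketched as a small Python helper (a client-side sketch only; the state names match the script, while the `poll` callable stands in for the HTTP call):

```python
# Sketch of wait_task's polling loop: call poll() until the task reaches a
# terminal state. poll() is an injected callable so the loop can be exercised
# without a live API server.
TERMINAL_OK = {"SUCCEEDED"}
TERMINAL_BAD = {"FAILED", "CANCELED"}


def wait_for_task(poll, max_polls: int = 1000) -> str:
    """poll() returns the current state string; return the terminal state."""
    for _ in range(max_polls):
        state = poll()
        if state in TERMINAL_OK or state in TERMINAL_BAD:
            return state
    raise TimeoutError("task did not reach a terminal state")


# Simulated state sequence as the API would report it over successive polls.
states = iter(["QUEUED", "RUNNING", "RUNNING", "SUCCEEDED"])
final = wait_for_task(lambda: next(states))
```

In the real script each iteration additionally sleeps 10s and, on FAILED/CANCELED, tails the task logs before returning non-zero.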
echo "[host] ===== run_all_v30_api.sh begin ====="

"${SCRIPT_DIR}/00_prereq_check.sh"
"${SCRIPT_DIR}/03_cleanup_v1_legacy.sh"
"${SCRIPT_DIR}/04_cleanup_v2_legacy.sh"

echo "[host] bring down existing containers (best-effort)"
"${SCRIPT_DIR}/02_down.sh" || true

if [[ "${RESET_SFTPGO}" == "1" ]]; then
  echo "[host] reset sftpgo metadata dir (best-effort, via helper container)"
  SFTPGO_META_DIR="${ROOT_DIR}/../../shared/common/sftpgo"
  mkdir -p "${SFTPGO_META_DIR}"
  docker run --rm --entrypoint sh -u 0:0 -v "${SFTPGO_META_DIR}:/mnt" drakkan/sftpgo:latest -lc "rm -rf /mnt/* || true"
fi

echo "[host] (re)create containers (Ray + SFTPGo)"
"${SCRIPT_DIR}/01_up.sh"

echo "[host] wait ray head ready"
ray_wait_ready 60

echo "[host] wait sftpgo ready"
sftpgo_wait_ready 60 "http://127.0.0.1:8081/api/v2/token"

echo "[host] render v3.0 config with SFTPGo container IP (work around docker DNS issues)"
SFTPGO_IP="$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' argus-sftpgo)"
RENDERED_CFG_HOST_PATH="/tmp/dev_v30_rendered.yaml"
sed -E "s#^(\\s*admin_api_base:) .*#\\1 \"http://${SFTPGO_IP}:8080/api/v2\"#g" "${ROOT_DIR}/configs/dev_v30.yaml" >"${RENDERED_CFG_HOST_PATH}"
docker cp "${RENDERED_CFG_HOST_PATH}" "${HEAD_CONTAINER}:/tmp/dev_v30_rendered.yaml"
CONFIG_IN_CONTAINER="/tmp/dev_v30_rendered.yaml"

echo "[host] verify head discovery record (supervised in-container)"
HEAD_IP_FILE="${SHARED_ROOT}/ray/discovery/${CLUSTER_NAME}/head.json"
dexec "${HEAD_CONTAINER}" bash -lc "test -f '${HEAD_IP_FILE}' && python3 -c 'import json,sys; print(json.load(open(sys.argv[1]))[\"job_server_url\"])' '${HEAD_IP_FILE}' || true"

echo "[host] wait workers join"
ray_wait_nodes "${EXPECTED_RAY_NODES}" 120

echo "[host] prepare data/model (idempotent; reuse cache)"
"${SCRIPT_DIR}/30_prepare_data_and_model.sh"

echo "[host] install api deps in head container"
"${SCRIPT_DIR}/12_install_api_deps.sh"

echo "[host] stop api (best-effort)"
"${SCRIPT_DIR}/61_stop_api.sh" || true

if [[ "${RESET_DB}" == "1" ]]; then
  echo "[host] reset api sqlite db in container (best-effort)"
  docker exec -i "${HEAD_CONTAINER}" bash -lc "rm -f /private/common/db/mvp.sqlite3 /private/common/db/mvp.sqlite3-wal /private/common/db/mvp.sqlite3-shm || true"
else
  echo "[host] keep existing api sqlite db (RESET_DB=${RESET_DB})"
fi

echo "[host] start api (admin token + sftpgo admin password via env)"
MVP_INTERNAL_TOKEN="${ADMIN_TOKEN}" CONFIG_IN_CONTAINER="${CONFIG_IN_CONTAINER}" SFTPGO_ADMIN_PASSWORD="${SFTPGO_ADMIN_PASSWORD}" "${SCRIPT_DIR}/60_start_api.sh"
api_wait_ready 60

echo "[host] create user (expect SFTP one-time password in response)"
create_resp="$(api_curl_admin -H "Content-Type: application/json" -d "{\"user_id\":\"${USER_ID}\"}" "${API_ADDR}/api/v2/users")"
echo "[host] create_user_resp: ${create_resp}"
USER_SFTP_PASSWORD="$(printf '%s' "${create_resp}" | python3 -c 'import sys,json; o=json.load(sys.stdin); print((o.get("sftp") or {}).get("password") or "")')"
if [[ -z "${USER_SFTP_PASSWORD}" ]]; then
  echo "ERROR: expected sftp.password in create user response (is data.sftpgo.enabled=true?)" >&2
  exit 1
fi

echo "[host] issue user token"
USER_TOKEN="$(api_curl_admin -X POST "${API_ADDR}/api/v2/users/${USER_ID}/tokens" | python3 -c 'import sys,json; print(json.load(sys.stdin)["token"])')"
echo "[host] user_token_issued: user=${USER_ID}"

echo "[host] (optional) verify SFTP login (best-effort)"
if command -v sshpass >/dev/null 2>&1 && command -v sftp >/dev/null 2>&1; then
  tmp_batch="$(mktemp)"
  cat >"${tmp_batch}" <<EOF
pwd
ls
EOF
  # NOTE: this just proves auth works; dataset upload can be done via SFTP manually later.
  sshpass -p "${USER_SFTP_PASSWORD}" sftp -o StrictHostKeyChecking=no -P 2022 "${USER_ID}@127.0.0.1" -b "${tmp_batch}" >/dev/null 2>&1 || true
  rm -f "${tmp_batch}" || true
else
  echo "[host] skip: sshpass/sftp not found; you can test manually with: sftp -P 2022 ${USER_ID}@<host>"
fi

echo "[host] ensure user dataset paths exist (copy from common if needed; best-effort)"
echo "[host] copy dataset into user datasets path inside head container (avoid host permission issues)"
dexec "${HEAD_CONTAINER}" bash -lc "set -euo pipefail; \
  mkdir -p '/private/users/${USER_ID}/datasets/gsm8k' '/private/users/${USER_ID}/datasets/gsm8k_sft'; \
  (cp -f /private/common/datasets/gsm8k/train.parquet '/private/users/${USER_ID}/datasets/gsm8k/train.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k/train.parquet '/private/users/${USER_ID}/datasets/gsm8k/train.parquet' 2>/dev/null || true); \
  (cp -f /private/common/datasets/gsm8k/test.parquet '/private/users/${USER_ID}/datasets/gsm8k/test.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k/test.parquet '/private/users/${USER_ID}/datasets/gsm8k/test.parquet' 2>/dev/null || true); \
  (cp -f /private/common/datasets/gsm8k_sft/train.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/train.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k_sft/train.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/train.parquet' 2>/dev/null || true); \
  (cp -f /private/common/datasets/gsm8k_sft/test.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/test.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k_sft/test.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/test.parquet' 2>/dev/null || true)"

echo "[host] submit PPO/GRPO/SFT via API using user dataset paths"
PPO_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: ppo\nnnodes: 2\nn_gpus_per_node: 4\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/test.parquet\nmodel_id: Qwen/Qwen2.5-0.5B-Instruct\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')"
GRPO_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: grpo\nnnodes: 2\nn_gpus_per_node: 4\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/test.parquet\nmodel_id: Qwen/Qwen2.5-0.5B-Instruct\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')"
GRPO_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: grpo\nnnodes: 2\nn_gpus_per_node: 4\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/test.parquet\nmodel_id: Qwen/Qwen2.5-0.5B-Instruct\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')"
|
||||
SFT_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: sft\nnnodes: 1\nn_gpus_per_node: 1\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k_sft/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k_sft/test.parquet\nmodel_id: Qwen/Qwen2.5-0.5B-Instruct\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')"
|
||||
|
||||
echo "[host] submitted task ids:"
|
||||
echo " ppo=${PPO_TASK_ID}"
|
||||
echo " grpo=${GRPO_TASK_ID}"
|
||||
echo " sft=${SFT_TASK_ID}"
|
||||
|
||||
echo "[host] wait for tasks (in submission order)"
|
||||
wait_task "${USER_TOKEN}" "${PPO_TASK_ID}"
|
||||
wait_task "${USER_TOKEN}" "${GRPO_TASK_ID}"
|
||||
wait_task "${USER_TOKEN}" "${SFT_TASK_ID}"
|
||||
|
||||
echo "[host] ===== run_all_v30_api.sh done ====="
|
||||
19
src/mvp/scripts/run_e2e_v30_cases.sh
Executable file
@@ -0,0 +1,19 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Minimal E2E runner for v3.0:
# - Happy path: Ray + SFTPGo + API + 3 workloads
# - Leaves retention validation as a manual follow-up (adjust thresholds).

ADMIN_TOKEN="${MVP_INTERNAL_TOKEN:-my-dev-token}"
USER_ID="${USER_ID:-alice}"

echo "[case] HP-1: run_all_v30_api.sh (Ray + SFTPGo + API + 3 workloads)"
MVP_INTERNAL_TOKEN="${ADMIN_TOKEN}" USER_ID="${USER_ID}" RESET_DB=1 RESET_SFTPGO=0 "${SCRIPT_DIR}/run_all_v30_api.sh"

echo "[case] NOTE: retention validation (C2) is manual:"
echo " - set data.retention.jobs_trash_after_days / jobs_purge_after_days to small values in configs/dev_v30.yaml"
echo " - restart API server and wait for janitor"
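For the manual retention case (C2), `run_e2e_v30_cases.sh` only names the two retention keys. A dev override for a fast manual test might look like the fragment below — a sketch only: the key names and the `configs/dev_v30.yaml` location come from the script's echo hints, while the nesting under `data.retention` and the day values are assumptions.

```yaml
# configs/dev_v30.yaml — retention overrides for a fast manual retention test
# (values are illustrative; pick thresholds small enough for the janitor to act quickly)
data:
  retention:
    jobs_trash_after_days: 1   # assumed meaning: move finished job dirs to trash after 1 day
    jobs_purge_after_days: 2   # assumed meaning: permanently delete trashed job dirs after 2 days
```

After editing the config, restart the API server and wait for the janitor pass, as the script's note instructs.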