v3.0 development and verification complete; the WebUI and minimal data-management requirements are met. The task templates are still fairly basic and could be extended before opening the release for general use.

This commit is contained in:
yuyr 2025-12-30 17:29:50 +08:00
parent 7b89d60b3b
commit 6d3fefc7a6
30 changed files with 3764 additions and 16 deletions

Three binary image files added (previews not shown): 76 KiB, 52 KiB, 65 KiB.

specs/mvp/v3.0/README.md Normal file

@ -0,0 +1,15 @@
# MVP v3.0 Design — WebUI + user data upload/download (SFTPGo) → first releasable version
This directory is based on:
- `specs/mvp/mvp_roadmap_v2.md` (overall roadmap)
- `specs/mvp/image/roadmap_v3.0.png` (v3.0 iteration diagram)
- the already-delivered v2.5 (User Mgmt + Stateless Ray Node Pool)
The goal is to build on v2.5 by completing the **user data loop** (upload → visible to training → artifact download) plus a minimal usable **WebUI**, yielding a "releasable" v3.0.
Documents:
- `specs/mvp/v3.0/v3.0_design.md`: overall architecture and key mechanisms (WebUI, SFTPGo, data/permission model, task flow).
- `specs/mvp/v3.0/v3.0_api.md`: v3.0 API extension design (UI, data, SFTPGo administration, permission constraints).
- `specs/mvp/v3.0/v3.0_acceptance.md`: deployment/upgrade/acceptance procedures and verifiable criteria (including fault injection and the regression checklist).
- `specs/mvp/v3.0/v3.0_dev_plan.md`: TDD-driven engineering plan (milestone breakdown, test layering, E2E acceptance).
- `specs/mvp/v3.0/v3.0_progress.md`: implementation progress log (a record is appended after each milestone).

specs/mvp/v3.0/v3.0_acceptance.md Normal file
@ -0,0 +1,55 @@
# MVP v3.0 — Deployment and Acceptance Procedure (draft)
## 0) Environment prerequisites
- Ray cluster: carried over from v2.5 (head + stateless workers, automatic join)
- Shared storage: `/private` mounted inside containers (dev/prod aligned)
- API server: code mounted from the host into the head container and started there
- New: SFTPGo service (containerized deployment recommended)
## 1) Deployment steps (high level)
1) Deploy/upgrade the Ray node image (reuse v2.5's `argus/argus-ray-node:v2.5` or newer)
2) Start the Ray cluster (via compose or platform-created containers)
3) Start/configure SFTPGo (mount `/private`)
4) Start the API server (inside the head container)
5) Start the WebUI (served by the API server)
## 2) Acceptance cases (must pass)
### A. Users and credentials
1) admin creates user `alice` and issues an API token
2) The system creates `alice` in SFTPGo (home=/private/users/alice)
3) `alice` logs in to the WebUI with the token (or calls `/api/v2/me` successfully)
### B. Upload data loop (core)
1) `alice` uploads a dataset via SFTP to `/private/users/alice/datasets/...`
2) `alice` submits a task via WebUI/API with a TaskSpec referencing that path
3) A Ray worker reads the data; the task goes RUNNING and eventually SUCCEEDED
### C. Artifact download loop
1) After training completes, artifacts land in `/private/users/alice/jobs/<submission_id>/...`
2) `alice` downloads checkpoints/logs via SFTP successfully
3) (New) `alice` moves weights that need long-term retention from `jobs/<submission_id>/...` to `models/` and confirms they persist after the move
### C2. Jobs trash and automatic cleanup (move to trash after 3 days; permanently delete 7 days later)
1) Set `jobs_trash_after_days`/`jobs_purge_after_days` to small values (e.g. minutes, for verification)
2) Training completes and the task reaches a terminal state
3) After the API server's built-in janitor scan cycle, confirm the corresponding `jobs/<submission_id>` is moved to `trash/jobs/<submission_id>`
4) Within the trash window, move a file from `trash/jobs/<submission_id>` to `models/` and confirm the move succeeds
5) After `jobs_purge_after_days` elapses, confirm `trash/jobs/<submission_id>` is permanently deleted
6) Confirm files already moved to `models/` are not deleted
### D. Security isolation (must pass)
1) `bob` cannot query `alice`'s tasks via the API (404)
2) `bob` cannot submit a TaskSpec referencing `/private/users/alice/...` (400/403)
3) `bob` cannot access `/private/users/alice/...` via SFTP (chroot enforced)
## 3) Fault injection (recommended to pass)
1) kill a worker watchdog or raylet → the worker recovers automatically and rejoins the cluster
2) restart the head container → head rewrites `head.json`; workers reconnect automatically
3) restart SFTPGo → the Ray cluster is unaffected; users can reconnect to upload/download
## 4) Regression checklist (same as v2.5)
- Task queue and retries (INSUFFICIENT_RESOURCES → PENDING_RESOURCES → retry)
- All three workloads (PPO/GRPO/SFT) run to completion
- The head runs no training (driver forced onto workers)

specs/mvp/v3.0/v3.0_api.md Normal file

@ -0,0 +1,109 @@
# MVP v3.0 — API Extension Design (based on v2.5)
The v3.0 principle: **reuse the v2.5 API as much as possible**, adding only the minimal endpoints needed for the "data loop" and "WebUI support".
## 1) Authentication and authorization
Carried over from v2.5:
- Header: `Authorization: Bearer <token>`
- admin token: from `MVP_INTERNAL_TOKEN`
- regular user tokens: issued by admin and persisted in SQLite
Authorization rules:
- non-admin users can only access their own tasks and their own data space (`/private/users/<user_id>/...`).
- cross-user access returns 404 (no existence leak).
## 2) User and SFTPGo linkage (admin endpoints)
### 2.1 Create user (reuses v2.5)
`POST /api/v2/users`
- v3.0 behavior: on success, **optionally** create the matching SFTPGo user
- v3.0 enables the linkage by default: create the SFTPGo user + generate a one-time password (password auth)
- v3.0 keeps only this approach (Option A); no external auth/SSO integration (deferred to a later version)
- `data.sftpgo.admin_api_base` is recommended in the form `http://argus-sftpgo:8080/api/v2` (including the `/api/v2` prefix)
### 2.2 Issue token (reuses v2.5)
`POST /api/v2/users/{user_id}/tokens`
### 2.3 Disable user (reuses v2.5)
`POST /api/v2/users/{user_id}:disable`
- v3.0 behavior: also disable the SFTPGo user (optional)
### 2.4 SFTP credential management (new; admin-operated or self-service)
(Whether v3.0 needs "user self-service" or "admin-operated" is for you to confirm.)
#### Reset SFTP password (admin)
`POST /api/v2/users/{user_id}/sftp:reset_password`
- Returns: a one-time password (returned exactly once; the server does not store the plaintext)
> v3.0 only implements the password scheme; SSH public keys are an optional enhancement for a later version (out of v3.0 scope).
## 3) User self-service info (new)
### 3.1 Get current user info
`GET /api/v2/me`
- Example response:
```json
{
  "user_id": "alice",
  "is_admin": false,
  "paths": {
    "home": "/private/users/alice",
    "datasets": "/private/users/alice/datasets",
    "models": "/private/users/alice/models",
    "code": "/private/users/alice/code",
    "jobs": "/private/users/alice/jobs",
    "trash_jobs": "/private/users/alice/trash/jobs"
  },
  "retention": {
    "jobs_trash_after_days": 3,
    "jobs_purge_after_days": 7
  },
  "sftp": {
    "host": "h1.example.internal",
    "port": 2022,
    "username": "alice"
  }
}
```
### 3.2 Jobs retention hints (new)
To support WebUI display and user expectation management, `/api/v2/me` (or a dedicated endpoint) may return:
- `jobs_trash_after_days`: default 3
- `jobs_purge_after_days`: default 7
- `jobs_root`: `/private/users/<me>/jobs`
- `trash_jobs_root`: `/private/users/<me>/trash/jobs`
- `recommendations`: remind users to move artifacts that need long-term retention into `models/` or `datasets/`
## 4) Data browsing/download (optional; v3.0 minimal)
Note: the primary upload/download channel remains SFTP.
If the WebUI offers "quick browse/view", it may implement read-only endpoints (avoiding large-file uploads, resumable transfers, and similar complexity).
### 4.1 List directory
`GET /api/v2/files?path=/private/users/alice`
- Authorization: path must be under `/private/common/` or `/private/users/<me>/`
- Returns a file list (name/type/size/mtime)
### 4.2 Download file (small files primarily)
`GET /api/v2/files:download?path=/private/users/alice/jobs/.../logs/...`
- Returns: a streamed download
- Large files should still go through SFTP
## 5) TaskSpec path validation upgrade (v3.0 key change)
v2.5: only `/private/common/...` allowed.
v3.0: `/private/common/...` and `/private/users/<me>/...` allowed.
Applied fields (at minimum):
- `train_file` / `val_file`
- `code_path`: still restricted to `/private/common/...` (v3.0 does not execute user code)
- local model path field (if introduced): `/private/users/<me>/models/...` allowed
## 6) WebUI routes (new)
Served by the API server:
- `GET /ui`: main page
- `GET /ui/login`: token login page
- static assets: `/ui/static/...`
All WebUI operations call the same-origin API (no extra CORS configuration).

specs/mvp/v3.0/v3.0_design.md Normal file
@ -0,0 +1,358 @@
# MVP v3.0 Detailed Design (based on v2.5)
## 0. Executive summary (what v3.0 delivers)
v3.0 = v2.5 + **WebUI** + **user data upload/download (SFTPGo)**, forming the first externally releasable version:
- Users can upload data/models/code (at minimum data) via **SFTP**; files land on GPFS (`/private` inside containers) and are visible to Ray workers.
- Users can submit training tasks via API/WebUI that read their own uploaded data.
- Users can download training artifacts (checkpoints/logs etc.), closing the minimal loop.
## 1. Scope and principles
### 1.1 Inherited v2.5 assumptions (no regressions)
- **Stateless Ray Node Pool**: head writes `head.json`; worker watchdogs auto join/self-heal.
- **User Management**: token auth, task visibility isolation (cross-user requests get 404, no existence leak).
- **Job artifact isolation**: Ray job directories land in `/private/users/<user_id>/jobs/<ray_submission_id>/...`.
- **API server short-term deployment**: code on the host, mounted into the head container and started there (status quo).
### 1.2 New v3.0 goals
1) **Data Management (SFTPGo)**
  - Provide a user upload/download entry point (SFTP primarily).
  - Data lands on GPFS (NFS/GPFS in dev, GPFS in production); training jobs read it directly inside worker containers.
2) **WebUI**
  - Users can visually create tasks, view the queue/status/logs, and see the "data path conventions" and their own SFTP info.
  - Goal is "usable, not polished": support the core workflow.
3) **Permission loop**
  - Users may only use data under their own directory (`/private/users/<user_id>/...`) or the common directory (`/private/common/...`).
  - Prevent tasks from reading other users' file paths.
### 1.3 Explicitly out of scope for v3.0 (deferred to v3.5)
- No "custom reward functions / custom verl code / multiple verl versions side by side" (roadmap v3.5).
- No complex serving / unified train-serve stack (roadmap v3.5).
- No IB network/topology optimization (roadmap v3.5).
- No system-level observability platform (roadmap v4.0).
## 2. Architecture overview
See `roadmap_v3.0.png`; v3.0's control plane and data plane:
### 2.1 Control Plane
- **API Server (FastAPI)**
  - reuses v2.5's task queue/scheduling/retry plus user management
  - new: data management (SFTPGo integration) + WebUI
- **WebUI**
  - logs in through the API with a token
  - provides task/log/data entry points (does not run training itself)
- **Ray Head (stateful node)**
  - still inside the head container (or a dedicated node)
  - the job server/dashboard provides job submit/status/logs
### 2.2 Data Plane
- **GPFS (mounted at `/private` in containers)**
  - holds the two root trees: common and users
- **Ray Worker Nodes (stateless)**
  - auto-connect to head and run training
  - read data under `/private/users/<user>/...`
### 2.3 New component: SFTPGo (Data Management)
- Runs as a standalone service (containerized preferred); the storage backend is **filesystem** (the GPFS mount path).
- Each user's home directory points to `/private/users/<user_id>` (or a subdirectory).
## 3. Storage and directory conventions (unified for v3.0)
### 3.1 Directory hierarchy
`/private` inside containers is the uniform root (dev/prod aligned):
- `/private/common/`: shared resources
  - `hf/`: HF cache
  - `datasets/`: shared datasets (optional)
  - `code/`: shared code (e.g. a shared verl repo snapshot)
  - `db/`: SQLite (queue, users, tokens)
  - `logs/`: API/supervisor/watchdog logs
- `/private/users/<user_id>/`: user space (the v3.0 focus)
  - `datasets/`: user-uploaded datasets (recommended)
  - `models/`: user-saved/uploaded local models (allowed; also used for "move job artifacts into long-term storage")
  - `code/`: user-uploaded code (v3.0 **does not execute it**; storage/download only)
  - `jobs/`: training artifacts (already delivered in v2.5)
  - `tmp/`: temporary files (optional)
### 3.2 Jobs retention (two-stage: move to trash after 3 days, purge 7 days later)
v3.0 introduces a **two-stage retention policy for the jobs directory**:
- Stage 1 (soft delete): **3 days** after a job ends, **move** its directory from `jobs/` to the user's trash directory.
- Stage 2 (hard delete): **7 days** after entering the trash directory, **permanently delete** it.
Directory convention (suggested):
- jobs root: `/private/users/<user_id>/jobs/<ray_submission_id>/...`
- trash: `/private/users/<user_id>/trash/jobs/<ray_submission_id>/...`
Timing rules:
- The clock starts at the job's terminal-state timestamp (SUCCEEDED/FAILED/CANCELED);
- "3 days" governs the move from `jobs/` to `trash/jobs/`;
- "7 days" governs permanent deletion from `trash/jobs/` (so at most a 10-day window in total).
How users keep important artifacts (no keep flag needed):
- within the 3-day window, **move/copy** files that need long-term retention from `jobs/<submission_id>/...` to `models/` (e.g. weights) or `datasets/` (e.g. evaluation outputs);
- even after a job has entered the trash, users can still move files from `trash/jobs/<submission_id>/...` to `models/` / `datasets/` within the 7-day window;
- the janitor only manages `jobs/` and `trash/jobs/`; it never touches `models/` or `datasets/`.
We call this cleanup program the **janitor**:
- Definition: a background cleanup executor that periodically scans for "finished and expired" job directories and removes them.
- v3.0 implements exactly the product rule "move to trash after 3 days + delete after 7 more" (no keep/extend-retention flag).
Implementation suggestions (per your preference):
- Run the **janitor as a built-in background thread of the API server**:
  - Pros: natural access to SQLite (task state, end time, user_id, ray_submission_id); cleanup results can be written back to the events table for auditing.
  - Simpler deployment: no extra cronjob/standalone service.
- Delete/move actions should **operate directly on the GPFS/NFS filesystem** (the API server runs in the head container with `/private` mounted):
  - Stage 1: `os.rename` (atomic move within one filesystem) moves `jobs/<sid>` to `trash/jobs/<sid>`.
    - If the move crosses filesystems (which should not happen), fall back to copy+delete.
    - Validate the path prefix strictly before moving (must be under `.../users/<u>/jobs/`).
  - Stage 2: recursively delete `trash/jobs/<sid>` (e.g. `shutil.rmtree`), with the same prefix check (must be under `.../users/<u>/trash/jobs/`).
- Why not rely on the SFTPGo API: SFTPGo is only the user-facing access layer (SFTP/Web); the directories live on the same filesystem, and direct filesystem access is simpler and does not require SFTPGo to be online.
- If you strongly prefer "delete via the SFTPGo API":
  - it can be an optional/supplementary implementation (e.g. for unified auditing or future quota/policy hooks), but it should not be the only mechanism (an SFTPGo outage must not block cleanup).
### 3.3 Users moving/organizing files inside SFTPGo (confirmed)
Users may move/rename/organize files in SFTPGo (e.g. move weights from `jobs/` to `models/`):
- Prerequisite: the SFTPGo user's permissions allow `rename/mkdir/remove` etc. within their home directory (writable by default in v3.0).
- Behavior: users may move files from `jobs/` into `models/` or `datasets/` for long-term retention of weights/evaluation artifacts.
- Relation to retention: once a file is moved out of `jobs/`, the jobs cleanup logic will never delete it.
### 3.4 Path authorization rules (validated on the API side)
The v2.5 constraint was "only `/private/common/...` allowed".
v3.0 upgrades it to:
- Allowed:
  - `/private/common/...`
  - `/private/users/<current_user_id>/...`
- Denied:
  - any other absolute path (e.g. `/private/users/other/...`, `/etc/...`)
Apply the rule to the relevant TaskSpec fields (at minimum):
- `train_file` / `val_file`
- `code_path`: still restricted to `/private/common/...` (v3.0 does not execute user code)
- local model path field: `/private/users/<me>/models/...` allowed (confirmed for v3.0)
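The rules above can be sketched as a small validator (illustrative; `validate_task_paths` and its signature are assumptions, not the actual `app.py` code):

```python
from pathlib import PurePosixPath

COMMON_PREFIX = "/private/common/"

def _is_under(path: str, prefix: str) -> bool:
    # reject traversal tricks like /private/users/bob/../alice outright
    if ".." in PurePosixPath(path).parts:
        return False
    return path.startswith(prefix)

def validate_task_paths(user_id: str, train_file: str, code_path: str) -> None:
    user_prefix = f"/private/users/{user_id}/"
    # train/val files: common or the submitting user's own space
    if not (_is_under(train_file, COMMON_PREFIX) or _is_under(train_file, user_prefix)):
        raise ValueError(f"path not allowed: {train_file}")
    # code_path: common only (v3.0 does not execute user code)
    if not _is_under(code_path, COMMON_PREFIX):
        raise ValueError(f"code_path must be under {COMMON_PREFIX}")
```

Note the `..` check: a plain prefix comparison alone would accept `/private/users/bob/../alice/...`, so traversal segments are rejected before the prefix test.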
## 4. SFTPGo design (Data Management)
### 4.1 Deployment form
Run SFTPGo in its own container (decoupled from Ray/API), mounting the same `/private`:
- the `sftpgo` container mounts `../../shared:/private`
- exposed ports:
  - SFTP (2022 suggested)
  - WebAdmin/API (8081 suggested; intranet/admin access only)
#### 4.1.1 Image source (prebuilt Docker image)
Prebuilt SFTPGo Docker images exist (no need to build our own):
- v3.0 should prefer the official/upstream `sftpgo` image as the runtime base.
- We do not need to customize SFTPGo code in v3.0; we only need to:
  - mount GPFS/NFS correctly (`/private` inside the container)
  - configure the admin account (used by the API server to create/disable users and reset passwords)
  - configure each user's home/chroot
> Note: the exact image name/tag may vary per environment (official/registry policies change). At rollout, run `docker search sftpgo` on `argus@h1` or pin a version from your internal registry; the v3.0 design only requires "use a prebuilt image" and does not depend on a specific tag.
#### 4.1.2 docker-compose service draft (illustrative)
Below is an **illustration** (final image name/tag and port plan subject to your environment):
```yaml
services:
  sftpgo:
    image: sftpgo/sftpgo:latest  # example: a prebuilt image
    container_name: argus-sftpgo
    ports:
      - "2022:2022"  # SFTP
      - "8081:8080"  # WebAdmin/API (intranet/admin only)
    volumes:
      - ../../shared:/private
      - ../../shared/common/sftpgo:/var/lib/sftpgo  # persist SFTPGo metadata (optional but recommended)
    environment:
      # admin account/password (illustrative; exact variable names per the image docs)
      SFTPGO_ADMIN_USERNAME: "admin"
      SFTPGO_ADMIN_PASSWORD: "${SFTPGO_ADMIN_PASSWORD}"
```
Integration points with v3.0:
- the API server uses `data.sftpgo.admin_api_base` + admin credentials to create users
- user home/chroot uniformly points to `/private/users/<user_id>`
### 4.2 User isolation
Each user's home dir in SFTPGo is bound to:
- `/private/users/<user_id>` (chroot); users can only read/write their own directory.
### 4.3 User creation and credential management (two options; start with A)
**Option A (recommended for v3.0): the API server creates SFTPGo users as a side effect**
- after v2.5's `POST /api/v2/users` succeeds:
  - the API server calls the SFTPGo admin API to create a same-named user
  - home dir = `/private/users/<user_id>`
  - permissions set (writable by default; read-only configurable)
- authentication:
  - v3.0 minimum: username + password (confirmed: v3.0 uses password first; the API generates a one-time password, and users are asked to change it after first login)
  - or: SSH public key (the WebUI accepts an uploaded public key; the API writes it into SFTPGo)
**Option B (stronger but complex): SFTPGo external authentication**
- SFTPGo delegates authentication to the API server (token/SSO); SFTP also uses internal tokens.
- High complexity; skip in v3.0, revisit in v3.5 or later.
### 4.4 Upload/download experience
Users upload via SFTP:
- `datasets/...` (training data)
- `models/...` (local models, optional)
Download:
- `jobs/<ray_submission_id>/...` (checkpoints/logs)
The WebUI/docs explain "how to put these paths into a TaskSpec".
## 5. WebUI design (minimal viable)
### 5.1 Target pages
The v3.0 WebUI uses "**multiple sub-pages + a sidebar nav**" rather than cramming everything into one page:
- Rationale: information density stays manageable, it extends well later (v3.5+), and no page turns into a "giant form/giant list".
- Implementation stays lightweight: server-side rendering (or static HTML + a little JS); no heavy frontend tooling.
Suggested information architecture (IA):
1) **Login page** (`/ui/login`)
  - the user pastes a token (issued by admin); the browser stores it (localStorage/sessionStorage)
  - provides "log out / clear token"
2) **Task list page** (`/ui/tasks`)
  - default list: the most recent N tasks (sorted by created_at descending)
  - filters: workload, state (QUEUED/RUNNING/SUCCEEDED/FAILED/CANCELED), time range
  - quick actions: open details, cancel task
3) **New task page** (`/ui/tasks/new`)
  - two modes (either works):
    - **direct YAML submission**: upload/paste a TaskSpec YAML (least development effort)
    - **form-generated YAML**: choose a workload, fill in the core fields (train/val/model/nnodes/gpus), preview the generated YAML, then submit
  - redirects to the task detail page after submission
4) **Task detail page** (`/ui/tasks/{task_id}`)
  - header: task_id, workload, state, created_at, updated_at, error_summary
  - attempt card: latest attempt_no, ray_submission_id, ray_status, start/end
  - actions: cancel task (if not terminal), refresh status, copy path/ID
  - links to the log page and artifact hints (SFTP paths)
5) **Task log page** (`/ui/tasks/{task_id}/logs`)
  - default tail=2000 (options 200/1000/5000)
  - an "auto refresh (every 3–5 s)" toggle (simple polling is fine)
6) **Data page** (`/ui/data`)
  - shows SFTP connection info (host/port/username)
  - shows the user directory conventions:
    - home: `/private/users/<user_id>`
    - datasets: `/private/users/<user_id>/datasets`
    - models: `/private/users/<user_id>/models`
    - jobs: `/private/users/<user_id>/jobs`
    - trash/jobs: `/private/users/<user_id>/trash/jobs`
  - states the retention rule: jobs move to the trash 3 days after finishing and are deleted from the trash after 7 more days; move important files to `models/` or `datasets/`
7) **(admin only) User management page** (`/ui/admin/users`, optional but valuable)
  - create users, disable users, issue tokens, reset SFTP passwords (Option A)
### 5.2 Page organization and navigation (suggested)
Sidebar (regular users):
- Tasks (list)
- New Task
- Data (SFTP/directory guide)
Admin sidebar additionally shows:
- Admin / Users
### 5.3 Rough wireframe
A rough sketch (not the final UI; it only conveys information structure and layout):
```
┌──────────────────────────────────────────────────────────────────────┐
│ Argus MVP v3.0 [user: alice] │
├───────────────┬──────────────────────────────────────────────────────┤
│ Side Nav │ /ui/tasks │
│ │ │
│ • Tasks │ [Filter] workload=all state=all [Search task_id] │
│ • New Task │ │
│ • Data │ Task List │
│ • Admin(*) │ ┌────────────────────────────────────────────────┐ │
│ │ │ task_id workload state ... │ │
│ │ │ mvp2-alice-ppo-... ppo RUNNING ... │ │
│ │ │ mvp2-alice-sft-... sft SUCCEEDED... │ │
│ │ └────────────────────────────────────────────────┘ │
│ │ [View] [Cancel] │
└───────────────┴──────────────────────────────────────────────────────┘
```
Task detail page (sketch):
```
┌──────────────────────────────────────────────────────────────────────┐
│ /ui/tasks/{task_id} │
├──────────────────────────────────────────────────────────────────────┤
│ task_id: mvp2-alice-ppo-... state: RUNNING workload: ppo │
│ created_at: ... updated_at: ... │
│ error_summary: (empty) │
│ │
│ latest_attempt: a01 ray_submission_id: ...--a01 ray_status: RUNNING │
│ [Open Logs] [Cancel Task] [Refresh] │
│ │
│ Artifacts (SFTP paths): │
│ jobs/: /private/users/alice/jobs/<ray_submission_id>/ │
│ trash/: /private/users/alice/trash/jobs/<ray_submission_id>/ │
│ tip: move important files to /private/users/alice/models/ │
└──────────────────────────────────────────────────────────────────────┘
```
### 5.4 Technical trade-offs (suggested: no Node build)
To keep deployment simple, the v3.0 WebUI should be "server-side rendering + a little JS/HTMX" or "plain static HTML + fetch":
- the API server serves the static assets (FastAPI StaticFiles)
- pages call the same-origin API, avoiding cross-origin issues and a complex frontend build chain
## 6. API extension design (overview)
v3.0 keeps `/api/v2/...` unchanged and adds incrementally:
- SFTPGo management endpoints (admin):
  - create/disable users with SFTPGo linkage
  - reset SFTP password / update SSH key
- user data endpoints (optional, minimal):
  - `/api/v2/me`: returns user_id and SFTP info (host/port/home)
  - `/api/v2/files`: browse/download only (uploads still go through SFTP)
Details in `specs/mvp/v3.0/v3.0_api.md`.
## 7. Configuration and deployment (new in v3.0)
Extend `configs/dev.yaml` with a `data` section (illustrative):
```yaml
data:
  shared_root: "/private"        # usually the same as ray.shared_root
  user_root: "/private/users"    # root of the user space
  allow_common_prefix: "/private/common/"
  allow_user_prefix_template: "/private/users/{user_id}/"
  sftpgo:
    enabled: true
    host: "127.0.0.1"
    sftp_port: 2022
    admin_api_base: "http://127.0.0.1:8081/api/v2"
    admin_user: "admin"
    admin_password_env: "SFTPGO_ADMIN_PASSWORD"  # readable only inside the head container
  retention:
    jobs_trash_after_days: 3
    jobs_purge_after_days: 7
    trash_root_template: "/private/users/{user_id}/trash/jobs"
    janitor_interval_s: 3600  # scan hourly (configurable)
```
## 8. Risks and mitigations
1) **Path escape / unauthorized reads**
  - the API must validate path prefixes at task submission
  - SFTPGo must chroot users to their home
2) **Large-upload stability**
  - prefer SFTP (better resumability/reliability)
3) **Lifecycle of user tokens vs. SFTP credentials**
  - tokens stay in the v2.5 SQLite
  - SFTP credentials should be independent (password/SSH key) with a reset flow
4) **GPFS/NFS permissions**
  - ensure `/private/users/<user>` is writable by SFTPGo and readable by workers
## 9. Confirmed decisions (from your feedback)
1) Users may upload and train on custom datasets: allowed (`/private/users/<u>/datasets/...`).
2) Users may upload and train from local model paths: allowed (`/private/users/<u>/models/...`).
3) v3.0 does not execute user-supplied code (no `PYTHONPATH` injection of an executable code path).
4) SFTPGo auth: password first in v3.0.
5) WebUI: "simple, minimal, necessary" only (token-paste login first).
## 10. Previously open questions (now resolved)
Confirmed: the jobs cleanup executor in v3.0 is the **janitor background thread built into the API server**.

specs/mvp/v3.0/v3.0_dev_plan.md Normal file
@ -0,0 +1,232 @@
# MVP v3.0 Development Plan (TDD-driven)
This is the **engineering plan** for v3.0, emphasizing "tests first, then implementation" (TDD) and splitting every milestone into an **independently verifiable** small loop.
Inputs:
- Roadmap: `specs/mvp/mvp_roadmap_v2.md`
- v3.0 design: `specs/mvp/v3.0/v3.0_design.md`
- v3.0 API: `specs/mvp/v3.0/v3.0_api.md`
- v3.0 acceptance: `specs/mvp/v3.0/v3.0_acceptance.md`
- Baseline: v2.5 (task queue + user mgmt + stateless ray pool + single-image node daemon)
Confirmed v3.0 constraints:
- user dataset paths allowed: `/private/users/<me>/datasets/...`
- user local model paths allowed: `/private/users/<me>/models/...`
- **no execution of user code** (no user code injected into PYTHONPATH; `code_path` remains `/private/common/...` only)
- SFTPGo uses the **password** scheme first (Option A: the API creates/manages SFTPGo users)
- jobs retention: **move to trash (trash/jobs) after 3 days, then permanently delete after 7 more**; no keep/extend-retention flag
- janitor: **background thread built into the API server**; moves/deletes go **directly through the filesystem** (no SFTPGo API dependency)
---
## 0. TDD rules (apply to all features)
### 0.1 Test layers
1) **Unit tests (fast)**
  - pure Python logic: path policy, SFTPGo client, retention computation, file move/delete policy (using temp dirs).
  - no real Ray, no docker, no network.
2) **Component tests (medium)**
  - FastAPI routes (including WebUI routes): `fastapi.testclient.TestClient`
  - mock/stub the SFTPGo client and ray client
3) **End-to-end (slow)**
  - on `argus@h1` via docker compose + scripts
  - the Ray cluster comes up automatically (head + 2 workers)
  - SFTPGo is available
  - upload data → submit training → download artifacts → jobs trash/cleanup
### 0.2 Code and test conventions
- test directory: `src/mvp/py/tests/`
- every new feature first gets test cases that fail before implementation (red)
- a minimal implementation makes them pass (green)
- then refactor (refactor)
- coverage: keep the current threshold (>= 90%)
---
## 1. Milestone breakdown (v3.0 = 5 verifiable loops)
### M1: TaskSpec path policy upgrade (allow user datasets/models; code_path stays common-only)
**Goal**
- upgrade submit-time path validation from v2.5's "only `/private/common/`" to:
  - `train_file` / `val_file`: allow `/private/common/...` and `/private/users/<me>/...`
  - local model paths: allow `/private/users/<me>/models/...` (no YAML schema change; see the implementation note)
  - `code_path`: still `/private/common/...` only
- reject cross-user paths (`/private/users/other/...`) and anything outside `/private/...`.
**Implementation note (no TaskSpec extension)**
- the `model_id` field is unchanged:
  - if `model_id` starts with `/private/` → treat it as a local model path
  - otherwise treat it as a HuggingFace repo id (`Qwen/...`)
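The `model_id` dispatch rule can be sketched as follows (illustrative; `classify_model_id` is a hypothetical helper, not part of the codebase):

```python
def classify_model_id(model_id: str, user_id: str) -> str:
    """Classify a TaskSpec model_id per the M1 rule (hypothetical helper).

    Returns "hf" for HuggingFace repo ids and "local" for allowed local paths;
    raises ValueError for disallowed local paths.
    """
    if not model_id.startswith("/private/"):
        return "hf"  # e.g. "Qwen/Qwen2.5-7B"
    allowed = (
        "/private/common/models/",
        f"/private/users/{user_id}/models/",
    )
    if model_id.startswith(allowed):
        return "local"
    raise ValueError(f"local model path not allowed: {model_id}")
```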
**TDD cases (tests first)**
- unit:
  - `test_paths_allow_common_and_own_user_prefix()`
  - `test_paths_reject_other_user_prefix()`
  - `test_model_id_local_path_allowed_only_under_users_models()`
  - `test_code_path_still_common_only()`
- API:
  - `test_submit_accepts_user_datasets_paths()`
  - `test_submit_rejects_cross_user_paths_404_or_400()` (returns 400/403 per convention)
**Acceptance**
- the class-D security cases in `v3.0_acceptance.md` are covered by API tests.
---
### M2: SFTPGo integration (Option A: linked user creation + password)
**Goal**
- introduce `data management (SFTPGo)`:
  - admin user creation also creates the SFTPGo user (home=/private/users/<user_id>, chroot)
  - password mode generates a one-time password (reset/create) returned to admin (plaintext returned exactly once)
- user self-service info:
  - `GET /api/v2/me` returns SFTP connection info, directory conventions, and retention hints.
**Implementation notes**
- add `SFTPGoAdminClient` (synchronous):
  - via `urllib` or `httpx` (prefer `urllib` to minimize dependencies; do not hard-code a `requests` dependency)
  - supports: create user / disable user / reset password (minimal set)
- the API server validates config at startup (when enabled, the admin password env must be present)
- create the user directory tree on the filesystem at user creation:
  - `/private/users/<u>/{datasets,models,code,jobs,trash/jobs}` (minimum required)
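A minimal shape for such a client, assuming the token/users endpoints as later documented for M5 (illustrative; the real `SFTPGoAdminClient` may differ in structure and error handling):

```python
import base64
import json
import urllib.request

class SFTPGoAdminClient:
    """Minimal synchronous SFTPGo admin client sketch (urllib only)."""

    def __init__(self, api_base: str, admin_user: str, admin_password: str):
        self.api_base = api_base.rstrip("/")  # e.g. http://argus-sftpgo:8080/api/v2
        self._basic = base64.b64encode(f"{admin_user}:{admin_password}".encode()).decode()
        self._token = None

    def _request(self, method: str, path: str, body=None, auth="bearer"):
        req = urllib.request.Request(
            self.api_base + path,
            data=json.dumps(body).encode() if body is not None else None,
            method=method,
            headers={"Content-Type": "application/json"},
        )
        if auth == "basic":
            req.add_header("Authorization", f"Basic {self._basic}")
        else:
            req.add_header("Authorization", f"Bearer {self._token}")
        with urllib.request.urlopen(req) as resp:  # mocked in unit tests
            return json.loads(resp.read() or b"{}")

    def login(self):
        # GET /token with BasicAuth returns a bearer token for subsequent calls
        self._token = self._request("GET", "/token", auth="basic")["access_token"]

    def create_user(self, user_id: str, password: str):
        return self._request("POST", "/users", body={
            "username": user_id,
            "password": password,
            "home_dir": f"/private/users/{user_id}",
            "status": 1,
            "permissions": {"/": ["*"]},
        })
```

In unit tests, `urllib.request.urlopen` is mocked so no real network is touched, matching `test_sftpgo_client_builds_correct_requests()`.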
**TDD cases (tests first)**
- unit:
  - `test_sftpgo_client_builds_correct_requests()` (no real network; mock urlopen)
  - `test_user_dirs_created_on_user_create()` (assert directories exist under a tmp dir)
- API:
  - `test_create_user_calls_sftpgo_client()` (stub client; assert call arguments)
  - `test_me_returns_sftp_info_and_paths()` (including trash/jobs and TTL fields)
**Acceptance**
- covers class A (users/credentials) and the class B prerequisites (upload loop) in `v3.0_acceptance.md`.
---
### M3: WebUI (minimal multi-page + sidebar)
**Goal**
- WebUI served by the API server (same origin; no extra CORS)
- `/ui/login`: token-paste login (localStorage)
- `/ui/tasks`: task list + minimal filters
- `/ui/tasks/new`: YAML submission (first) + optional form-generated YAML
- `/ui/tasks/{task_id}`: detail page
- `/ui/tasks/{task_id}/logs`: log tail + optional auto-refresh
- `/ui/data`: SFTP info + directory/retention hints
- (optional) `/ui/admin/users`: admin user management (strongly recommended if time allows)
**Implementation notes**
- no Node build for now:
  - HTML templates can be plain string assembly or Jinja2 (if jinja2 is introduced, add the dependency and tests)
- pages call `/api/v2/...` via fetch, reusing the token header
**TDD cases (tests first)**
- component tests (TestClient):
  - `test_ui_routes_render_200()`
  - `test_ui_contains_sidebar_links()` (simple assertions on nav link text)
  - `test_ui_tasks_detail_shows_ids()` (contains task_id, state, ray_submission_id)
**Acceptance**
- the WebUI completes: log in → create task → view task → view logs → see the data page hints.
---
### M4: Jobs retention janitor (move to trash after 3 days, purge after 7 more)
**Goal**
- the API server runs a built-in janitor background thread:
  - periodically scans terminal tasks in the DB
  - on expiry:
    - move: `/private/users/<u>/jobs/<sid>` → `/private/users/<u>/trash/jobs/<sid>`
    - purge: recursively delete `/private/users/<u>/trash/jobs/<sid>`
  - strict path validation throughout; no out-of-bounds deletion
  - cleanup operations are recorded in the DB events table (audit)
**Implementation notes (data and state)**
- we need a stable time anchor and idempotency:
  - use attempts.end_time (latest attempt) as the job end time
  - add fields to the tasks table (or a new table):
    - `trashed_at` (time of the first successful move)
    - `purged_at` (time of successful deletion)
    - `trash_path` (optional)
  - idempotent: repeated runs must not fail (a missing directory counts as already handled)
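The time anchoring can be sketched as a pure function over `end_time` (illustrative; it assumes the purge clock totals trash_after + purge_after days from job end, i.e. the 10-day window described in the design doc):

```python
from datetime import datetime, timedelta

def retention_action(end_time: datetime, now: datetime,
                     trash_after_days: int = 3, purge_after_days: int = 7) -> str:
    """Decide what the janitor should do for a job that ended at end_time.

    Returns "keep", "trash", or "purge". A job is purged once it has been in
    the trash for purge_after_days, i.e. trash_after + purge_after from the end.
    """
    age = now - end_time
    if age >= timedelta(days=trash_after_days + purge_after_days):
        return "purge"
    if age >= timedelta(days=trash_after_days):
        return "trash"
    return "keep"
```

Keeping this decision pure makes the threshold-shrinking verification in acceptance case C2 trivial to unit-test.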
**TDD cases (tests first)**
- unit (build jobs/trash trees under a tmpdir):
  - `test_janitor_moves_job_to_trash_after_threshold()`
  - `test_janitor_purges_trash_after_threshold()`
  - `test_janitor_never_touches_models_or_datasets()`
  - `test_janitor_path_escape_rejected()` (a malicious path cannot be deleted)
- API/component:
  - `test_me_includes_retention_fields()` (jobs_trash_after_days/jobs_purge_after_days)
**Acceptance**
- the C2 case in `v3.0_acceptance.md` can be verified by "shrinking the thresholds to minutes".
---
### M5: End-to-end (h1) — SFTP upload → training → artifact download → trash/cleanup
**Goal**
- deliver a one-shot script (or runbook) on `argus@h1` that runs:
  1) `docker compose up -d` brings up Ray (head + 2 workers) + SFTPGo
  2) admin creates user alice (with linked SFTPGo user + password)
  3) alice uploads via SFTP:
    - a dataset to `/private/users/alice/datasets/...`
    - (optional) a local model to `/private/users/alice/models/...`
  4) alice submits a task via API/WebUI referencing those paths
  5) after the task succeeds:
    - download logs/checkpoints from `jobs/<sid>`
    - move weights into `models/` and verify they are never cleaned up
  6) shrink the retention settings and verify jobs→trash→purge
**Delivery suggestions**
- new scripts (naming examples):
  - `scripts/run_all_v30_api.sh`
  - `scripts/run_e2e_v30_cases.sh`
- add the `sftpgo` service to `docker-compose.yaml` (or a `docker-compose.v30.yaml` overlay)
**Acceptance**
- all MUST cases in `v3.0_acceptance.md` pass.
---
## 2. Risks and test focus
1) **Permissions and path escape**
  - the path policy must cover: train/val/model_id(local)/output dirs (jobs/trash)
  - every delete/move must run a prefix check
2) **Concurrency and races**
  - the janitor only handles terminal tasks (never clean a directory still being written)
  - moves use same-filesystem `os.replace` (atomic)
3) **SFTPGo availability**
  - SFTPGo being down must not affect training or core API functionality (apart from the user-creation linkage)
  - the janitor does not depend on SFTPGo (direct filesystem access)
---
## 3. Deliverables (code/config/scripts/docs)
### 3.1 Code
- path policy (v3.0)
- SFTPGoAdminClient + user create/disable/reset-password linkage
- `/api/v2/me` extension (SFTP/directories/retention)
- WebUI routes and static assets
- janitor (trash + purge) background thread + DB records
### 3.2 Config
- `configs/dev.yaml` gains `data.sftpgo` and `data.retention` sections (see the design doc)
### 3.3 Scripts / compose
- compose gains `sftpgo` (or a new overlay compose file)
- v3.0 e2e scripts (upload/download/cleanup verification)
### 3.4 Docs
- update `specs/mvp/v3.0/*` and `src/mvp/README.md` (how to run, path conventions, SFTP usage, retention explained)

specs/mvp/v3.0/v3.0_progress.md Normal file
@ -0,0 +1,154 @@
# MVP v3.0 Progress Log (milestone log)
This document records milestone completion while implementing v3.0 per `specs/mvp/v3.0/v3.0_dev_plan.md`.
Convention: after each milestone, append a record with the **date**, **what was done**, **files touched**, **verification method/result**, and **todos/risks**.
---
## M1: Path policy + tests (done)
- Date: 2025-12-30
- Scope: upgrade submit-time path validation per the v3.0 path policy (no TaskSpec YAML schema change).
- Done:
  - `code_path`: still `/private/common/...` only (v3.0 does not execute user code).
  - `train_file`/`val_file`: allow `/private/common/datasets/...` and `/private/users/<me>/datasets/...`.
  - `model_id`: treated as a local path when it starts with `/private/`; allowed only under:
    - `/private/common/models/...`
    - `/private/users/<me>/models/...`
    otherwise treated as a HuggingFace repo id (`Qwen/...`).
  - cross-user paths rejected (e.g. `bob` submitting `/private/users/alice/datasets/...`).
  - local model paths outside `models/` rejected (e.g. pointing at `jobs/`).
- Files:
  - `src/mvp/py/argus/service/app.py`
  - `src/mvp/py/tests/test_users.py`
- Verification:
  - local tests: `.venv/bin/python -m pytest -q`
  - result: all pass (`54 passed`); coverage threshold held at `>= 90%`.
- Todos/risks:
  - the "local model path" semantics of `model_id=/private/...` must be called out in user docs/WebUI (to avoid misuse).
  - M2/M3 must propagate this path policy into UI forms/hints (so users do not enter wrong paths).
---
## M2: SFTPGo integration (Option A: linked user creation + password) (done)
- Date: 2025-12-30
- Scope: minimal SFTPGo (Data Management) integration + user self-service `/api/v2/me` + the user directory tree on disk.
- Done:
  - new `data` config section:
    - `data.user_root`: user data root (default `/private/users`)
    - `data.sftpgo`: optional SFTPGo linkage (enabled/host/sftp_port/admin_api_base/admin_user/admin_password_env)
    - `data.retention`: jobs expiry policy (move to trash after 3 days, purge after 7; the janitor lands in M4)
  - new `SFTPGoAdminClient` (`urllib` implementation; no `requests`):
    - `create_user` / `disable_user` / `reset_password` (minimal set)
  - API server enhancements:
    - `POST /api/v2/users`: creates the DB user + the directory tree (`datasets/models/code/jobs/trash/jobs`)
    - when `data.sftpgo.enabled=true`, user creation also calls the SFTPGo admin API and returns a one-time password (plaintext returned exactly once; not stored server-side)
    - `POST /api/v2/users/{user_id}:disable`: disables the user (SFTPGo disable is best-effort)
    - `POST /api/v2/users/{user_id}/sftp:reset_password`: admin resets the one-time password (allowed only when SFTPGo is enabled)
    - `GET /api/v2/me`: returns the current user's directory conventions, retention hints, and (optionally) SFTP connection info
  - `src/mvp/configs/dev.yaml` updated with the v3.0 `data.*` settings (sftpgo disabled by default)
- Files:
  - `src/mvp/py/argus/service/config.py`
  - `src/mvp/py/argus/service/sftpgo.py`
  - `src/mvp/py/argus/service/app.py`
  - `src/mvp/py/tests/test_sftpgo.py`
  - `src/mvp/py/tests/test_users.py`
  - `src/mvp/py/tests/test_app.py`
  - `src/mvp/py/tests/test_service_config.py`
  - `src/mvp/configs/dev.yaml`
  - `specs/mvp/v3.0/v3.0_api.md`
- Verification:
  - local tests: `.venv/bin/python -m pytest -q`
  - result: all pass (`62 passed`); coverage `90.11%` (threshold `>= 90%`).
- Todos/risks:
  - M2 only covers "API-side linkage + unit tests"; end-to-end verification against a real SFTPGo container happens in M5 as planned.
  - directory creation depends on filesystem permissions: production deployment must ensure the API/head container can write `/private/users`.
---
## M3: WebUI (minimal multi-page + sidebar) (done)
- Date: 2025-12-30
- Scope: a minimal WebUI served by the API server (same origin, no Node build) for login, submission, task/log viewing, and data info.
- Done:
  - new UI routes (HTML + a little JS):
    - `/ui` (redirects to tasks)
    - `/ui/login`: paste a token, stored in browser localStorage (key=`mvp_token`)
    - `/ui/tasks`: task queue list (calls `/api/v2/queue`)
    - `/ui/tasks/new`: submit a TaskSpec YAML (POST `/api/v2/tasks`)
    - `/ui/tasks/{task_id}`: task detail (GET `/api/v2/tasks/{task_id}`, supports cancel)
    - `/ui/tasks/{task_id}/logs`: log viewer (GET `/api/v2/tasks/{task_id}/logs`, optional auto-refresh)
    - `/ui/data`: shows the paths/SFTP/retention info from `/api/v2/me`
  - unified sidebar nav: Tasks / New Task / Data / Login.
  - the UI keeps no server-side session; every API call carries `Authorization: Bearer <token>` injected from localStorage.
- Files:
  - `src/mvp/py/argus/service/ui.py`
  - `src/mvp/py/argus/service/app.py`
  - `src/mvp/py/tests/test_ui.py`
- Verification:
  - local tests: `.venv/bin/python -m pytest -q`
  - result: all pass (`65 passed`); coverage `90.53%` (threshold `>= 90%`).
- Todos/risks:
  - the WebUI is a "skeleton + API-driven" build: no complex interactions or large-file downloads; upload/download remains SFTP-first (by design).
  - Starlette's TestClient emits an `allow_redirects` deprecation warning (harmless; clean up later).
---
## M4: Jobs retention janitor (move to trash after 3 days, purge after 7 more) (done)
- Date: 2025-12-30
- Scope: a background thread inside the API server applies the retention policy to job directories of ended attempts (direct filesystem access, no SFTPGo dependency).
- Done:
  - new `JobsJanitor`:
    - TTL computed from `attempts.end_time` (the clock starts when the job ends)
    - `>= 3 days && < 7 days`: move the directory from `.../jobs/<ray_submission_id>` to `.../trash/jobs/<ray_submission_id>`
    - `>= 7 days`: ensure the directory has entered the trash, then delete it (`shutil.rmtree`)
    - missing directories and failed moves/deletes are handled best-effort (never disrupt the main service)
  - DB enhancement: new query `list_ended_attempts_before()` for the janitor's candidate scan.
  - the API server starts the janitor thread at startup (controlled via `data.retention.janitor_interval_s`; <=0 disables it).
- Files:
  - `src/mvp/py/argus/service/janitor.py`
  - `src/mvp/py/argus/service/db.py`
  - `src/mvp/py/argus/service/app.py`
  - `src/mvp/py/tests/test_janitor.py`
- Verification:
  - local tests: `.venv/bin/python -m pytest -q`
  - result: all pass (`75 passed`); coverage `90.72%` (threshold `>= 90%`).
- Todos/risks:
  - M4 covers "logic + unit tests" only; real `/private/users/...` permissions and behavior on `argus@h1` are verified in M5 (end-to-end).
---
## M5: End-to-end (h1) — SFTPGo compose + v3.0 E2E scripts (done: scripts/config delivered)
- Date: 2025-12-30
- Scope: the compose service, config, and one-shot scripts needed for the h1 end-to-end run (actual execution/acceptance is done by you on `argus@h1`).
- Done:
  - SFTPGo added to `docker compose`:
    - new `argus-sftpgo` service (SFTP 2022; Admin API/UI 8080 → host 8081, avoiding a clash with the MVP API on 8080)
    - mounts the same `../../shared:/private` and persists metadata to `../../shared/common/sftpgo`
  - SFTPGoAdminClient implemented against the upstream OpenAPI:
    - `GET /api/v2/token` (BasicAuth) to obtain an admin token
    - `POST /api/v2/users` to create users (with `permissions: {"/":["*"]}`)
    - `PUT /api/v2/users/{username}` to disable / reset passwords
  - new v3.0 dev config `configs/dev_v30.yaml` (enables `data.sftpgo` with `admin_api_base=http://argus-sftpgo:8080/api/v2`)
  - new v3.0 one-shot scripts:
    - `scripts/run_all_v30_api.sh`: brings up Ray + SFTPGo, starts the API, creates a user, and submits PPO/GRPO/SFT (referencing user dataset paths)
    - `scripts/run_e2e_v30_cases.sh`: minimal E2E runner (HP-1)
  - API startup script enhanced: `scripts/60_start_api.sh` passes `SFTPGO_ADMIN_PASSWORD` through to the API process inside the head container.
- Files:
  - `src/mvp/docker-compose.yaml`
  - `src/mvp/configs/dev_v30.yaml`
  - `src/mvp/scripts/run_all_v30_api.sh`
  - `src/mvp/scripts/run_e2e_v30_cases.sh`
  - `src/mvp/scripts/60_start_api.sh`
  - `src/mvp/py/argus/service/sftpgo.py`
  - `src/mvp/py/tests/test_sftpgo.py`
  - `src/mvp/README.md`
  - `specs/mvp/v3.0/v3.0_api.md`
- Verification:
  - local tests: `.venv/bin/python -m pytest -q`
  - result: all pass (`75 passed`); coverage `90.35%` (threshold `>= 90%`).
- Todos/risks:
  - you still need to run `scripts/run_all_v30_api.sh` on `argus@h1` to complete real SFTP upload/download and retention acceptance (per `v3.0_acceptance.md`).

@ -0,0 +1,166 @@
# MVP v3.0 Iteration Summary (Ray + SFTPGo + API + WebUI)
This document summarizes what the v3.0 iteration finally delivered — features, architecture, how to run it, acceptance points, and known limitations — for review, handover, and further iteration.
More detailed documents:
- `specs/mvp/v3.0/v3.0_design.md`
- `specs/mvp/v3.0/v3.0_api.md`
- `specs/mvp/v3.0/v3.0_dev_plan.md`
- `specs/mvp/v3.0/v3.0_acceptance.md`
- `specs/mvp/v3.0/v3.0_progress.md`
---
## 1. Goals and scope
As the "first releasable" minimal loop, v3.0 adds:
- **WebUI**: a minimal usable interface (login, task submission and viewing, data entry point, admin entry point).
- **User management**: an internal-token user system (admin and regular users) with user creation and token issuance.
- **Data management entry point (SFTPGo)**: users upload/download their own data via SFTP/WebClient; read-only shared data/cache directories (common) are also exposed for reuse.
- **Training loop preserved**: jobs still run on the cluster via Ray Job submission (all three workloads — PPO/GRPO/SFT — verified).
Explicitly out of scope (kept minimal this iteration):
- no user-supplied training code (TaskSpec's `code_path` is pinned to the verl snapshot policy under common).
- no advanced queueing optimization / multi-cluster / multi-tenant isolation policies (isolation is currently mostly at the per-user jobs directory level).
---
## 2. System architecture (final form)
Core components:
- **Ray cluster (containers)**
  - `argus-ray-head`: head node (no GPU; runs no training); hosts the Ray Dashboard and Job Server.
  - `argus-ray-worker-0/1`: worker nodes (GPU) carrying the training workload.
  - workers join the cluster "stateless + watchdog auto-connecting to head".
- **API server (inside the head container)**
  - reads YAML config (dev/prod), maintains the task queue (sqlite), and periodically schedules tasks onto Ray.
  - also serves the WebUI (`/ui`).
- **SFTPGo (container)**
  - provides SFTP (port `2022`) and a Web Client/Admin (port `8081` mapped to container 8080).
  - user homes are `/private/users/<user>`, writable by default.
  - additionally exposes read-only `/common/*` shares (see section 4).
- **Shared storage (NFS/GPFS etc., mounted at `/private` in containers)**
  - `/private/common`: shared caches (hf, datasets, models, db, logs, ...).
  - `/private/users/<user>`: per-user directories (jobs/datasets/models/code/trash, ...).
---
## 3. Tasks and scheduling (Task / Ray Job)
### 3.1 Task (platform concept)
- users submit a TaskSpec (YAML) to the API; the platform assigns a `task_id` (human-readable, includes the username).
- `task_id` maps to the internal state machine and retry logic; each underlying Ray Job submission produces an attempt and a `ray_submission_id`.
### 3.2 Ray Job (Ray concept)
- the training driver runs as a Ray Job on cluster workers (keeping training off the head).
- the head avoids being scheduled onto via `--num-cpus=0` / custom resources and similar policies.
### 3.3 Handling VERL's resource pre-check
- VERL fail-fasts when creating resource pools (e.g. it errors out immediately on "not enough available GPUs").
- v3.0 keeps the v2.x strategy: the server recognizes the failure cause and retries/backs off per policy (see the scheduler implementation and the v2.5/v3.0 documents).
---
## 4. Data management (SFTPGo) and the read-only common shares
### 4.1 User directories (read-write)
- users access their own home via SFTP/WebClient: `/private/users/<user>`.
- directory layout (at minimum): `datasets/ models/ code/ jobs/ trash/ common/`
### 4.2 Read-only common (Option A: Virtual Folder)
This iteration uses SFTPGo Virtual Folders + per-path permission overrides so users can read the shared directories but not write them.
Final exposed layout:
- `/common/datasets` (read-only)
  - **mapped_path points at the real directory `/private/datasets`** (avoiding the WebClient "permission denied / root escape" problems caused by the many symlinks inside `/private/common/datasets`)
- `/common/hf` (read-only)
  - mapped_path points at `/private/hf`
Note:
- `/private/common/datasets` contains symlinks (e.g. `gsm8k -> /private/datasets/gsm8k`); when a virtual folder maps onto such a symlinked root, SFTPGo treats following a symlink as "escaping the root" and reports permission denied on entry, so we map directly onto the real directory root instead.
---
## 5. WebUI (minimal viable)
Entry points:
- `/ui/login`: paste a token (stored in browser `localStorage`)
- `/ui/tasks`: task list (Running/Pending/Completed; Completed supports pagination)
- `/ui/tasks/new`: task submission (one-click fill-in of the PPO/GRPO/SFT samples)
- `/ui/data`: shows the current username; supports resetting and copying the SFTPGo password; links to the SFTPGo WebClient and notes on clients like FileZilla
- `/ui/admin`: admin entry point (create users, issue tokens, list users)
- the nav bar links to the Ray Dashboard (current host, port `:8265`)
On admin page authorization:
- the admin page itself is reachable, but its data requests must carry an admin token; otherwise the page shows 401/403/errors inline (satisfying "an admin token is required before any content is visible").
---
## 6. API (new/strengthened in v3.0)
Core endpoints (selection):
- auth:
  - Bearer token: `MVP_INTERNAL_TOKEN` (admin) or a user token (issued by admin)
- user management (admin):
  - `POST /api/v2/users` creates a user (and initializes the user directories)
  - `GET /api/v2/users` lists users (including the latest token and created/updated timestamps)
  - `POST /api/v2/users/{user_id}/tokens` issues a user token
- tasks:
  - `POST /api/v2/tasks` submits a TaskSpec (YAML)
  - `GET /api/v2/tasks` lists tasks (supports states/limit/offset, used for Completed pagination)
  - `GET /api/v2/tasks/{task_id}`, `POST /api/v2/tasks/{task_id}:cancel`, `GET /api/v2/tasks/{task_id}/logs`
  - `GET /api/v2/queue` (running/pending overview)
- data/SFTP:
  - `GET /api/v2/me` returns the user's path info and SFTP connection info, and best-effort aligns the SFTPGo user configuration
  - `POST /api/v2/me/sftp:reset_password` lets users self-reset their SFTPGo password (plaintext returned once)
Security trade-off (intranet/dev priority for now):
- so the Admin WebUI can "view and copy tokens", the database persists `token_plain` (the plaintext token).
- this is usually inadvisable in production; later, switch to "reset/re-issue" without echoing plaintext, or echo it exactly once.
---
## 7. 持久化与清理
- 任务队列sqliteWAL 模式)
- SFTPGo自带 sqlite db容器挂载持久化目录
- Jobs 目录清理策略(服务端 janitor
- job 结束后 3 天移动到回收目录trash
- 回收目录再保留 7 天后删除
---
## 8. 运行方式与脚本
开发/验收脚本:
- `src/mvp/scripts/run_all_v30_api.sh`端到端拉起Ray + SFTPGo + API并通过 API 提交 PPO/GRPO/SFT等待完成并验收
- 其他脚本用于启动/停止 API、准备数据与模型、探测服务就绪等详见 scripts 目录与 README
典型端到端(示例参数):
- `MVP_INTERNAL_TOKEN=my-dev-token`
- `SFTPGO_ADMIN_PASSWORD=my-dev-sftpgo-admin`
- 支持 `RESET_DB/RESET_SFTPGO` 用于测试环境重置
---
## 9. Verification results (passing)

End-to-end verification completed in the `argus@h1` environment:
- Ray cluster up (head + 2 workers)
- API server + WebUI up
- SFTPGo (admin + regular users) up
- PPO/GRPO/SFT tasks submitted back-to-back via the API all completed (SUCCEEDED)
- Users can log in to the SFTPGo WebClient / SFTP, access their own directories, and read the read-only content under `/common/datasets` and `/common/hf`

Local unit tests also pass:
- pytest all green
- coverage threshold >= 90%
---
## 10. Known limitations & follow-ups

- The WebUI is currently minimal; interaction and permission feedback are still "engineering-grade" rather than productized (future work: richer error messages, search/filter, aggregated task details, etc.).
- Plaintext token persistence is only suitable for intranet/dev scenarios; production should switch to one-time display or support revocation/rotation.
- SFTPGo virtual folders still carry legacy mappings (e.g. `/common/models` may linger); a one-off cleanup/migration can be added to the upgrade script later.

View File

@@ -16,3 +16,11 @@
- CLI 提交流程:`scripts/run_all_cli.sh`
- API 提交流程:`scripts/run_all_api.sh`
- v2.5Stateless worker + user 隔离 jobsE2E`scripts/run_all_v25_api.sh`
- v3.0WebUI + SFTPGo + user datasets/models + jobs retentionE2E`scripts/run_all_v30_api.sh`
v3.0 access entry points (dev/h1):
- WebUI: `http://127.0.0.1:8080/ui`
- Ray Dashboard: `http://127.0.0.1:8265`
- SFTPGo:
  - SFTP: `127.0.0.1:2022`
  - Admin API/UI: `http://127.0.0.1:8081` (8080 inside the container, mapped to host 8081 to avoid clashing with the API server)

View File

@@ -32,3 +32,24 @@ service:
retry_interval_s: 60
max_running_tasks: 1
# v3.0: user data management (filesystem + SFTPGo)
data:
# All user writable data is placed under this root:
# /private/users/<user_id>/{datasets,models,code,jobs,trash/jobs}
user_root: "/private/users"
# SFTPGo is optional in dev; when enabled, admin endpoints will call SFTPGo admin API.
# Admin password is provided by env var `data.sftpgo.admin_password_env`.
sftpgo:
enabled: false
host: "" # shown to users via GET /api/v2/me
sftp_port: 2022
admin_api_base: "" # e.g. http://argus-sftpgo:8080
admin_user: "admin"
admin_password_env: "SFTPGO_ADMIN_PASSWORD"
# Jobs retention policy (v3.0 janitor): move to trash after 3d, purge after 7d.
retention:
jobs_trash_after_days: 3
jobs_purge_after_days: 7
janitor_interval_s: 3600

View File

@@ -0,0 +1,51 @@
ray:
# Ray Job server address (as seen from inside the head container)
address: "http://127.0.0.1:8265"
# Shared root path (uniformly /private inside containers, matching prod)
shared_root: "/private"
# Force the driver onto a worker (the head node runs no training)
entrypoint_num_cpus: 1
entrypoint_resources:
worker_node: 1
# Runtime env shared by all jobs
runtime_env:
env_vars:
HF_ENDPOINT: "https://hf-mirror.com"
PYTHONUNBUFFERED: "1"
# v3.0: user code execution is not supported yet
user_code_path: "/private/user/code"
service:
api:
host: "0.0.0.0"
port: 8080
auth:
token_env: "MVP_INTERNAL_TOKEN"
sqlite:
db_path: "/private/common/db/mvp.sqlite3"
scheduler:
tick_s: 5
retry_interval_s: 60
max_running_tasks: 1
data:
user_root: "/private/users"
sftpgo:
enabled: true
# Returned to users by GET /api/v2/me. For h1 E2E, usually connect to the host IP.
host: "127.0.0.1"
sftp_port: 2022
# Admin API base should include /api/v2 (SFTPGo OpenAPI server base).
# From head container, access SFTPGo by service name on the compose network.
admin_api_base: "http://argus-sftpgo:8080/api/v2"
admin_user: "admin"
admin_password_env: "SFTPGO_ADMIN_PASSWORD"
retention:
jobs_trash_after_days: 3
jobs_purge_after_days: 7
janitor_interval_s: 3600

View File

@@ -38,6 +38,36 @@ services:
HF_ENDPOINT: "https://hf-mirror.com"
PYTHONUNBUFFERED: "1"
# v3.0: Data management service (SFTPGo).
# - SFTP: 2022
# - Admin API/UI: 8080 (mapped to host 8081 to avoid collision with MVP API server on 8080)
#
# NOTE: This is for dev / h1 E2E. In prod you may use an internal mirror image/tag and different ports.
sftpgo:
image: drakkan/sftpgo:latest
container_name: argus-sftpgo
user: "0:0"
ports:
- "2022:2022"
- "8081:8080"
volumes:
- ../../shared:/private
- ../../shared/common/sftpgo:/var/lib/sftpgo
networks:
- argus-ray-net
environment:
# Create a default admin on first start (used by API server to manage users).
# Override on host as needed:
# export SFTPGO_ADMIN_PASSWORD=...
SFTPGO_DATA_PROVIDER__CREATE_DEFAULT_ADMIN: "true"
# Persist the sqlite DB under the mounted metadata dir; otherwise it defaults to a relative path.
SFTPGO_DATA_PROVIDER__NAME: "/var/lib/sftpgo/sftpgo.db"
SFTPGO_DEFAULT_ADMIN_USERNAME: "admin"
SFTPGO_DEFAULT_ADMIN_PASSWORD: "${SFTPGO_ADMIN_PASSWORD:-my-dev-sftpgo-admin}"
# Explicitly pin default ports via env-var schema (double-underscore for nesting).
SFTPGO_HTTPD__BINDINGS__0__PORT: "8080"
SFTPGO_SFTPD__BINDINGS__0__PORT: "2022"
ray_worker_0:
image: argus/argus-ray-node:v2.5
container_name: argus-ray-worker-0

View File

@@ -1,6 +1,7 @@
from __future__ import annotations
import os
import secrets
import threading
from typing import Any
@@ -12,7 +13,10 @@ from argus.ray.models import JobSpec, RayConfig
from .config import V2Config
from .db import Db
from .janitor import JobsJanitor
from .scheduler import Scheduler
from .sftpgo import SFTPGoAdminClient, SFTPGoError
from .ui import register_ui_routes
def _utc_now_iso() -> str:
@@ -38,11 +42,48 @@ def create_app(config_path: str) -> FastAPI:
db.init()
scheduler = Scheduler(db=db, ray_cfg=ray_cfg, v2_cfg=v2_cfg)
janitor = JobsJanitor(
db=db,
user_root=v2_cfg.data.user_root,
trash_after_days=v2_cfg.data.retention.jobs_trash_after_days,
purge_after_days=v2_cfg.data.retention.jobs_purge_after_days,
interval_s=v2_cfg.data.retention.janitor_interval_s,
)
stop_flag = threading.Event()
tool = scheduler.tool
app = FastAPI(title="mvp-v2", version="2.0")
def _user_home(user_id: str) -> str:
base = v2_cfg.data.user_root.rstrip("/")
return f"{base}/{user_id}"
def _ensure_user_dirs(user_id: str) -> None:
home = _user_home(user_id)
# Note: create a physical /common directory under each user's home so that
# SFTPGo WebClient shows a "common" entry at the root. The actual shared
# content is exposed via SFTPGo virtual folders under /common/{datasets,models}.
for rel in ("datasets", "models", "code", "jobs", "trash/jobs", "common"):
os.makedirs(f"{home}/{rel}", exist_ok=True)
def _sftpgo_enabled() -> bool:
return bool(v2_cfg.data.sftpgo.enabled)
def _sftpgo_client() -> SFTPGoAdminClient:
cfg = v2_cfg.data.sftpgo
if not cfg.admin_api_base:
raise HTTPException(status_code=500, detail="sftpgo enabled but data.sftpgo.admin_api_base is empty")
pw = os.environ.get(cfg.admin_password_env, "")
if not pw:
raise HTTPException(status_code=500, detail=f"missing env: {cfg.admin_password_env}")
shared_root = ray_cfg.shared_root.rstrip("/")
return SFTPGoAdminClient(
admin_api_base=str(cfg.admin_api_base),
admin_user=str(cfg.admin_user),
admin_password=pw,
common_root=f"{shared_root}/common",
)
def _auth(req: Request) -> dict[str, Any]:
token_env = v2_cfg.auth.token_env
admin_token = os.environ.get(token_env, "")
@@ -74,6 +115,9 @@ def create_app(config_path: str) -> FastAPI:
def _startup() -> None:
t = threading.Thread(target=scheduler.run_forever, args=(stop_flag,), daemon=True)
t.start()
if int(janitor.interval_s) > 0:
tj = threading.Thread(target=janitor.run_forever, args=(stop_flag,), daemon=True)
tj.start()
@app.on_event("shutdown")
def _shutdown() -> None:
@@ -93,7 +137,42 @@ def create_app(config_path: str) -> FastAPI:
row = db.create_user(user_id=user_id, display_name=str(display_name) if display_name is not None else None)
except Exception as e:
raise HTTPException(status_code=409, detail=f"user create failed: {e!r}")
return {"user_id": row.get("user_id", user_id), "state": row.get("state", "ACTIVE")}
# v3.0: create user dir structure (datasets/models/code/jobs/trash).
try:
_ensure_user_dirs(user_id)
except Exception as e:
raise HTTPException(status_code=500, detail=f"failed to create user dirs: {e!r}")
out: dict[str, Any] = {"user_id": row.get("user_id", user_id), "state": row.get("state", "ACTIVE")}
# v3.0: scheme A (password) SFTPGo integration.
if _sftpgo_enabled():
pw = secrets.token_urlsafe(12)
try:
_sftpgo_client().create_user(username=user_id, password=pw, home_dir=_user_home(user_id))
except SFTPGoError as e:
# Make create user idempotent for retries:
# If the user already exists in SFTPGo, reset password and enable it.
if "http error: 409" in str(e):
try:
_sftpgo_client().reset_password(username=user_id, new_password=pw, home_dir=_user_home(user_id))
_sftpgo_client().enable_user(username=user_id, home_dir=_user_home(user_id))
except Exception as e2:
raise HTTPException(status_code=502, detail=f"sftpgo upsert user failed: {e2!r}")
else:
raise HTTPException(status_code=502, detail=f"sftpgo create user failed: {e!r}")
except HTTPException:
raise
except Exception as e:
raise HTTPException(status_code=502, detail=f"sftpgo create user failed: {e!r}")
out["sftp"] = {
"username": user_id,
"password": pw, # one-time return to admin; do not persist plaintext
"host": v2_cfg.data.sftpgo.host,
"port": int(v2_cfg.data.sftpgo.sftp_port),
}
return out
@app.post("/api/v2/users/{user_id}/tokens")
async def issue_token(user_id: str, req: Request) -> dict[str, Any]:
@@ -104,6 +183,12 @@ def create_app(config_path: str) -> FastAPI:
token = db.issue_token(user_id=user_id)
return {"user_id": user_id, "token": token}
@app.get("/api/v2/users")
async def list_users(req: Request, limit: int = 200) -> dict[str, Any]:
_require_admin(req)
lim = max(1, min(int(limit), 1000))
return {"users": db.list_users(limit=lim), "limit": lim}
@app.post("/api/v2/users/{user_id}:disable")
async def disable_user(user_id: str, req: Request) -> dict[str, Any]:
_require_admin(req)
@@ -111,8 +196,122 @@ def create_app(config_path: str) -> FastAPI:
if not u:
raise HTTPException(status_code=404, detail="user not found")
db.disable_user(user_id=user_id)
if _sftpgo_enabled():
try:
_sftpgo_client().disable_user(username=user_id, home_dir=_user_home(user_id))
except Exception:
# Best-effort: DB state is source of truth for API auth; SFTPGo sync can lag.
pass
return {"user_id": user_id, "state": "DISABLED"}
@app.post("/api/v2/users/{user_id}/sftp:reset_password")
async def reset_sftp_password(user_id: str, req: Request) -> dict[str, Any]:
_require_admin(req)
u = db.get_user(user_id)
if not u:
raise HTTPException(status_code=404, detail="user not found")
if not _sftpgo_enabled():
raise HTTPException(status_code=400, detail="sftpgo is not enabled")
pw = secrets.token_urlsafe(12)
try:
_sftpgo_client().reset_password(username=user_id, new_password=pw, home_dir=_user_home(user_id))
except HTTPException:
raise
except Exception as e:
raise HTTPException(status_code=502, detail=f"sftpgo reset password failed: {e!r}")
return {"user_id": user_id, "password": pw}
@app.post("/api/v2/me/sftp:reset_password")
async def reset_my_sftp_password(req: Request) -> dict[str, Any]:
"""
v3.0 WebUI: allow a user to rotate their own SFTPGo password and copy it.
Note: SFTPGo does not allow reading the current password, so this endpoint returns a new one-time password.
"""
subject = _auth(req)
user_id = str(subject["user_id"])
u = db.get_user(user_id)
if not u:
raise HTTPException(status_code=404, detail="user not found")
if not _sftpgo_enabled():
raise HTTPException(status_code=400, detail="sftpgo is not enabled")
pw = secrets.token_urlsafe(12)
try:
_sftpgo_client().reset_password(username=user_id, new_password=pw, home_dir=_user_home(user_id))
_sftpgo_client().enable_user(username=user_id, home_dir=_user_home(user_id))
except Exception as e:
raise HTTPException(status_code=502, detail=f"sftpgo reset password failed: {e!r}")
return {"user_id": user_id, "password": pw}
@app.get("/api/v2/me")
async def me(req: Request) -> dict[str, Any]:
subject = _auth(req)
user_id = str(subject["user_id"])
try:
_ensure_user_dirs(user_id)
except Exception:
# Best-effort: user may still be able to use API even if FS init fails.
pass
# Best-effort: reconcile SFTPGo user to include /common read-only mounts.
# This makes the virtual folders visible without requiring a password reset.
if _sftpgo_enabled() and not subject.get("is_admin"):
try:
_sftpgo_client().enable_user(username=user_id, home_dir=_user_home(user_id))
except Exception:
pass
home = _user_home(user_id)
out: dict[str, Any] = {
"user_id": user_id,
"is_admin": bool(subject.get("is_admin")),
"paths": {
"home": home,
"datasets": f"{home}/datasets",
"models": f"{home}/models",
"code": f"{home}/code",
"jobs": f"{home}/jobs",
"trash_jobs": f"{home}/trash/jobs",
},
"retention": {
"jobs_trash_after_days": int(v2_cfg.data.retention.jobs_trash_after_days),
"jobs_purge_after_days": int(v2_cfg.data.retention.jobs_purge_after_days),
},
}
if _sftpgo_enabled():
out["sftp"] = {
"host": v2_cfg.data.sftpgo.host,
"port": int(v2_cfg.data.sftpgo.sftp_port),
"username": user_id,
}
return out
@app.get("/api/v2/tasks")
async def list_tasks(req: Request, limit: int = 200, offset: int = 0, states: str | None = None) -> dict[str, Any]:
subject = _auth(req)
lim = max(1, min(int(limit), 1000))
off = max(0, int(offset))
state_list: list[str] | None = None
if states:
raw = [s.strip() for s in str(states).split(",") if s.strip()]
# Keep it strict to avoid typos silently returning empty results.
allowed = {
"QUEUED",
"PENDING_RESOURCES",
"SUBMITTING",
"SUBMITTED",
"RUNNING",
"SUCCEEDED",
"FAILED",
"CANCELED",
}
for s in raw:
if s not in allowed:
raise HTTPException(status_code=400, detail=f"invalid state: {s}")
state_list = raw or None
if subject.get("is_admin"):
tasks = db.list_tasks(user_id=None, states=state_list, limit=lim, offset=off)
else:
tasks = db.list_tasks(user_id=str(subject["user_id"]), states=state_list, limit=lim, offset=off)
return {"tasks": tasks, "limit": lim, "offset": off, "has_more": bool(len(tasks) == lim)}
@app.post("/api/v2/tasks")
async def submit_task(req: Request) -> dict[str, Any]:
subject = _auth(req)
@@ -126,13 +325,40 @@
except Exception as e:
raise HTTPException(status_code=400, detail=f"invalid jobspec: {e!r}")
# v2.5 constraint: training inputs must come from /private/common (unified dev/prod).
common_prefix = ray_cfg.shared_root.rstrip("/") + "/common/"
for k, v in (("code_path", spec.code_path), ("train_file", spec.train_file), ("val_file", spec.val_file)):
# v3.0 path policy:
# - code_path: only allow /private/common/...
# - train/val: allow /private/common/datasets/... OR /private/users/<me>/datasets/...
# - model_id: if it looks like a local path (/private/...), allow only models dirs:
# /private/common/models/... OR /private/users/<me>/models/...
root = ray_cfg.shared_root.rstrip("/")
common_prefix = f"{root}/common/"
user_prefix = f"{root}/users/{str(subject['user_id']).strip()}/"
common_datasets_prefix = f"{common_prefix}datasets/"
user_datasets_prefix = f"{user_prefix}datasets/"
common_models_prefix = f"{common_prefix}models/"
user_models_prefix = f"{user_prefix}models/"
if not str(spec.code_path).startswith(common_prefix):
raise HTTPException(status_code=400, detail=f"code_path must start with {common_prefix}")
for k, v in (("train_file", spec.train_file), ("val_file", spec.val_file)):
if v is None:
continue
if not str(v).startswith(common_prefix):
raise HTTPException(status_code=400, detail=f"{k} must start with {common_prefix}")
sv = str(v)
if not (sv.startswith(common_datasets_prefix) or sv.startswith(user_datasets_prefix)):
raise HTTPException(
status_code=400,
detail=f"{k} must start with {common_datasets_prefix} or {user_datasets_prefix}",
)
model_id = str(spec.model_id)
if model_id.startswith(f"{root}/"):
if not (model_id.startswith(common_models_prefix) or model_id.startswith(user_models_prefix)):
raise HTTPException(
status_code=400,
detail=f"model_id local path must start with {common_models_prefix} or {user_models_prefix}",
)
task_id = new_task_id(spec.workload, user_id=str(subject["user_id"]))
db.create_task_v25(
@@ -257,4 +483,7 @@ def create_app(config_path: str) -> FastAPI:
return db.list_queue()
return db.list_queue(user_id=str(subject["user_id"]))
# v3.0: minimal WebUI (no server-side session; token stored in browser localStorage).
register_ui_routes(app)
return app

View File

@@ -27,12 +27,37 @@ class V2SchedulerConfig:
max_running_tasks: int = 1
@dataclass(frozen=True)
class V2RetentionConfig:
jobs_trash_after_days: int = 3
jobs_purge_after_days: int = 7
janitor_interval_s: int = 3600
@dataclass(frozen=True)
class V2SFTPGoConfig:
enabled: bool = False
host: str = ""
sftp_port: int = 2022
admin_api_base: str = ""
admin_user: str = "admin"
admin_password_env: str = "SFTPGO_ADMIN_PASSWORD"
@dataclass(frozen=True)
class V2DataConfig:
user_root: str
sftpgo: V2SFTPGoConfig
retention: V2RetentionConfig
@dataclass(frozen=True)
class V2Config:
api: V2ApiConfig
auth: V2AuthConfig
sqlite: V2SqliteConfig
scheduler: V2SchedulerConfig
data: V2DataConfig
@staticmethod
def from_root_dict(root: dict[str, Any]) -> "V2Config":
@@ -58,9 +83,19 @@ class V2Config:
else:
shared_root = str(root.get("shared_root") or "/private")
data = root.get("data") or {}
if not isinstance(data, dict):
raise ValueError("config.data must be a mapping")
sftpgo = data.get("sftpgo") or {}
retention = data.get("retention") or {}
if not isinstance(sftpgo, dict) or not isinstance(retention, dict):
raise ValueError("config.data.{sftpgo,retention} must be mappings")
default_db_path = f"{shared_root}/common/db/mvp.sqlite3"
db_path = str(sqlite.get("db_path") or default_db_path)
user_root = str(data.get("user_root") or f"{shared_root}/users")
return V2Config(
api=V2ApiConfig(
host=str(api.get("host") or "0.0.0.0"),
@@ -73,4 +108,20 @@ class V2Config:
retry_interval_s=int(scheduler.get("retry_interval_s") or 60),
max_running_tasks=int(scheduler.get("max_running_tasks") or 1),
),
data=V2DataConfig(
user_root=user_root,
sftpgo=V2SFTPGoConfig(
enabled=bool(sftpgo.get("enabled") or False),
host=str(sftpgo.get("host") or ""),
sftp_port=int(sftpgo.get("sftp_port") or 2022),
admin_api_base=str(sftpgo.get("admin_api_base") or ""),
admin_user=str(sftpgo.get("admin_user") or "admin"),
admin_password_env=str(sftpgo.get("admin_password_env") or "SFTPGO_ADMIN_PASSWORD"),
),
retention=V2RetentionConfig(
jobs_trash_after_days=int(retention.get("jobs_trash_after_days") or 3),
jobs_purge_after_days=int(retention.get("jobs_purge_after_days") or 7),
janitor_interval_s=int(retention.get("janitor_interval_s") or 3600),
),
),
)

View File

@@ -40,7 +40,8 @@ class Db:
user_id TEXT PRIMARY KEY,
display_name TEXT,
state TEXT NOT NULL,
created_at TEXT NOT NULL
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL
)
"""
)
@@ -49,6 +50,7 @@ class Db:
CREATE TABLE IF NOT EXISTS api_tokens (
token_hash TEXT PRIMARY KEY,
user_id TEXT NOT NULL,
token_plain TEXT NOT NULL,
created_at TEXT NOT NULL,
last_used_at TEXT,
FOREIGN KEY (user_id) REFERENCES users(user_id) ON DELETE CASCADE
@@ -78,6 +80,15 @@ class Db:
conn.execute("ALTER TABLE tasks ADD COLUMN user_id TEXT")
except sqlite3.OperationalError:
pass
# Best-effort: add missing columns (forward compatibility).
try:
conn.execute("ALTER TABLE users ADD COLUMN updated_at TEXT")
except sqlite3.OperationalError:
pass
try:
conn.execute("ALTER TABLE api_tokens ADD COLUMN token_plain TEXT")
except sqlite3.OperationalError:
pass
conn.execute(
"""
CREATE TABLE IF NOT EXISTS attempts (
@@ -168,10 +179,10 @@ class Db:
with self.tx() as conn:
conn.execute(
"""
INSERT INTO users (user_id, display_name, state, created_at)
VALUES (?, ?, 'ACTIVE', ?)
INSERT INTO users (user_id, display_name, state, created_at, updated_at)
VALUES (?, ?, 'ACTIVE', ?, ?)
""",
(user_id, display_name, now),
(user_id, display_name, now, now),
)
conn.execute(
"INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'USER_CREATED', ?)",
@@ -183,7 +194,7 @@ class Db:
def disable_user(self, *, user_id: str) -> None:
now = _utc_now_iso()
with self.tx() as conn:
conn.execute("UPDATE users SET state = 'DISABLED' WHERE user_id = ?", (user_id,))
conn.execute("UPDATE users SET state = 'DISABLED', updated_at = ? WHERE user_id = ?", (now, user_id))
conn.execute(
"INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'USER_DISABLED', ?)",
(now, user_id),
@@ -195,15 +206,18 @@ class Db:
return dict(row) if row else None
def issue_token(self, *, user_id: str) -> str:
# Returns plaintext token once; stores hash only.
# Returns plaintext token once.
# Note: For v3.0 WebUI admin convenience we persist the plaintext token so admins
# can re-copy it later. This is acceptable only for internal/dev deployments.
now = _utc_now_iso()
token = f"mvp_u_{user_id}_{secrets.token_urlsafe(18)}"
token_hash = self._hash_token(token)
with self.tx() as conn:
conn.execute(
"INSERT INTO api_tokens (token_hash, user_id, created_at) VALUES (?, ?, ?)",
(token_hash, user_id, now),
"INSERT INTO api_tokens (token_hash, user_id, token_plain, created_at) VALUES (?, ?, ?, ?)",
(token_hash, user_id, token, now),
)
conn.execute("UPDATE users SET updated_at = ? WHERE user_id = ?", (now, user_id))
conn.execute(
"INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'TOKEN_ISSUED', ?)",
(now, user_id),
@@ -226,8 +240,48 @@ class Db:
return None
now = _utc_now_iso()
conn.execute("UPDATE api_tokens SET last_used_at = ? WHERE token_hash = ?", (now, token_hash))
conn.execute("UPDATE users SET updated_at = ? WHERE user_id = ?", (now, str(row["user_id"])))
return {"user_id": row["user_id"], "state": row["state"]}
def list_users(self, limit: int = 200) -> list[dict[str, Any]]:
with self._connect() as conn:
rows = conn.execute(
"""
SELECT
u.user_id,
u.display_name,
u.state,
u.created_at,
u.updated_at,
(
SELECT t.token_plain
FROM api_tokens t
WHERE t.user_id = u.user_id
ORDER BY t.created_at DESC
LIMIT 1
) AS token,
(
SELECT t.created_at
FROM api_tokens t
WHERE t.user_id = u.user_id
ORDER BY t.created_at DESC
LIMIT 1
) AS token_created_at,
(
SELECT t.last_used_at
FROM api_tokens t
WHERE t.user_id = u.user_id
ORDER BY t.created_at DESC
LIMIT 1
) AS token_last_used_at
FROM users u
ORDER BY u.created_at DESC
LIMIT ?
""",
(int(limit),),
).fetchall()
return [dict(r) for r in rows]
def get_task(self, task_id: str) -> dict[str, Any] | None:
with self._connect() as conn:
row = conn.execute("SELECT * FROM tasks WHERE task_id = ?", (task_id,)).fetchone()
@@ -269,6 +323,40 @@ class Db:
running = conn.execute(running_sql, tuple(params)).fetchall()
return {"pending": [dict(r) for r in pending], "running": [dict(r) for r in running]}
def list_tasks(
self,
*,
user_id: str | None = None,
states: list[str] | None = None,
limit: int = 200,
offset: int = 0,
) -> list[dict[str, Any]]:
"""
Returns recent tasks (including terminal tasks).
"""
with self._connect() as conn:
params: list[Any] = []
where_clauses: list[str] = []
if user_id is not None:
where_clauses.append("user_id = ?")
params.append(user_id)
if states:
placeholders = ",".join(["?"] * len(states))
where_clauses.append(f"state IN ({placeholders})")
params.extend(states)
where_sql = f" WHERE {' AND '.join(where_clauses)}" if where_clauses else ""
sql = (
"SELECT task_id, user_id, workload, state, nnodes, n_gpus_per_node, latest_attempt_no, created_at, updated_at, error_summary "
"FROM tasks"
f"{where_sql} "
"ORDER BY created_at DESC "
"LIMIT ? OFFSET ?"
)
params.append(int(limit))
params.append(max(0, int(offset)))
rows = conn.execute(sql, tuple(params)).fetchall()
return [dict(r) for r in rows]
def count_running(self) -> int:
with self._connect() as conn:
row = conn.execute(
@@ -383,3 +471,25 @@
"INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (?, ?, 'ATTEMPT_UPDATE', ?)",
(task_id, now, None),
)
def list_ended_attempts_before(self, *, end_time_le: str, limit: int = 2000) -> list[dict[str, Any]]:
"""
Used by the jobs janitor:
- Only considers tasks with a non-null user_id (v2.5+).
- Returns attempts that have end_time <= end_time_le.
"""
with self._connect() as conn:
rows = conn.execute(
"""
SELECT t.task_id, t.user_id, a.ray_submission_id, a.end_time
FROM attempts a
JOIN tasks t ON t.task_id = a.task_id
WHERE t.user_id IS NOT NULL
AND a.end_time IS NOT NULL
AND a.end_time <= ?
ORDER BY a.end_time ASC
LIMIT ?
""",
(str(end_time_le), int(limit)),
).fetchall()
return [dict(r) for r in rows]

View File

@@ -0,0 +1,105 @@
from __future__ import annotations
import os
import shutil
import time
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from .db import Db
def _parse_iso_z(ts: str) -> datetime:
# Stored as "YYYY-MM-DDTHH:MM:SSZ" (naive UTC). Parse into aware UTC.
return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)
def _iso_z(dt: datetime) -> str:
return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")
@dataclass
class JobsJanitor:
db: Db
user_root: str
trash_after_days: int = 3
purge_after_days: int = 7
interval_s: int = 3600
def __post_init__(self) -> None:
if int(self.trash_after_days) < 0 or int(self.purge_after_days) < 0:
raise ValueError("retention days must be non-negative")
if int(self.purge_after_days) and int(self.purge_after_days) < int(self.trash_after_days):
raise ValueError("purge_after_days must be >= trash_after_days")
def _job_dir(self, *, user_id: str, ray_submission_id: str) -> str:
base = self.user_root.rstrip("/")
return f"{base}/{user_id}/jobs/{ray_submission_id}"
def _trash_dir(self, *, user_id: str, ray_submission_id: str) -> str:
base = self.user_root.rstrip("/")
return f"{base}/{user_id}/trash/jobs/{ray_submission_id}"
def tick_once(self, *, now: datetime | None = None, limit: int = 2000) -> None:
if int(self.trash_after_days) <= 0 and int(self.purge_after_days) <= 0:
return
now_dt = now or datetime.now(timezone.utc)
move_cutoff = now_dt - timedelta(days=int(self.trash_after_days))
move_cutoff_iso = _iso_z(move_cutoff)
rows = self.db.list_ended_attempts_before(end_time_le=move_cutoff_iso, limit=int(limit))
for r in rows:
user_id = str(r.get("user_id") or "").strip()
sid = str(r.get("ray_submission_id") or "").strip()
end_time = str(r.get("end_time") or "").strip()
if not user_id or not sid or not end_time:
continue
try:
ended_at = _parse_iso_z(end_time)
except Exception:
continue
age_days = (now_dt - ended_at).total_seconds() / 86400.0
src = self._job_dir(user_id=user_id, ray_submission_id=sid)
dst = self._trash_dir(user_id=user_id, ray_submission_id=sid)
dst_parent = os.path.dirname(dst)
# Between trash and purge: ensure in trash.
if age_days >= float(self.trash_after_days) and age_days < float(self.purge_after_days or 10**9):
if os.path.exists(src) and not os.path.exists(dst):
os.makedirs(dst_parent, exist_ok=True)
try:
shutil.move(src, dst)
except Exception:
pass
continue
# Purge: move to trash (if still in jobs) then delete from trash.
if int(self.purge_after_days) > 0 and age_days >= float(self.purge_after_days):
if os.path.exists(src) and not os.path.exists(dst):
os.makedirs(dst_parent, exist_ok=True)
try:
shutil.move(src, dst)
except Exception:
pass
if os.path.exists(dst):
try:
shutil.rmtree(dst)
except Exception:
pass
def run_forever(self, stop_flag: object) -> None:
try:
is_set = stop_flag.is_set # type: ignore[attr-defined]
except Exception:
raise ValueError("stop_flag must be a threading.Event-like object with is_set()")
while not is_set():
try:
self.tick_once()
except Exception:
pass
time.sleep(max(1, int(self.interval_s)))

View File

@@ -0,0 +1,221 @@
from __future__ import annotations
import base64
import json
from dataclasses import dataclass
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen
class SFTPGoError(RuntimeError):
pass
@dataclass(frozen=True)
class SFTPGoAdminClient:
"""
Minimal SFTPGo admin API client (v3.0).
SFTPGo OpenAPI documents the admin token flow:
- GET /api/v2/token with HTTP BasicAuth -> returns {"access_token": "..."}.
- Use Authorization: Bearer <token> for admin endpoints like POST /api/v2/users.
See upstream OpenAPI for details:
https://raw.githubusercontent.com/drakkan/sftpgo/main/openapi/openapi.yaml
"""
admin_api_base: str
admin_user: str
admin_password: str
common_root: str = "/private/common"
def _url(self, path: str) -> str:
base = self.admin_api_base.rstrip("/")
p = path if path.startswith("/") else f"/{path}"
return f"{base}{p}"
def _basic_auth_header(self) -> str:
raw = f"{self.admin_user}:{self.admin_password}".encode("utf-8")
return "Basic " + base64.b64encode(raw).decode("ascii")
def _get_json(self, url: str, headers: dict[str, str]) -> dict:
req = Request(url, headers=headers, method="GET")
try:
with urlopen(req, timeout=10) as resp:
return json.loads(resp.read().decode("utf-8"))
except HTTPError as e:
raise SFTPGoError(f"sftpgo http error: {e.code} {e.reason}") from e
except URLError as e:
raise SFTPGoError(f"sftpgo connection error: {e!r}") from e
def _post_json(self, url: str, payload: dict, headers: dict[str, str]) -> None:
data = json.dumps(payload).encode("utf-8")
req = Request(url, data=data, headers=headers, method="POST")
try:
with urlopen(req, timeout=10) as resp:
_ = resp.read()
except HTTPError as e:
raise SFTPGoError(f"sftpgo http error: {e.code} {e.reason}") from e
except URLError as e:
raise SFTPGoError(f"sftpgo connection error: {e!r}") from e
def _put_json(self, url: str, payload: dict, headers: dict[str, str]) -> None:
data = json.dumps(payload).encode("utf-8")
req = Request(url, data=data, headers=headers, method="PUT")
try:
with urlopen(req, timeout=10) as resp:
_ = resp.read()
except HTTPError as e:
raise SFTPGoError(f"sftpgo http error: {e.code} {e.reason}") from e
except URLError as e:
raise SFTPGoError(f"sftpgo connection error: {e!r}") from e
def _admin_token(self) -> str:
url = self._url("/token")
obj = self._get_json(url, headers={"Authorization": self._basic_auth_header()})
tok = str(obj.get("access_token") or "").strip()
if not tok:
raise SFTPGoError("sftpgo token response missing access_token")
return tok
def _auth_headers(self, tok: str) -> dict[str, str]:
return {"Authorization": f"Bearer {tok}", "Content-Type": "application/json"}
def _ensure_folder(self, *, tok: str, name: str, mapped_path: str) -> None:
url = self._url("/folders")
try:
self._post_json(url, {"name": name, "mapped_path": mapped_path}, headers=self._auth_headers(tok))
except SFTPGoError as e:
# Idempotent + self-healing:
# If it already exists, update mapped_path to the desired value.
if "409" in str(e):
self._put_json(
self._url(f"/folders/{name}"),
{"name": name, "mapped_path": mapped_path},
headers=self._auth_headers(tok),
)
return
raise
def _ensure_common_folders(self, *, tok: str) -> None:
# Important: do NOT map datasets to /private/common/datasets because that path is
# a symlink farm (e.g. gsm8k -> /private/datasets/gsm8k) and SFTPGo WebClient
# will treat symlink traversal as escaping the virtual folder root, resulting
# in "permission denied". Map directly to the real datasets root.
common = self.common_root.rstrip("/")
if common.endswith("/common"):
shared_root = common[: -len("/common")] or "/private"
else:
shared_root = "/private"
self._ensure_folder(tok=tok, name="common_datasets", mapped_path=f"{shared_root}/datasets")
# Expose HF cache read-only so users can inspect downloaded models/datasets.
self._ensure_folder(tok=tok, name="common_hf", mapped_path=f"{shared_root}/hf")
def _get_user(self, *, tok: str, username: str) -> dict:
url = self._url(f"/users/{username}")
return self._get_json(url, headers={"Authorization": f"Bearer {tok}"})
def _put_user(self, *, tok: str, username: str, payload: dict) -> None:
url = self._url(f"/users/{username}")
self._put_json(url, payload, headers=self._auth_headers(tok))
def _apply_common_readonly(self, user_payload: dict) -> dict:
# Path-based permissions: make /common/* read-only while keeping home writeable.
perms = dict(user_payload.get("permissions") or {"/": ["*"]})
# Ensure /common is visible as a directory and can be traversed.
perms["/common"] = ["list"]
perms["/common/datasets"] = ["list", "download"]
perms["/common/hf"] = ["list", "download"]
user_payload["permissions"] = perms
desired_vf = [
{"name": "common_datasets", "virtual_path": "/common/datasets"},
{"name": "common_hf", "virtual_path": "/common/hf"},
]
existing = user_payload.get("virtual_folders") or []
if not isinstance(existing, list):
existing = []
seen = {(vf.get("name"), vf.get("virtual_path")) for vf in existing if isinstance(vf, dict)}
merged = list(existing)
for vf in desired_vf:
key = (vf["name"], vf["virtual_path"])
if key not in seen:
merged.append(vf)
user_payload["virtual_folders"] = merged
return user_payload
def create_user(self, *, username: str, password: str, home_dir: str) -> None:
tok = self._admin_token()
self._ensure_common_folders(tok=tok)
url = self._url("/users")
payload: dict = {
"status": 1,
"username": username,
"password": password,
"home_dir": home_dir,
"permissions": {
"/": ["*"],
"/common": ["list"],
"/common/datasets": ["list", "download"],
"/common/hf": ["list", "download"],
},
"virtual_folders": [
{"name": "common_datasets", "virtual_path": "/common/datasets"},
{"name": "common_hf", "virtual_path": "/common/hf"},
],
}
self._post_json(
url,
payload,
headers=self._auth_headers(tok),
)
def disable_user(self, *, username: str, home_dir: str) -> None:
tok = self._admin_token()
self._ensure_common_folders(tok=tok)
cur = self._get_user(tok=tok, username=username)
payload: dict = {
"username": username,
"status": 0,
"home_dir": home_dir,
"uid": cur.get("uid", 0),
"gid": cur.get("gid", 0),
"permissions": cur.get("permissions") or {"/": ["*"]},
"virtual_folders": cur.get("virtual_folders") or [],
}
self._apply_common_readonly(payload)
self._put_user(tok=tok, username=username, payload=payload)
def enable_user(self, *, username: str, home_dir: str) -> None:
tok = self._admin_token()
self._ensure_common_folders(tok=tok)
cur = self._get_user(tok=tok, username=username)
payload: dict = {
"username": username,
"status": 1,
"home_dir": home_dir,
"uid": cur.get("uid", 0),
"gid": cur.get("gid", 0),
"permissions": cur.get("permissions") or {"/": ["*"]},
"virtual_folders": cur.get("virtual_folders") or [],
}
self._apply_common_readonly(payload)
self._put_user(tok=tok, username=username, payload=payload)
def reset_password(self, *, username: str, new_password: str, home_dir: str) -> None:
tok = self._admin_token()
self._ensure_common_folders(tok=tok)
cur = self._get_user(tok=tok, username=username)
payload: dict = {
"username": username,
"status": 1,
"home_dir": home_dir,
"uid": cur.get("uid", 0),
"gid": cur.get("gid", 0),
"permissions": cur.get("permissions") or {"/": ["*"]},
"virtual_folders": cur.get("virtual_folders") or [],
"password": new_password,
}
self._apply_common_readonly(payload)
self._put_user(tok=tok, username=username, payload=payload)
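The read-only overlay applied by `_apply_common_readonly` above is intended to be idempotent: re-applying it must not duplicate virtual folders or widen permissions. A standalone sketch of the same merge (a hypothetical `apply_common_readonly` helper, detached from the SFTPGo client class) makes that property checkable:

```python
# Standalone sketch mirroring the merge semantics of _apply_common_readonly.
# Hypothetical helper for illustration; not the real SFTPGo client method.

def apply_common_readonly(user_payload: dict) -> dict:
    # Path-based permissions: /common/* is list/download only, home stays writeable.
    perms = dict(user_payload.get("permissions") or {"/": ["*"]})
    perms["/common"] = ["list"]
    perms["/common/datasets"] = ["list", "download"]
    perms["/common/hf"] = ["list", "download"]
    user_payload["permissions"] = perms

    desired = [
        {"name": "common_datasets", "virtual_path": "/common/datasets"},
        {"name": "common_hf", "virtual_path": "/common/hf"},
    ]
    existing = user_payload.get("virtual_folders") or []
    seen = {(vf.get("name"), vf.get("virtual_path")) for vf in existing if isinstance(vf, dict)}
    merged = list(existing)
    for vf in desired:
        if (vf["name"], vf["virtual_path"]) not in seen:
            merged.append(vf)
    user_payload["virtual_folders"] = merged
    return user_payload

p = apply_common_readonly({"username": "alice"})
# Applying twice must not duplicate virtual folders (idempotent merge).
p2 = apply_common_readonly(p)
assert len(p2["virtual_folders"]) == 2
assert p2["permissions"]["/"] == ["*"]
```

This is why `disable_user` / `enable_user` / `reset_password` can all call the overlay unconditionally on the payload fetched from SFTPGo.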


@@ -0,0 +1,629 @@
from __future__ import annotations
import html
import json
from fastapi import FastAPI
from fastapi.responses import HTMLResponse, RedirectResponse
_BASE_CSS = """
:root { --bg:#0b1020; --panel:#111a33; --muted:#95a3c6; --fg:#e8eeff; --accent:#7aa2ff; --danger:#ff6b6b; --ok:#3ddc97; }
* { box-sizing: border-box; }
body { margin:0; font-family: ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Helvetica, Arial; background:var(--bg); color:var(--fg); }
a { color:var(--accent); text-decoration:none; }
.layout { display:flex; min-height:100vh; }
.nav { width: 240px; padding:16px; background: linear-gradient(180deg, #0e1630, #0b1020); border-right: 1px solid rgba(255,255,255,0.06); }
.brand { font-weight: 700; letter-spacing: .2px; margin-bottom: 12px; }
.nav a { display:block; padding:10px 10px; border-radius:10px; color: var(--fg); opacity: .9; }
.nav a.active { background: rgba(122,162,255,0.14); border: 1px solid rgba(122,162,255,0.22); }
.nav a:hover { background: rgba(255,255,255,0.06); }
.main { flex:1; padding: 20px 24px; }
.card { background: rgba(255,255,255,0.04); border: 1px solid rgba(255,255,255,0.08); border-radius: 14px; padding: 16px; }
.row { display:flex; gap: 12px; align-items:center; flex-wrap: wrap; }
.muted { color: var(--muted); }
.btn { border: 1px solid rgba(255,255,255,0.16); background: rgba(255,255,255,0.06); color: var(--fg); padding: 10px 12px; border-radius: 10px; cursor: pointer; }
.btn:hover { background: rgba(255,255,255,0.10); }
.btn.danger { border-color: rgba(255,107,107,0.35); background: rgba(255,107,107,0.10); }
.pill { display:inline-block; padding: 2px 10px; border-radius: 999px; border: 1px solid rgba(255,255,255,0.16); font-size: 12px; }
.pill.ok { border-color: rgba(61,220,151,0.35); background: rgba(61,220,151,0.12); }
.pill.bad { border-color: rgba(255,107,107,0.35); background: rgba(255,107,107,0.12); }
textarea, input { width: 100%; color: var(--fg); background: rgba(255,255,255,0.06); border: 1px solid rgba(255,255,255,0.12); border-radius: 12px; padding: 10px 12px; outline: none; }
button:disabled { opacity: .45; cursor: not-allowed; }
pre { white-space: pre-wrap; word-break: break-word; }
table { width:100%; border-collapse: collapse; }
th, td { padding: 10px 8px; border-bottom: 1px solid rgba(255,255,255,0.08); text-align:left; }
""".strip()
_BASE_JS = """
function mvpTokenGet() {
return (localStorage.getItem("mvp_token") || "").trim();
}
function mvpTokenSet(v) {
localStorage.setItem("mvp_token", (v || "").trim());
}
function mvpSftpPasswordGet() {
return (localStorage.getItem("mvp_sftp_password") || "").trim();
}
function mvpSftpPasswordSet(v) {
localStorage.setItem("mvp_sftp_password", (v || "").trim());
}
async function apiFetch(path, opts) {
opts = opts || {};
opts.headers = opts.headers || {};
const tok = mvpTokenGet();
if (tok) opts.headers["Authorization"] = "Bearer " + tok;
return fetch(path, opts);
}
async function apiJson(path, opts) {
const resp = await apiFetch(path, opts);
const text = await resp.text();
if (!resp.ok) {
const err = new Error("HTTP " + resp.status);
err.status = resp.status;
err.body = text;
throw err;
}
return JSON.parse(text);
}
function fmtJson(obj) {
try { return JSON.stringify(obj, null, 2); } catch (e) { return String(obj); }
}
function curOriginWithPort(port) {
const proto = window.location.protocol;
const host = window.location.hostname;
return proto + "//" + host + ":" + port;
}
async function copyText(v) {
if (!v) return false;
try {
await navigator.clipboard.writeText(v);
return true;
} catch (e) {
// Fallback for non-secure contexts (http) or older browsers.
try {
const ta = document.createElement("textarea");
ta.value = v;
ta.style.position = "fixed";
ta.style.opacity = "0";
document.body.appendChild(ta);
ta.focus();
ta.select();
const ok = document.execCommand("copy");
document.body.removeChild(ta);
return ok;
} catch (e2) {
return false;
}
}
}
document.addEventListener("DOMContentLoaded", () => {
const el = document.getElementById("nav-ray-dashboard");
if (el) el.href = curOriginWithPort(8265);
});
""".strip()
def _nav(active: str) -> str:
links = [
("login", "/ui/login", "Login"),
("tasks", "/ui/tasks", "Tasks"),
("new", "/ui/tasks/new", "New Task"),
("data", "/ui/data", "Data"),
("admin", "/ui/admin", "Admin"),
("ray", "#", "Ray Dashboard"),
]
items = []
for key, href, label in links:
cls = "active" if key == active else ""
extra = ""
if key == "ray":
extra = ' id="nav-ray-dashboard" target="_blank" rel="noopener"'
items.append(f'<a class="{cls}" href="{href}"{extra}>{html.escape(label)}</a>')
return "\n".join(items)
def _page(title: str, active: str, body: str, script: str = "") -> str:
return f"""<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>{html.escape(title)}</title>
<style>{_BASE_CSS}</style>
</head>
<body>
<div class="layout">
<nav class="nav">
<div class="brand">Argus MVP</div>
{_nav(active)}
<div style="margin-top:12px" class="muted">
Token stored in browser localStorage as <code>mvp_token</code>.
</div>
</nav>
<main class="main">
{body}
</main>
</div>
<script>{_BASE_JS}</script>
<script>{script}</script>
</body>
</html>"""
def register_ui_routes(app: FastAPI) -> None:
@app.get("/ui")
async def ui_root() -> RedirectResponse:
return RedirectResponse(url="/ui/tasks")
@app.get("/ui/login")
async def ui_login() -> HTMLResponse:
body = """
<h1>Login</h1>
<div class="card">
<div class="muted">Paste your API token (without the <code>Bearer</code> prefix).</div>
<div style="height:10px"></div>
<input id="tok" placeholder="token..." />
<div style="height:10px"></div>
<div class="row">
<button class="btn" id="save">Save</button>
<button class="btn" id="clear">Clear</button>
<a href="/ui/tasks" class="btn" style="display:inline-block">Go to Tasks</a>
</div>
<div style="height:10px"></div>
<div id="msg" class="muted"></div>
</div>
""".strip()
script = """
const tokEl = document.getElementById("tok");
const msg = document.getElementById("msg");
tokEl.value = mvpTokenGet();
document.getElementById("save").onclick = () => { mvpTokenSet(tokEl.value); msg.textContent = "Saved."; };
document.getElementById("clear").onclick = () => { mvpTokenSet(""); tokEl.value = ""; msg.textContent = "Cleared."; };
""".strip()
return HTMLResponse(content=_page("Login", "login", body, script))
@app.get("/ui/tasks")
async def ui_tasks() -> HTMLResponse:
body = """
<h1>Tasks</h1>
<div class="card">
<div class="row">
<button class="btn" id="refresh">Refresh</button>
<a class="btn" href="/ui/tasks/new" style="display:inline-block">New Task</a>
</div>
<div style="height:10px"></div>
<div id="out" class="muted">Loading...</div>
</div>
""".strip()
script = """
const out = document.getElementById("out");
async function refresh() {
out.textContent = "Loading...";
try {
const q = await apiJson("/api/v2/queue");
const completedLimit = 25;
const completedOffset = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0;
const done = await apiJson("/api/v2/tasks?limit=" + completedLimit + "&offset=" + completedOffset + "&states=SUCCEEDED,FAILED,CANCELED");
function pill(state) {
const s = String(state || "");
if (s === "SUCCEEDED") return `<span class="pill ok">${s}</span>`;
if (s === "FAILED") return `<span class="pill bad">${s}</span>`;
if (s === "CANCELED") return `<span class="pill">${s}</span>`;
if (s === "RUNNING") return `<span class="pill ok">${s}</span>`;
if (s === "QUEUED" || s === "PENDING_RESOURCES" || s === "SUBMITTING" || s === "SUBMITTED") return `<span class="pill">${s}</span>`;
return `<span class="pill">${s}</span>`;
}
function row(t) {
const id = t.task_id;
return `<tr>
<td><a href="/ui/tasks/${id}">${id}</a></td>
<td>${t.workload}</td>
<td>${pill(t.state)}</td>
<td>${t.nnodes} x ${t.n_gpus_per_node} GPU</td>
<td>${t.updated_at || ""}</td>
</tr>`;
}
const running = (q.running || []).map(row).join("");
const pending = (q.pending || []).map(row).join("");
const doneRows = (done.tasks || []).map(row).join("");
const pageNo = Math.floor(completedOffset / completedLimit) + 1;
const prevDisabled = completedOffset <= 0;
const nextDisabled = !done.has_more;
out.innerHTML = `
<div class="muted">Tip: configure token in <a href="/ui/login">Login</a>.</div>
<div style="height:10px"></div>
<h3>Running</h3>
<table><thead><tr><th>Task</th><th>Workload</th><th>State</th><th>Resources</th><th>Updated</th></tr></thead><tbody>${running || "<tr><td colspan=5 class=muted>(none)</td></tr>"}</tbody></table>
<div style="height:12px"></div>
<h3>Pending</h3>
<table><thead><tr><th>Task</th><th>Workload</th><th>State</th><th>Resources</th><th>Updated</th></tr></thead><tbody>${pending || "<tr><td colspan=5 class=muted>(none)</td></tr>"}</tbody></table>
<div style="height:12px"></div>
<h3>Completed</h3>
<div class="row" style="justify-content: space-between; margin-bottom: 8px;">
<div class="muted">Page ${pageNo}</div>
<div class="row">
<button class="btn" id="done-prev" ${prevDisabled ? "disabled" : ""}>Prev</button>
<button class="btn" id="done-next" ${nextDisabled ? "disabled" : ""}>Next</button>
</div>
</div>
<table><thead><tr><th>Task</th><th>Workload</th><th>State</th><th>Resources</th><th>Updated</th></tr></thead><tbody>${doneRows || "<tr><td colspan=5 class=muted>(none)</td></tr>"}</tbody></table>
`;
const prevBtn = document.getElementById("done-prev");
const nextBtn = document.getElementById("done-next");
if (prevBtn) prevBtn.onclick = () => {
const cur = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0;
const next = Math.max(0, cur - completedLimit);
localStorage.setItem("mvp_completed_offset", String(next));
refresh();
};
if (nextBtn) nextBtn.onclick = () => {
const cur = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0;
const next = cur + completedLimit;
localStorage.setItem("mvp_completed_offset", String(next));
refresh();
};
} catch (e) {
out.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
}
}
document.getElementById("refresh").onclick = refresh;
refresh();
""".strip()
return HTMLResponse(content=_page("Tasks", "tasks", body, script))
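The Completed table above pages with `limit`/`offset` persisted in localStorage and relies on a server-reported `has_more` flag to disable the Next button. A minimal sketch of that paging contract over an in-memory list (a hypothetical `page` helper, not the real `/api/v2/tasks` handler):

```python
# Sketch of the limit/offset/has_more contract the Tasks page assumes.
# `page` is a hypothetical helper over an in-memory list, for illustration only.

def page(items: list, *, limit: int, offset: int) -> dict:
    window = items[offset : offset + limit]
    # has_more tells the client whether another page exists past this window.
    return {"tasks": window, "has_more": offset + limit < len(items)}

resp = page(list(range(60)), limit=25, offset=50)
assert resp["tasks"] == list(range(50, 60))
assert resp["has_more"] is False  # page 3 of 60 items at limit 25 is the last
```

Keeping `has_more` server-side avoids a separate count query in the UI: the client only needs the flag to render Prev/Next state.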
@app.get("/ui/tasks/new")
async def ui_new_task() -> HTMLResponse:
ppo = """# PPO TaskSpec (YAML)
workload: ppo
nnodes: 2
n_gpus_per_node: 4
code_path: /private/common/code/verl/verl_repo
train_file: /private/common/datasets/gsm8k/train.parquet
val_file: /private/common/datasets/gsm8k/test.parquet
model_id: Qwen/Qwen2.5-0.5B-Instruct
""".strip()
grpo = """# GRPO TaskSpec (YAML)
workload: grpo
nnodes: 2
n_gpus_per_node: 4
code_path: /private/common/code/verl/verl_repo
train_file: /private/common/datasets/gsm8k/train.parquet
val_file: /private/common/datasets/gsm8k/test.parquet
model_id: Qwen/Qwen2.5-0.5B-Instruct
""".strip()
sft = """# SFT TaskSpec (YAML)
workload: sft
nnodes: 1
n_gpus_per_node: 1
code_path: /private/common/code/verl/verl_repo
train_file: /private/common/datasets/gsm8k_sft/train.parquet
val_file: /private/common/datasets/gsm8k_sft/test.parquet
model_id: Qwen/Qwen2.5-0.5B-Instruct
""".strip()
body = f"""
<h1>New Task</h1>
<div class="card">
<div class="muted">Paste TaskSpec YAML and submit to API server. Note: <code>code_path</code> is required (v3.0 does not execute user code; use the common snapshot).</div>
<div style="height:10px"></div>
<div class="row">
<button class="btn" id="tpl-ppo">PPO example</button>
<button class="btn" id="tpl-grpo">GRPO example</button>
<button class="btn" id="tpl-sft">SFT example</button>
</div>
<div style="height:10px"></div>
<textarea id="yaml" rows="16">{html.escape(ppo)}</textarea>
<div style="height:10px"></div>
<div class="row">
<button class="btn" id="submit">Submit</button>
<a class="btn" href="/ui/tasks" style="display:inline-block">Back</a>
</div>
<div style="height:10px"></div>
<pre id="msg" class="muted"></pre>
</div>
""".strip()
tpl_ppo = json.dumps(ppo)
tpl_grpo = json.dumps(grpo)
tpl_sft = json.dumps(sft)
script = (
"""
const msg = document.getElementById("msg");
const yamlEl = document.getElementById("yaml");
const TPL_PPO = __TPL_PPO__;
const TPL_GRPO = __TPL_GRPO__;
const TPL_SFT = __TPL_SFT__;
document.getElementById("tpl-ppo").onclick = () => { yamlEl.value = TPL_PPO; msg.textContent = ""; };
document.getElementById("tpl-grpo").onclick = () => { yamlEl.value = TPL_GRPO; msg.textContent = ""; };
document.getElementById("tpl-sft").onclick = () => { yamlEl.value = TPL_SFT; msg.textContent = ""; };
document.getElementById("submit").onclick = async () => {
msg.textContent = "Submitting...";
const body = yamlEl.value;
const resp = await apiFetch("/api/v2/tasks", { method: "POST", headers: {"Content-Type":"text/plain"}, body });
const text = await resp.text();
if (!resp.ok) { msg.textContent = "Error: " + resp.status + "\\n" + text; return; }
const obj = JSON.parse(text);
msg.textContent = "OK: " + fmtJson(obj);
if (obj.task_id) window.location.href = "/ui/tasks/" + obj.task_id;
};
""".strip()
.replace("__TPL_PPO__", tpl_ppo)
.replace("__TPL_GRPO__", tpl_grpo)
.replace("__TPL_SFT__", tpl_sft)
)
return HTMLResponse(content=_page("New Task", "new", body, script))
@app.get("/ui/tasks/{task_id}")
async def ui_task_detail(task_id: str) -> HTMLResponse:
safe_id = html.escape(task_id)
body = f"""
<h1>Task: <code>{safe_id}</code></h1>
<div class="card">
<div class="row">
<a class="btn" href="/ui/tasks/{safe_id}/logs" style="display:inline-block">Logs</a>
<button class="btn" id="refresh">Refresh</button>
<button class="btn danger" id="cancel">Cancel</button>
<a class="btn" href="/ui/tasks" style="display:inline-block">Back</a>
</div>
<div style="height:10px"></div>
<pre id="out" class="muted">Loading...</pre>
</div>
""".strip()
tid = json.dumps(task_id)
script = (
"""
const TASK_ID = __TID__;
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const out = document.getElementById("out");
async function refresh() {
out.textContent = "Loading...";
const resp = await apiFetch("/api/v2/tasks/" + encodeURIComponent(TASK_ID));
const text = await resp.text();
if (!resp.ok) { out.textContent = "Error: " + resp.status + "\\n" + text; return; }
out.textContent = fmtJson(JSON.parse(text));
}
document.getElementById("refresh").onclick = refresh;
document.getElementById("cancel").onclick = async () => {
if (!confirm("Cancel this task?")) return;
const resp = await apiFetch("/api/v2/tasks/" + encodeURIComponent(TASK_ID) + ":cancel", { method: "POST" });
const text = await resp.text();
out.textContent = (resp.ok ? "Canceled.\\n" : "Error: " + resp.status + "\\n") + text;
setTimeout(refresh, 800);
};
refresh();
""".strip().replace("__TID__", tid)
)
return HTMLResponse(content=_page(f"Task {task_id}", "tasks", body, script))
@app.get("/ui/tasks/{task_id}/logs")
async def ui_task_logs(task_id: str) -> HTMLResponse:
safe_id = html.escape(task_id)
body = f"""
<h1>Logs: <code>{safe_id}</code></h1>
<div class="card">
<div class="row">
<button class="btn" id="refresh">Refresh</button>
<label class="muted">Auto refresh <input type="checkbox" id="auto" /></label>
<a class="btn" href="/ui/tasks/{safe_id}" style="display:inline-block">Back</a>
</div>
<div style="height:10px"></div>
<pre id="out" class="muted">Loading...</pre>
</div>
""".strip()
tid = json.dumps(task_id)
script = (
"""
const TASK_ID = __TID__;
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const out = document.getElementById("out");
let timer = null;
async function refresh() {
const resp = await apiFetch("/api/v2/tasks/" + encodeURIComponent(TASK_ID) + "/logs?tail=4000");
const text = await resp.text();
out.textContent = resp.ok ? text : ("Error: " + resp.status + "\\n" + text);
}
document.getElementById("refresh").onclick = refresh;
document.getElementById("auto").onchange = (e) => {
if (e.target.checked) {
timer = setInterval(refresh, 2000);
} else {
if (timer) clearInterval(timer);
timer = null;
}
};
refresh();
""".strip().replace("__TID__", tid)
)
return HTMLResponse(content=_page(f"Logs {task_id}", "tasks", body, script))
@app.get("/ui/data")
async def ui_data() -> HTMLResponse:
body = """
<h1>Data</h1>
<div class="card">
<div class="muted">User files live under your home directory. Keep long-term artifacts in <code>models/</code> or <code>datasets/</code>.</div>
<div style="height:14px"></div>
<div class="row">
<div style="flex:1; min-width:260px">
<div class="muted">Username</div>
<div style="height:6px"></div>
<div class="row" style="gap:8px">
<input id="u" readonly />
<button class="btn" id="copy-u">Copy</button>
</div>
</div>
<div style="flex:1; min-width:260px">
<div class="muted">SFTPGo password</div>
<div style="height:6px"></div>
<div class="row" style="gap:8px">
<input id="p" placeholder="Click Reset to generate..." />
<button class="btn" id="copy-p">Copy</button>
<button class="btn" id="reset-p">Reset</button>
</div>
</div>
</div>
<div style="height:12px"></div>
<div class="row">
<a class="btn" id="sftp-web" target="_blank" rel="noopener" href="#">Open SFTPGo Web Client (:8081)</a>
</div>
<div style="height:12px"></div>
<div class="muted">
You can also use an SFTP client (e.g. FileZilla) with the same username/password.
Host: <code id="sftp-host"></code>, Port: <code id="sftp-port"></code>.
</div>
<div style="height:14px"></div>
<pre id="out" class="muted">Loading...</pre>
</div>
""".strip()
script = """
const out = document.getElementById("out");
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const u = document.getElementById("u");
const p = document.getElementById("p");
const sftpWeb = document.getElementById("sftp-web");
const sftpHost = document.getElementById("sftp-host");
const sftpPort = document.getElementById("sftp-port");
document.getElementById("copy-u").onclick = async () => { await copyText(u.value || ""); };
document.getElementById("copy-p").onclick = async () => { await copyText(p.value || ""); };
async function refresh() {
const resp = await apiFetch("/api/v2/me");
const text = await resp.text();
if (!resp.ok) { out.textContent = "Error: " + resp.status + "\\n" + text; return; }
const obj = JSON.parse(text);
u.value = (obj.user_id || "");
const cached = mvpSftpPasswordGet();
if (cached) p.value = cached;
const host = curOriginWithPort(8081);
sftpWeb.href = host + "/web/client";
sftpHost.textContent = (obj.sftp && obj.sftp.host) ? obj.sftp.host : window.location.hostname;
sftpPort.textContent = (obj.sftp && obj.sftp.port) ? String(obj.sftp.port) : "2022";
out.textContent = fmtJson(obj);
}
document.getElementById("reset-p").onclick = async () => {
p.value = "";
const resp = await apiFetch("/api/v2/me/sftp:reset_password", { method: "POST" });
const text = await resp.text();
if (!resp.ok) { out.textContent = "Error: " + resp.status + "\\n" + text; return; }
const obj = JSON.parse(text);
p.value = obj.password || "";
mvpSftpPasswordSet(p.value);
out.textContent = "SFTPGo password rotated.\\n\\n" + fmtJson(obj);
};
refresh();
""".strip()
return HTMLResponse(content=_page("Data", "data", body, script))
@app.get("/ui/admin")
async def ui_admin() -> HTMLResponse:
body = """
<h1>Admin</h1>
<div class="card">
<div class="muted">
This page requires the <code>admin</code> token (set it in <a href="/ui/login">Login</a>).
</div>
<div style="height:14px"></div>
<h3>Create user</h3>
<div class="row">
<input id="new-user-id" placeholder="user_id (e.g. alice)" style="max-width:320px" />
<input id="new-display-name" placeholder="display_name (optional)" style="max-width:320px" />
<button class="btn" id="create-user">Create</button>
</div>
<div style="height:10px"></div>
<pre id="create-msg" class="muted"></pre>
<div style="height:14px"></div>
<div class="row">
<button class="btn" id="refresh">Refresh</button>
</div>
<div style="height:10px"></div>
<div id="out" class="muted">Loading...</div>
</div>
""".strip()
script = """
const out = document.getElementById("out");
const createMsg = document.getElementById("create-msg");
const userIdEl = document.getElementById("new-user-id");
const displayNameEl = document.getElementById("new-display-name");
function esc(s) {
s = String(s || "");
return s.replaceAll("&","&amp;").replaceAll("<","&lt;").replaceAll(">","&gt;");
}
async function refresh() {
out.textContent = "Loading...";
try {
const obj = await apiJson("/api/v2/users?limit=200");
const users = (obj.users || []);
function row(u) {
const uid = u.user_id;
const tok = u.token || "";
const tokShort = tok ? (tok.length > 18 ? (tok.slice(0, 18) + "…") : tok) : "";
const created = u.created_at || "";
const updated = u.updated_at || "";
const tCreated = u.token_created_at || "";
const tUsed = u.token_last_used_at || "";
return `<tr>
<td><code>${esc(uid)}</code></td>
<td class="muted">${esc(created)}</td>
<td class="muted">${esc(updated)}</td>
<td>
<div class="row" style="gap:8px">
<code title="${esc(tok)}">${esc(tokShort)}</code>
<button class="btn" data-copy="${esc(tok)}">Copy</button>
<button class="btn" data-issue="${esc(uid)}">Issue token</button>
</div>
<div class="muted" style="margin-top:6px">token_created_at: ${esc(tCreated)}; last_used_at: ${esc(tUsed)}</div>
</td>
</tr>`;
}
out.innerHTML = `
<table>
<thead><tr><th>User</th><th>Created</th><th>Updated</th><th>Token</th></tr></thead>
<tbody>${users.map(row).join("") || "<tr><td colspan=4 class=muted>(none)</td></tr>"}</tbody>
</table>
`;
for (const btn of out.querySelectorAll("button[data-copy]")) {
btn.onclick = async () => { await copyText(btn.getAttribute("data-copy") || ""); };
}
for (const btn of out.querySelectorAll("button[data-issue]")) {
btn.onclick = async () => {
const uid = btn.getAttribute("data-issue");
if (!uid) return;
try {
const r = await apiJson("/api/v2/users/" + encodeURIComponent(uid) + "/tokens", { method: "POST" });
createMsg.textContent = "Issued token:\\n" + fmtJson(r);
await refresh();
} catch (e) {
createMsg.textContent = "Error issuing token: " + (e.status || "") + "\\n" + (e.body || String(e));
}
};
}
} catch (e) {
out.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
}
}
document.getElementById("refresh").onclick = refresh;
document.getElementById("create-user").onclick = async () => {
createMsg.textContent = "Creating...";
const user_id = (userIdEl.value || "").trim();
const display_name = (displayNameEl.value || "").trim();
if (!user_id) { createMsg.textContent = "user_id is required"; return; }
const payload = { user_id: user_id };
if (display_name) payload.display_name = display_name;
try {
const r = await apiJson("/api/v2/users", { method: "POST", headers: {"Content-Type":"application/json"}, body: JSON.stringify(payload) });
createMsg.textContent = "Created:\\n" + fmtJson(r);
userIdEl.value = "";
displayNameEl.value = "";
await refresh();
} catch (e) {
createMsg.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
}
};
refresh();
""".strip()
return HTMLResponse(content=_page("Admin", "admin", body, script))
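Every UI page calls the API with the same `Authorization: Bearer <token>` header that `apiFetch()` reads from localStorage, so the same convention works from a plain script. A minimal sketch with `urllib` (the base URL and token here are placeholders, not values from this repo):

```python
# Sketch of the Bearer-token convention used by the WebUI's apiFetch().
# host/port and token are assumed placeholders for illustration.
import urllib.request


def build_request(base_url: str, path: str, token: str) -> urllib.request.Request:
    req = urllib.request.Request(base_url.rstrip("/") + path)
    if token:
        # Same header apiFetch() derives from localStorage's mvp_token.
        req.add_header("Authorization", "Bearer " + token)
    return req


req = build_request("http://head-node:8080", "/api/v2/me", "token1")
assert req.get_header("Authorization") == "Bearer token1"
# urllib.request.urlopen(req) would then perform the authenticated call.
```

The point is that the WebUI adds no session machinery of its own: it is a thin client over the same token-authenticated API.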


@@ -16,6 +16,11 @@ def _write_config(tmp_path: Path) -> Path:
"entrypoint_resources": {"worker_node": 1},
"runtime_env": {"env_vars": {}},
},
"data": {
# Avoid touching real /private in tests. Keep ray.shared_root as /private
# so existing path validation tests remain unchanged.
"user_root": str(tmp_path / "users"),
},
"service": {
"api": {"host": "127.0.0.1", "port": 0},
"auth": {"token_env": "MVP_INTERNAL_TOKEN"},
@@ -95,6 +100,17 @@ def test_task_submit_get_cancel_logs_queue(tmp_path: Path, monkeypatch):
assert r3.status_code == 200
assert "pending" in r3.json()
r3b = c.get("/api/v2/tasks?limit=10", headers=headers)
assert r3b.status_code == 200
assert any(t.get("task_id") == "tid1" for t in r3b.json().get("tasks", []))
r3c = c.get("/api/v2/tasks?limit=10&offset=0&states=QUEUED", headers=headers)
assert r3c.status_code == 200
assert all(t.get("state") == "QUEUED" for t in r3c.json().get("tasks", []))
r3d = c.get("/api/v2/tasks?states=NOPE", headers=headers)
assert r3d.status_code == 400
r4 = c.post("/api/v2/tasks/tid1:cancel", headers=headers)
assert r4.status_code == 200
assert r4.json()["state"] == "CANCELED"
@@ -118,6 +134,14 @@ def test_task_submit_get_cancel_logs_queue(tmp_path: Path, monkeypatch):
db.create_attempt(task_id="tid2", attempt_no=1, ray_submission_id="sid2")
db.set_task_state(task_id="tid2", state="RUNNING", latest_attempt_no=1)
r6 = c.get("/api/v2/tasks?limit=1&offset=0&states=RUNNING", headers=headers)
assert r6.status_code == 200
assert any(t.get("task_id") == "tid2" for t in r6.json().get("tasks", []))
r7 = c.get("/api/v2/tasks?limit=1&offset=1&states=RUNNING", headers=headers)
assert r7.status_code == 200
assert "has_more" in r7.json()
r5 = c.get("/api/v2/tasks/tid2/logs?tail=1", headers=headers)
assert r5.status_code == 200
assert r5.text.strip() == "c"
@@ -163,3 +187,102 @@
with TestClient(app) as c:
r = c.post("/api/v2/tasks", headers={"authorization": "Bearer token1"}, data="workload: nope\n")
assert r.status_code == 400
def test_me_sftp_reset_password_disabled_returns_400(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg_path = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "token1")
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
# seed user + token
from argus.service.config import V2Config
from argus.service.db import Db
root = yaml.safe_load(cfg_path.read_text(encoding="utf-8"))
v2_cfg = V2Config.from_root_dict(root)
db = Db(v2_cfg.sqlite.db_path)
db.init()
db.create_user(user_id="u1", display_name=None)
token = db.issue_token(user_id="u1")
with TestClient(app) as c:
r = c.post("/api/v2/me/sftp:reset_password", headers={"authorization": f"Bearer {token}"})
assert r.status_code == 400
def test_me_sftp_reset_password_enabled_returns_password(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg = yaml.safe_load(_write_config(tmp_path).read_text(encoding="utf-8"))
cfg["data"]["sftpgo"] = {
"enabled": True,
"host": "127.0.0.1",
"sftp_port": 2022,
"admin_api_base": "http://127.0.0.1:8081",
"admin_user": "admin",
"admin_password_env": "SFTPGO_ADMIN_PASSWORD",
}
cfg_path = tmp_path / "cfg_sftp.yaml"
cfg_path.write_text(yaml.safe_dump(cfg), encoding="utf-8")
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "token1")
monkeypatch.setenv("SFTPGO_ADMIN_PASSWORD", "pw1")
class _FakeSFTPGo:
def __init__(self, **kwargs):
self.reset = []
self.enabled = []
def reset_password(self, username: str, new_password: str, home_dir: str):
assert username
assert new_password
assert home_dir
self.reset.append((username, home_dir))
def enable_user(self, username: str, home_dir: str):
self.enabled.append((username, home_dir))
fake_client = _FakeSFTPGo()
class _FakeSFTPGoFactory:
def __call__(self, **kwargs):
return fake_client
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
monkeypatch.setattr(app_mod, "SFTPGoAdminClient", _FakeSFTPGoFactory())
app = app_mod.create_app(str(cfg_path))
# seed user in DB
from argus.service.db import Db
from argus.service.config import V2Config
v2_cfg = V2Config.from_root_dict(cfg)
db = Db(v2_cfg.sqlite.db_path)
db.init()
db.create_user(user_id="u1", display_name=None)
token = db.issue_token(user_id="u1")
with TestClient(app) as c:
r = c.post("/api/v2/me/sftp:reset_password", headers={"authorization": f"Bearer {token}"})
assert r.status_code == 200
j = r.json()
assert j["user_id"] == "u1"
assert isinstance(j["password"], str) and len(j["password"]) >= 8


@@ -0,0 +1,225 @@
from __future__ import annotations
from datetime import datetime, timedelta, timezone
from pathlib import Path
from argus.service.db import Db
from argus.service.janitor import JobsJanitor
def _iso_z(dt: datetime) -> str:
return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")
def _mk_job_dir(user_root: Path, user_id: str, sid: str) -> Path:
p = user_root / user_id / "jobs" / sid
p.mkdir(parents=True, exist_ok=True)
(p / "marker.txt").write_text("x", encoding="utf-8")
return p
def _mk_trash_dir(user_root: Path, user_id: str, sid: str) -> Path:
p = user_root / user_id / "trash" / "jobs" / sid
p.mkdir(parents=True, exist_ok=True)
(p / "marker.txt").write_text("x", encoding="utf-8")
return p
def test_janitor_moves_jobs_to_trash_after_3_days(tmp_path: Path) -> None:
db_path = tmp_path / "mvp.sqlite3"
user_root = tmp_path / "users"
db = Db(str(db_path))
db.init()
task_id = "t1"
user_id = "alice"
sid = "sid-a01"
db.create_task_v25(task_id=task_id, user_id=user_id, workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
ended = now - timedelta(days=4)
db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="SUCCEEDED")
src = _mk_job_dir(user_root, user_id, sid)
jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
jan.tick_once(now=now)
assert not src.exists()
dst = user_root / user_id / "trash" / "jobs" / sid
assert dst.exists()
assert (dst / "marker.txt").read_text(encoding="utf-8") == "x"
def test_janitor_purges_from_trash_after_7_days(tmp_path: Path) -> None:
db_path = tmp_path / "mvp.sqlite3"
user_root = tmp_path / "users"
db = Db(str(db_path))
db.init()
task_id = "t2"
user_id = "alice"
sid = "sid-a01"
db.create_task_v25(task_id=task_id, user_id=user_id, workload="ppo", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
ended = now - timedelta(days=8)
db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="FAILED")
_mk_trash_dir(user_root, user_id, sid)
jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
jan.tick_once(now=now)
dst = user_root / user_id / "trash" / "jobs" / sid
assert not dst.exists()
def test_janitor_does_not_touch_recent_jobs(tmp_path: Path) -> None:
db_path = tmp_path / "mvp.sqlite3"
user_root = tmp_path / "users"
db = Db(str(db_path))
db.init()
task_id = "t3"
user_id = "alice"
sid = "sid-a01"
db.create_task_v25(task_id=task_id, user_id=user_id, workload="grpo", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
ended = now - timedelta(days=1)
db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="FAILED")
src = _mk_job_dir(user_root, user_id, sid)
jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
jan.tick_once(now=now)
assert src.exists()
assert not (user_root / user_id / "trash" / "jobs" / sid).exists()
def test_janitor_skips_tasks_without_user_id(tmp_path: Path) -> None:
db_path = tmp_path / "mvp.sqlite3"
user_root = tmp_path / "users"
db = Db(str(db_path))
db.init()
task_id = "legacy"
sid = "sid-legacy"
db.create_task(task_id=task_id, workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
ended = now - timedelta(days=10)
db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="SUCCEEDED")
# Even if someone created a matching directory under user_root, janitor shouldn't touch it because user_id is NULL.
src = _mk_job_dir(user_root, "alice", sid)
jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
jan.tick_once(now=now)
assert src.exists()
def test_janitor_validates_retention_days(tmp_path: Path) -> None:
db = Db(str(tmp_path / "mvp.sqlite3"))
db.init()
try:
JobsJanitor(db=db, user_root="/tmp", trash_after_days=-1, purge_after_days=7, interval_s=1)
raise AssertionError("expected ValueError")
except ValueError:
pass
try:
JobsJanitor(db=db, user_root="/tmp", trash_after_days=3, purge_after_days=1, interval_s=1)
raise AssertionError("expected ValueError")
except ValueError:
pass
def test_janitor_noop_when_disabled(tmp_path: Path) -> None:
db = Db(str(tmp_path / "mvp.sqlite3"))
db.init()
user_root = tmp_path / "users"
jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=0, purge_after_days=0, interval_s=1)
jan.tick_once(now=datetime(2025, 1, 1, tzinfo=timezone.utc))
def test_janitor_handles_invalid_end_time_and_missing_fields(tmp_path: Path) -> None:
db_path = tmp_path / "mvp.sqlite3"
user_root = tmp_path / "users"
db = Db(str(db_path))
db.init()
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
cutoff_ended = now - timedelta(days=10)
# Missing end_time (empty string) => should be skipped.
db.create_task_v25(task_id="t4", user_id="alice", workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
db.create_attempt(task_id="t4", attempt_no=1, ray_submission_id="sid-empty")
db.update_attempt(task_id="t4", attempt_no=1, end_time="", ray_status="SUCCEEDED")
# Invalid ISO but still lexicographically <= cutoff => should be skipped.
db.create_task_v25(task_id="t5", user_id="alice", workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
db.create_attempt(task_id="t5", attempt_no=1, ray_submission_id="sid-bad")
db.update_attempt(task_id="t5", attempt_no=1, end_time="2025-01-01T00:00:00ZZ", ray_status="FAILED")
_mk_job_dir(user_root, "alice", "sid-empty")
_mk_job_dir(user_root, "alice", "sid-bad")
jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
jan.tick_once(now=cutoff_ended + timedelta(days=10))
assert (user_root / "alice" / "jobs" / "sid-empty").exists()
assert (user_root / "alice" / "jobs" / "sid-bad").exists()
def test_janitor_purge_moves_from_jobs_then_deletes(tmp_path: Path, monkeypatch) -> None:
db_path = tmp_path / "mvp.sqlite3"
user_root = tmp_path / "users"
db = Db(str(db_path))
db.init()
task_id = "t6"
user_id = "alice"
sid = "sid-a01"
db.create_task_v25(task_id=task_id, user_id=user_id, workload="ppo", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
ended = now - timedelta(days=9)
db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="SUCCEEDED")
src = _mk_job_dir(user_root, user_id, sid)
jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
jan.tick_once(now=now)
assert not src.exists()
assert not (user_root / user_id / "trash" / "jobs" / sid).exists()
def test_janitor_run_forever_requires_event_like(tmp_path: Path) -> None:
db = Db(str(tmp_path / "mvp.sqlite3"))
db.init()
jan = JobsJanitor(db=db, user_root=str(tmp_path / "users"), trash_after_days=3, purge_after_days=7, interval_s=1)
try:
jan.run_forever(object())
raise AssertionError("expected ValueError")
except ValueError:
pass
def test_janitor_run_forever_survives_tick_exceptions(tmp_path: Path, monkeypatch) -> None:
db = Db(str(tmp_path / "mvp.sqlite3"))
db.init()
jan = JobsJanitor(db=db, user_root=str(tmp_path / "users"), trash_after_days=3, purge_after_days=7, interval_s=1)
class Flag:
def __init__(self) -> None:
self.n = 0
def is_set(self) -> bool:
self.n += 1
return self.n >= 2
monkeypatch.setattr(jan, "tick_once", lambda **_: (_ for _ in ()).throw(RuntimeError("boom")))
monkeypatch.setattr("argus.service.janitor.time.sleep", lambda *_: None)
jan.run_forever(Flag())

View File

@@ -38,3 +38,18 @@ def test_v2_config_requires_mappings():
V2Config.from_root_dict({"service": ["nope"]})
with pytest.raises(ValueError, match="config\\.service\\.\\{api,auth,sqlite,scheduler\\} must be mappings"):
V2Config.from_root_dict({"service": {"api": [1], "auth": {}, "sqlite": {}, "scheduler": {}}})
def test_v2_config_requires_data_mappings():
from argus.service.config import V2Config
base = {
"ray": {"shared_root": "/private"},
"service": {"api": {}, "auth": {}, "sqlite": {}, "scheduler": {}},
}
with pytest.raises(ValueError, match="config\\.data must be a mapping"):
V2Config.from_root_dict({**base, "data": ["nope"]})
with pytest.raises(ValueError, match="config\\.data\\.\\{sftpgo,retention\\} must be mappings"):
V2Config.from_root_dict({**base, "data": {"sftpgo": ["x"], "retention": {}}})

View File

@@ -0,0 +1,322 @@
from __future__ import annotations
from pathlib import Path
import yaml
from fastapi.testclient import TestClient
def _write_config(tmp_path: Path, *, enabled: bool) -> Path:
cfg = {
"ray": {
"address": "http://127.0.0.1:8265",
"shared_root": "/private",
"entrypoint_resources": {"worker_node": 1},
"runtime_env": {"env_vars": {}},
},
"data": {
"user_root": str(tmp_path / "users"),
"sftpgo": {
"enabled": bool(enabled),
"admin_api_base": "http://127.0.0.1:8081/api/v2",
"admin_user": "admin",
"admin_password_env": "SFTPGO_ADMIN_PASSWORD",
"host": "h1.internal",
"sftp_port": 2022,
},
},
"service": {
"api": {"host": "127.0.0.1", "port": 0},
"auth": {"token_env": "MVP_INTERNAL_TOKEN"},
"sqlite": {"db_path": str(tmp_path / "mvp.sqlite3")},
"scheduler": {"tick_s": 1, "retry_interval_s": 1, "max_running_tasks": 1},
},
}
p = tmp_path / "cfg.yaml"
p.write_text(yaml.safe_dump(cfg), encoding="utf-8")
return p
def test_create_user_calls_sftpgo_when_enabled(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg_path = _write_config(tmp_path, enabled=True)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
monkeypatch.setenv("SFTPGO_ADMIN_PASSWORD", "pw1")
calls = {"create": [], "disable": [], "reset": []}
class _Client:
def __init__(self, **kwargs):
pass
def create_user(self, *, username: str, password: str, home_dir: str) -> None:
calls["create"].append((username, password, home_dir))
def enable_user(self, *, username: str, home_dir: str) -> None:
# Not used in this test, but required by app for idempotent upsert.
return None
def disable_user(self, *, username: str, home_dir: str) -> None:
calls["disable"].append(username)
def reset_password(self, *, username: str, new_password: str, home_dir: str) -> None:
calls["reset"].append((username, new_password))
monkeypatch.setattr(app_mod, "SFTPGoAdminClient", _Client)
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
r = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"})
assert r.status_code == 200
assert calls["create"]
username, password, home_dir = calls["create"][-1]
assert username == "alice"
assert password
assert home_dir.endswith("/users/alice")
r2 = c.post("/api/v2/users/alice:disable", headers=admin_headers)
assert r2.status_code == 200
assert calls["disable"] == ["alice"]
r3 = c.post("/api/v2/users/alice/sftp:reset_password", headers=admin_headers)
assert r3.status_code == 200
body = r3.json()
assert body["user_id"] == "alice"
assert body["password"]
assert calls["reset"] and calls["reset"][-1][0] == "alice"
def test_create_user_upserts_when_sftpgo_user_already_exists(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
from argus.service.sftpgo import SFTPGoError
cfg_path = _write_config(tmp_path, enabled=True)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
monkeypatch.setenv("SFTPGO_ADMIN_PASSWORD", "pw1")
calls = {"create": 0, "reset": [], "enable": []}
class _Client:
def __init__(self, **kwargs):
pass
def create_user(self, *, username: str, password: str, home_dir: str) -> None:
calls["create"] += 1
raise SFTPGoError("sftpgo http error: 409 Conflict")
def reset_password(self, *, username: str, new_password: str, home_dir: str) -> None:
calls["reset"].append((username, new_password))
def enable_user(self, *, username: str, home_dir: str) -> None:
calls["enable"].append(username)
def disable_user(self, *, username: str, home_dir: str) -> None:
return None
monkeypatch.setattr(app_mod, "SFTPGoAdminClient", _Client)
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
r = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "bob"})
assert r.status_code == 200
body = r.json()
assert body["user_id"] == "bob"
assert body.get("sftp", {}).get("password")
assert calls["create"] == 1
assert calls["reset"] and calls["reset"][-1][0] == "bob"
assert calls["enable"] == ["bob"]
def test_sftpgo_enabled_requires_admin_password_env(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg_path = _write_config(tmp_path, enabled=True)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
monkeypatch.delenv("SFTPGO_ADMIN_PASSWORD", raising=False)
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
r = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"})
assert r.status_code == 500
assert "SFTPGO_ADMIN_PASSWORD" in r.text
def test_sftpgo_admin_client_builds_requests(monkeypatch):
import json
from urllib.request import Request
from argus.service.sftpgo import SFTPGoAdminClient
import argus.service.sftpgo as mod
seen: list[Request] = []
class _Resp:
def __init__(self, body: bytes = b"ok"):
self._body = body
def __enter__(self):
return self
def __exit__(self, exc_type, exc, tb):
return False
def read(self):
return self._body
def fake_urlopen(req, timeout=0):
seen.append(req)
if req.full_url.endswith("/token"):
return _Resp(body=b'{"access_token":"t1","expires_at":0}')
# allow folder creates to be idempotent in tests
if req.full_url.endswith("/folders") and req.get_method() == "POST":
return _Resp(body=b'{"message":"created"}')
if req.full_url.endswith("/users/alice") and req.get_method() == "GET":
return _Resp(
body=b'{"username":"alice","status":1,"home_dir":"/private/users/alice","uid":0,"gid":0,"permissions":{"/":["*"]},"virtual_folders":[]}'
)
return _Resp()
monkeypatch.setattr(mod, "urlopen", fake_urlopen)
c = SFTPGoAdminClient(admin_api_base="http://sftpgo.local/api/v2", admin_user="admin", admin_password="pw", common_root="/private/common")
c.create_user(username="alice", password="p1", home_dir="/private/users/alice")
c.disable_user(username="alice", home_dir="/private/users/alice")
c.enable_user(username="alice", home_dir="/private/users/alice")
c.reset_password(username="alice", new_password="p2", home_dir="/private/users/alice")
# Each operation fetches a token, then issues a request.
# For create_user: token + ensure folders (2 POSTs) + create user
# For disable/enable/reset: token + ensure folders (2 POSTs) + GET user + PUT user
assert len(seen) == 1 + 2 + 1 + 3 * (1 + 2 + 1 + 1)
assert seen[0].full_url.endswith("/token")
assert seen[0].headers.get("Authorization", "").startswith("Basic ")
# create_user
assert seen[1].full_url.endswith("/folders")
assert seen[2].full_url.endswith("/folders")
assert seen[3].full_url.endswith("/users")
assert seen[3].headers.get("Authorization", "") == "Bearer t1"
created = json.loads(seen[3].data.decode("utf-8"))
assert created["username"] == "alice"
assert "/common" in created.get("permissions", {})
assert "/common/datasets" in created.get("permissions", {})
assert "/common/hf" in created.get("permissions", {})
# disable_user
assert seen[4].full_url.endswith("/token")
assert seen[5].full_url.endswith("/folders")
assert seen[6].full_url.endswith("/folders")
assert seen[7].full_url.endswith("/users/alice")
assert seen[7].get_method() == "GET"
assert seen[8].full_url.endswith("/users/alice")
assert seen[8].get_method() == "PUT"
# enable_user
assert seen[9].full_url.endswith("/token")
assert seen[10].full_url.endswith("/folders")
assert seen[11].full_url.endswith("/folders")
assert seen[12].full_url.endswith("/users/alice")
assert seen[12].get_method() == "GET"
assert seen[13].full_url.endswith("/users/alice")
assert seen[13].get_method() == "PUT"
# reset_password
assert seen[14].full_url.endswith("/token")
assert seen[15].full_url.endswith("/folders")
assert seen[16].full_url.endswith("/folders")
assert seen[17].full_url.endswith("/users/alice")
assert seen[17].get_method() == "GET"
assert seen[18].full_url.endswith("/users/alice")
assert seen[18].get_method() == "PUT"
def test_sftpgo_admin_client_http_error_raises(monkeypatch):
from urllib.error import HTTPError
from argus.service.sftpgo import SFTPGoAdminClient, SFTPGoError
import argus.service.sftpgo as mod
def fake_urlopen(req, timeout=0):
if req.full_url.endswith("/token"):
class _Resp:
def __enter__(self):
return self
def __exit__(self, exc_type, exc, tb):
return False
def read(self):
return b'{"access_token":"t1","expires_at":0}'
return _Resp()
raise HTTPError(req.full_url, 500, "boom", hdrs=None, fp=None)
monkeypatch.setattr(mod, "urlopen", fake_urlopen)
c = SFTPGoAdminClient(admin_api_base="http://sftpgo.local/api/v2", admin_user="admin", admin_password="pw", common_root="/private/common")
try:
c.create_user(username="alice", password="p1", home_dir="/private/users/alice")
assert False, "expected SFTPGoError"
except SFTPGoError as e:
assert "http error" in str(e)
def test_sftpgo_admin_client_url_error_raises(monkeypatch):
from urllib.error import URLError
from argus.service.sftpgo import SFTPGoAdminClient, SFTPGoError
import argus.service.sftpgo as mod
def fake_urlopen(req, timeout=0):
if req.full_url.endswith("/token"):
class _Resp:
def __enter__(self):
return self
def __exit__(self, exc_type, exc, tb):
return False
def read(self):
return b'{"access_token":"t1","expires_at":0}'
return _Resp()
raise URLError("no route")
monkeypatch.setattr(mod, "urlopen", fake_urlopen)
c = SFTPGoAdminClient(admin_api_base="http://sftpgo.local/api/v2", admin_user="admin", admin_password="pw")
try:
c.disable_user(username="alice", home_dir="/private/users/alice")
assert False, "expected SFTPGoError"
except SFTPGoError as e:
assert "connection error" in str(e)
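For reference, the token handshake these tests fake is a small piece of stdlib code. This is a hedged sketch of the Basic-auth token fetch: the `/token` route, Basic-auth request, and `access_token` response field follow SFTPGo's admin REST API as exercised above; the function name and `opener` injection point are illustrative assumptions:

```python
import base64
import json
from urllib.request import Request, urlopen


def fetch_admin_token(api_base: str, user: str, password: str, opener=urlopen) -> str:
    # SFTPGo's admin REST API issues a JWT from /api/v2/token,
    # authenticated with HTTP Basic credentials.
    creds = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = Request(f"{api_base}/token", headers={"Authorization": f"Basic {creds}"})
    with opener(req, timeout=10) as resp:
        return json.loads(resp.read())["access_token"]
```

Subsequent user CRUD calls then carry `Authorization: Bearer <token>`, which is exactly what the request-shape assertions above check.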

View File

@@ -0,0 +1,78 @@
from __future__ import annotations
from pathlib import Path
from fastapi.testclient import TestClient
from argus.service.app import create_app
def _write_config(tmp_path: Path) -> Path:
p = tmp_path / "cfg.yaml"
p.write_text(
"""
ray:
address: "http://127.0.0.1:8265"
shared_root: "/private"
entrypoint_num_cpus: 1
entrypoint_resources: { worker_node: 1 }
runtime_env: { env_vars: { PYTHONUNBUFFERED: "1" } }
service:
api: { host: "127.0.0.1", port: 8080 }
auth: { token_env: "MVP_INTERNAL_TOKEN" }
sqlite: { db_path: "%(db)s" }
data:
user_root: "%(users)s"
sftpgo: { enabled: false }
retention: { jobs_trash_after_days: 3, jobs_purge_after_days: 7, janitor_interval_s: 3600 }
"""
% {"db": str(tmp_path / "mvp.sqlite3"), "users": str(tmp_path / "users")}
)
return p
def test_ui_routes_render_200(tmp_path, monkeypatch):
cfg = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "admin-token")
app = create_app(str(cfg))
c = TestClient(app)
for path in (
"/ui",
"/ui/login",
"/ui/tasks",
"/ui/tasks/new",
"/ui/data",
"/ui/admin",
"/ui/tasks/any-task-id",
"/ui/tasks/any-task-id/logs",
):
r = c.get(path, allow_redirects=True)
assert r.status_code == 200
assert "<html" in r.text.lower()
def test_ui_contains_sidebar_links(tmp_path, monkeypatch):
cfg = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "admin-token")
app = create_app(str(cfg))
c = TestClient(app)
r = c.get("/ui/tasks")
assert r.status_code == 200
for link in ("/ui/tasks", "/ui/tasks/new", "/ui/data", "/ui/login", "/ui/admin"):
assert link in r.text
assert "Ray Dashboard" in r.text
def test_ui_task_detail_shows_ids(tmp_path, monkeypatch):
cfg = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "admin-token")
app = create_app(str(cfg))
c = TestClient(app)
task_id = "mvp3-ppo-20250101-000000-aaaa"
r = c.get(f"/ui/tasks/{task_id}")
assert r.status_code == 200
assert task_id in r.text
assert f"/ui/tasks/{task_id}/logs" in r.text

View File

@@ -14,6 +14,11 @@ def _write_config(tmp_path: Path) -> Path:
"entrypoint_resources": {"worker_node": 1},
"runtime_env": {"env_vars": {}},
},
"data": {
# Avoid touching real /private in tests. Keep ray.shared_root as /private
# so existing path validation tests remain unchanged.
"user_root": str(tmp_path / "users"),
},
"service": {
"api": {"host": "127.0.0.1", "port": 0},
"auth": {"token_env": "MVP_INTERNAL_TOKEN"},
@@ -59,6 +64,9 @@ def test_admin_create_user_issue_token_and_disabled_rejected(tmp_path: Path, mon
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
# list users requires admin
assert c.get("/api/v2/users", headers={"authorization": "Bearer nope"}).status_code in (401, 403)
r1 = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice", "display_name": "Alice"})
assert r1.status_code == 200
assert r1.json()["user_id"] == "alice"
@@ -68,6 +76,11 @@ def test_admin_create_user_issue_token_and_disabled_rejected(tmp_path: Path, mon
token = r2.json()["token"]
assert token
r2b = c.get("/api/v2/users", headers=admin_headers)
assert r2b.status_code == 200
users = r2b.json()["users"]
assert any(u.get("user_id") == "alice" for u in users)
# non-admin token can access regular endpoints
r3 = c.get("/api/v2/queue", headers={"authorization": f"Bearer {token}"})
assert r3.status_code == 200
@@ -177,3 +190,165 @@ def test_submit_rejects_non_common_inputs(tmp_path: Path, monkeypatch):
)
assert r.status_code == 400
assert "code_path must start with /private/common/" in r.text
def test_submit_accepts_user_dataset_paths_and_local_model_paths(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg_path = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
alice_tok = c.post("/api/v2/users/alice/tokens", headers=admin_headers).json()["token"]
alice_headers = {"authorization": f"Bearer {alice_tok}"}
# User dataset paths are allowed.
r1 = c.post(
"/api/v2/tasks",
headers=alice_headers,
data=(
"workload: ppo\n"
"code_path: /private/common/code/verl\n"
"model_id: Qwen/Qwen2.5-0.5B-Instruct\n"
"train_file: /private/users/alice/datasets/t\n"
),
)
assert r1.status_code == 200
# Local model paths under user models/ are allowed (no TaskSpec schema change).
r2 = c.post(
"/api/v2/tasks",
headers=alice_headers,
data=(
"workload: ppo\n"
"code_path: /private/common/code/verl\n"
"model_id: /private/users/alice/models/m1\n"
"train_file: /private/common/datasets/t\n"
),
)
assert r2.status_code == 200
def test_submit_rejects_cross_user_paths_and_bad_local_model_dirs(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg_path = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "bob"}).status_code == 200
alice_tok = c.post("/api/v2/users/alice/tokens", headers=admin_headers).json()["token"]
bob_tok = c.post("/api/v2/users/bob/tokens", headers=admin_headers).json()["token"]
bob_headers = {"authorization": f"Bearer {bob_tok}"}
# Cross-user dataset path should be rejected.
r1 = c.post(
"/api/v2/tasks",
headers=bob_headers,
data=(
"workload: ppo\n"
"code_path: /private/common/code/verl\n"
"model_id: Qwen/Qwen2.5-0.5B-Instruct\n"
"train_file: /private/users/alice/datasets/t\n"
),
)
assert r1.status_code == 400
assert "/private/users/bob/datasets/" in r1.text
# Local model path must be under models/.
r2 = c.post(
"/api/v2/tasks",
headers=bob_headers,
data=(
"workload: ppo\n"
"code_path: /private/common/code/verl\n"
"model_id: /private/users/bob/jobs/j1/checkpoints\n"
"train_file: /private/common/datasets/t\n"
),
)
assert r2.status_code == 400
assert "model_id local path must start with" in r2.text
def test_me_returns_paths_and_retention(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg_path = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
alice_tok = c.post("/api/v2/users/alice/tokens", headers=admin_headers).json()["token"]
r = c.get("/api/v2/me", headers={"authorization": f"Bearer {alice_tok}"})
assert r.status_code == 200
obj = r.json()
assert obj["user_id"] == "alice"
assert obj["paths"]["home"].endswith("/users/alice")
assert obj["paths"]["jobs"].endswith("/users/alice/jobs")
assert obj["paths"]["trash_jobs"].endswith("/users/alice/trash/jobs")
assert obj["retention"]["jobs_trash_after_days"] == 3
assert obj["retention"]["jobs_purge_after_days"] == 7
def test_create_user_creates_user_dirs(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg_path = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
base = tmp_path / "users" / "alice"
assert (base / "datasets").is_dir()
assert (base / "models").is_dir()
assert (base / "code").is_dir()
assert (base / "jobs").is_dir()
assert (base / "trash" / "jobs").is_dir()
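The submit-time path rules the tests above encode can be collapsed into one validator. This is a sketch of the observed behavior only (function name and exact error strings are assumptions; the real server-side checks may cover more cases):

```python
def validate_submit_paths(user_id: str, code_path: str, model_id: str, train_file: str) -> None:
    common = "/private/common/"
    user_home = f"/private/users/{user_id}/"
    # Code must come from the shared common tree.
    if not code_path.startswith(common):
        raise ValueError("code_path must start with /private/common/")
    # Datasets may be shared, or live under the submitting user's own datasets/.
    if not (train_file.startswith(common) or train_file.startswith(user_home + "datasets/")):
        raise ValueError(f"train_file must be under {common} or {user_home}datasets/")
    # model_id is either a HF repo id, or a local path under the user's models/.
    if model_id.startswith("/") and not model_id.startswith(user_home + "models/"):
        raise ValueError(f"model_id local path must start with {user_home}models/")
```

Note the asymmetry: cross-user dataset paths are rejected (Bob cannot read Alice's `datasets/`), and local model paths must point at `models/`, not at `jobs/<sid>/checkpoints`, which is why long-lived weights are expected to be moved into `models/` before reuse.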

View File

@@ -9,4 +9,4 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/lib.sh"
dexec "${HEAD_CONTAINER}" bash -lc "python3 -m pip install -U pip >/dev/null 2>&1 || true"
dexec "${HEAD_CONTAINER}" bash -lc "python3 -m pip install -r /workspace/mvp/py/requirements.txt"
dexec "${HEAD_CONTAINER}" bash -lc "python3 -c 'import fastapi,uvicorn,yaml' >/dev/null 2>&1 && echo 'api_deps_ok: skip' || (python3 -m pip install -r /workspace/mvp/py/requirements.txt || echo 'WARN: api deps install failed; continuing with preinstalled deps')"

View File

@@ -22,7 +22,12 @@ if [[ -z "${MVP_INTERNAL_TOKEN:-}" ]]; then
exit 1
fi
docker exec -d -e MVP_INTERNAL_TOKEN="${MVP_INTERNAL_TOKEN}" "${HEAD_CONTAINER}" bash -lc "nohup python3 /workspace/mvp/py/server.py --config '${CONFIG_IN_CONTAINER}' >>'${LOG_PATH}' 2>&1 & echo \$! >'${PID_PATH}'"
env_args=(-e "MVP_INTERNAL_TOKEN=${MVP_INTERNAL_TOKEN}")
if [[ -n "${SFTPGO_ADMIN_PASSWORD:-}" ]]; then
env_args+=(-e "SFTPGO_ADMIN_PASSWORD=${SFTPGO_ADMIN_PASSWORD}")
fi
docker exec -d "${env_args[@]}" "${HEAD_CONTAINER}" bash -lc "nohup python3 /workspace/mvp/py/server.py --config '${CONFIG_IN_CONTAINER}' >>'${LOG_PATH}' 2>&1 & echo \$! >'${PID_PATH}'"
echo "[host] started; pid stored in ${PID_PATH} (container path)"
echo "[host] logs: ${LOG_PATH} (container path)"

View File

@@ -0,0 +1,242 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
# E2E v3.0:
# - Start Ray (head + stateless workers) + SFTPGo (compose)
# - Start API server with v3.0 config
# - Create user (API triggers SFTPGo user create) and return one-time SFTP password
# - (Optional) verify SFTP login
# - Submit PPO/GRPO/SFT referencing user dataset paths and wait
API_ADDR="${API_ADDR:-http://127.0.0.1:8080}"
ADMIN_TOKEN="${MVP_INTERNAL_TOKEN:-}"
USER_ID="${USER_ID:-alice}"
RESET_DB="${RESET_DB:-1}"
RESET_SFTPGO="${RESET_SFTPGO:-0}"
EXPECTED_RAY_NODES="${EXPECTED_RAY_NODES:-3}" # head + 2 workers
CLUSTER_NAME="${CLUSTER_NAME:-argus-ray}"
CONFIG_IN_CONTAINER="${CONFIG_IN_CONTAINER:-/workspace/mvp/configs/dev_v30.yaml}"
SFTPGO_ADMIN_PASSWORD="${SFTPGO_ADMIN_PASSWORD:-my-dev-sftpgo-admin}"
export SFTPGO_ADMIN_PASSWORD
if [[ -z "${ADMIN_TOKEN}" ]]; then
echo "ERROR: MVP_INTERNAL_TOKEN must be set in host env (admin token)" >&2
exit 1
fi
api_curl_admin() {
curl -sS -H "Authorization: Bearer ${ADMIN_TOKEN}" "$@"
}
api_wait_ready() {
local tries="${1:-60}"
for i in $(seq 1 "${tries}"); do
if curl -sS -m 2 "${API_ADDR}/docs" >/dev/null 2>&1; then
echo "[host] api_ready: ${API_ADDR}"
return 0
fi
echo "[host] waiting api... (${i}/${tries})"
sleep 2
done
echo "ERROR: api not ready: ${API_ADDR}" >&2
return 1
}
sftpgo_wait_ready() {
local tries="${1:-60}"
local url="${2:-http://127.0.0.1:8081/api/v2/token}"
for i in $(seq 1 "${tries}"); do
# SFTPGo admin endpoints require auth; readiness means "HTTP reachable and can issue token".
if curl -sS -m 2 -u "admin:${SFTPGO_ADMIN_PASSWORD}" "${url}" >/dev/null 2>&1; then
echo "[host] sftpgo_ready: ${url} (token ok)"
return 0
fi
echo "[host] waiting sftpgo... (${i}/${tries})"
sleep 2
done
echo "ERROR: sftpgo not ready: ${url}" >&2
return 1
}
ray_wait_ready() {
local tries="${1:-60}"
for i in $(seq 1 "${tries}"); do
if curl -sS -m 2 "${RAY_DASHBOARD_ADDR}/api/version" >/dev/null 2>&1; then
echo "[host] ray_dashboard_ready: ${RAY_DASHBOARD_ADDR}"
return 0
fi
echo "[host] waiting ray dashboard... (${i}/${tries})"
sleep 2
done
echo "ERROR: ray dashboard not ready: ${RAY_DASHBOARD_ADDR}" >&2
return 1
}
ray_wait_nodes() {
local want="${1:-3}"
local tries="${2:-60}"
for i in $(seq 1 "${tries}"); do
local out n
out="$(docker exec -i "${HEAD_CONTAINER}" python3 -c "import ray; ray.init(address='auto', ignore_reinit_error=True, log_to_driver=False, logging_level='ERROR'); print(sum(1 for n in ray.nodes() if n.get('Alive')))" 2>/dev/null || true)"
n="$(printf '%s\n' "${out}" | tail -n 1 | tr -cd '0-9' || true)"
if [[ "${n}" =~ ^[0-9]+$ ]]; then
echo "[host] ray_nodes_alive=${n} (want>=${want})"
if [[ "${n}" -ge "${want}" ]]; then
return 0
fi
else
echo "[host] waiting ray nodes... (${i}/${tries})"
fi
sleep 2
done
echo "ERROR: ray nodes not ready (want>=${want})" >&2
docker exec -i "${HEAD_CONTAINER}" bash -lc "ray status || true" >&2 || true
return 1
}
submit_taskspec_inline() {
local token="$1"
local yaml_body="$2"
local resp
resp="$(curl -sS -H "Authorization: Bearer ${token}" -H "Content-Type: application/yaml" --data-binary "${yaml_body}" "${API_ADDR}/api/v2/tasks")"
echo "[host] submit_resp: ${resp}" >&2
printf '%s' "${resp}" | python3 -c 'import sys,json; print(json.load(sys.stdin)["task_id"])'
}
wait_task() {
local token="$1"
local task_id="$2"
while true; do
local body state
body="$(curl -sS -H "Authorization: Bearer ${token}" "${API_ADDR}/api/v2/tasks/${task_id}")"
state="$(printf '%s' "${body}" | python3 -c 'import sys,json; print(json.load(sys.stdin)["state"])')"
echo "[host] task ${task_id}: ${state}"
if [[ "${state}" == "SUCCEEDED" ]]; then
return 0
fi
if [[ "${state}" == "FAILED" || "${state}" == "CANCELED" ]]; then
echo "[host] terminal=${state}; tail logs (best-effort):" >&2
curl -sS -H "Authorization: Bearer ${token}" "${API_ADDR}/api/v2/tasks/${task_id}/logs?tail=200" >&2 || true
return 1
fi
sleep 10
done
}
echo "[host] ===== run_all_v30_api.sh begin ====="
"${SCRIPT_DIR}/00_prereq_check.sh"
"${SCRIPT_DIR}/03_cleanup_v1_legacy.sh"
"${SCRIPT_DIR}/04_cleanup_v2_legacy.sh"
echo "[host] bring down existing containers (best-effort)"
"${SCRIPT_DIR}/02_down.sh" || true
if [[ "${RESET_SFTPGO}" == "1" ]]; then
echo "[host] reset sftpgo metadata dir (best-effort, via helper container)"
SFTPGO_META_DIR="${ROOT_DIR}/../../shared/common/sftpgo"
mkdir -p "${SFTPGO_META_DIR}"
docker run --rm --entrypoint sh -u 0:0 -v "${SFTPGO_META_DIR}:/mnt" drakkan/sftpgo:latest -lc "rm -rf /mnt/* || true"
fi
echo "[host] (re)create containers (Ray + SFTPGo)"
"${SCRIPT_DIR}/01_up.sh"
echo "[host] wait ray head ready"
ray_wait_ready 60
echo "[host] wait sftpgo ready"
sftpgo_wait_ready 60 "http://127.0.0.1:8081/api/v2/token"
echo "[host] render v3.0 config with SFTPGo container IP (work around docker DNS issues)"
SFTPGO_IP="$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' argus-sftpgo)"
RENDERED_CFG_HOST_PATH="/tmp/dev_v30_rendered.yaml"
sed -E "s#^(\\s*admin_api_base:) .*#\\1 \"http://${SFTPGO_IP}:8080/api/v2\"#g" "${ROOT_DIR}/configs/dev_v30.yaml" >"${RENDERED_CFG_HOST_PATH}"
docker cp "${RENDERED_CFG_HOST_PATH}" "${HEAD_CONTAINER}:/tmp/dev_v30_rendered.yaml"
CONFIG_IN_CONTAINER="/tmp/dev_v30_rendered.yaml"
echo "[host] verify head discovery record (supervised in-container)"
HEAD_IP_FILE="${SHARED_ROOT}/ray/discovery/${CLUSTER_NAME}/head.json"
dexec "${HEAD_CONTAINER}" bash -lc "test -f '${HEAD_IP_FILE}' && python3 -c 'import json,sys; print(json.load(open(sys.argv[1]))[\"job_server_url\"])' '${HEAD_IP_FILE}' || true"
echo "[host] wait for workers to join"
ray_wait_nodes "${EXPECTED_RAY_NODES}" 120
echo "[host] prepare data/model (idempotent; reuse cache)"
"${SCRIPT_DIR}/30_prepare_data_and_model.sh"
echo "[host] install api deps in head container"
"${SCRIPT_DIR}/12_install_api_deps.sh"
echo "[host] stop api (best-effort)"
"${SCRIPT_DIR}/61_stop_api.sh" || true
if [[ "${RESET_DB}" == "1" ]]; then
echo "[host] reset api sqlite db in container (best-effort)"
docker exec -i "${HEAD_CONTAINER}" bash -lc "rm -f /private/common/db/mvp.sqlite3 /private/common/db/mvp.sqlite3-wal /private/common/db/mvp.sqlite3-shm || true"
else
echo "[host] keep existing api sqlite db (RESET_DB=${RESET_DB})"
fi
echo "[host] start api (admin token + sftpgo admin password via env)"
MVP_INTERNAL_TOKEN="${ADMIN_TOKEN}" CONFIG_IN_CONTAINER="${CONFIG_IN_CONTAINER}" SFTPGO_ADMIN_PASSWORD="${SFTPGO_ADMIN_PASSWORD}" "${SCRIPT_DIR}/60_start_api.sh"
api_wait_ready 60
echo "[host] create user (expect SFTP one-time password in response)"
create_resp="$(api_curl_admin -H "Content-Type: application/json" -d "{\"user_id\":\"${USER_ID}\"}" "${API_ADDR}/api/v2/users")"
echo "[host] create_user_resp: ${create_resp}"
USER_SFTP_PASSWORD="$(printf '%s' "${create_resp}" | python3 -c 'import sys,json; o=json.load(sys.stdin); print((o.get("sftp") or {}).get("password") or "")')"
if [[ -z "${USER_SFTP_PASSWORD}" ]]; then
echo "ERROR: expected sftp.password in create user response (is data.sftpgo.enabled=true?)" >&2
exit 1
fi
echo "[host] issue user token"
USER_TOKEN="$(api_curl_admin -X POST "${API_ADDR}/api/v2/users/${USER_ID}/tokens" | python3 -c 'import sys,json; print(json.load(sys.stdin)["token"])')"
echo "[host] user_token_issued: user=${USER_ID}"
echo "[host] (optional) verify SFTP login (best-effort)"
if command -v sshpass >/dev/null 2>&1 && command -v sftp >/dev/null 2>&1; then
tmp_batch="$(mktemp)"
cat >"${tmp_batch}" <<EOF
pwd
ls
EOF
# NOTE: this just proves auth works; dataset upload can be done via SFTP manually later.
sshpass -p "${USER_SFTP_PASSWORD}" sftp -o StrictHostKeyChecking=no -P 2022 "${USER_ID}@127.0.0.1" -b "${tmp_batch}" >/dev/null 2>&1 || true
rm -f "${tmp_batch}" || true
else
echo "[host] skip: sshpass/sftp not found; you can test manually with: sftp -P 2022 ${USER_ID}@<host>"
fi
echo "[host] ensure user dataset paths exist: copy from common inside head container (best-effort; avoids host permission issues)"
dexec "${HEAD_CONTAINER}" bash -lc "set -euo pipefail; \
mkdir -p '/private/users/${USER_ID}/datasets/gsm8k' '/private/users/${USER_ID}/datasets/gsm8k_sft'; \
(cp -f /private/common/datasets/gsm8k/train.parquet '/private/users/${USER_ID}/datasets/gsm8k/train.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k/train.parquet '/private/users/${USER_ID}/datasets/gsm8k/train.parquet' 2>/dev/null || true); \
(cp -f /private/common/datasets/gsm8k/test.parquet '/private/users/${USER_ID}/datasets/gsm8k/test.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k/test.parquet '/private/users/${USER_ID}/datasets/gsm8k/test.parquet' 2>/dev/null || true); \
(cp -f /private/common/datasets/gsm8k_sft/train.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/train.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k_sft/train.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/train.parquet' 2>/dev/null || true); \
(cp -f /private/common/datasets/gsm8k_sft/test.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/test.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k_sft/test.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/test.parquet' 2>/dev/null || true)"
echo "[host] submit PPO/GRPO/SFT via API using user dataset paths"
PPO_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: ppo\nnnodes: 2\nn_gpus_per_node: 4\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/test.parquet\nmodel_id: Qwen/Qwen2.5-0.5B-Instruct\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')"
GRPO_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: grpo\nnnodes: 2\nn_gpus_per_node: 4\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/test.parquet\nmodel_id: Qwen/Qwen2.5-0.5B-Instruct\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')"
SFT_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: sft\nnnodes: 1\nn_gpus_per_node: 1\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k_sft/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k_sft/test.parquet\nmodel_id: Qwen/Qwen2.5-0.5B-Instruct\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')"
echo "[host] submitted task ids:"
echo " ppo=${PPO_TASK_ID}"
echo " grpo=${GRPO_TASK_ID}"
echo " sft=${SFT_TASK_ID}"
echo "[host] wait for tasks (in submission order)"
wait_task "${USER_TOKEN}" "${PPO_TASK_ID}"
wait_task "${USER_TOKEN}" "${GRPO_TASK_ID}"
wait_task "${USER_TOKEN}" "${SFT_TASK_ID}"
echo "[host] ===== run_all_v30_api.sh done ====="

@@ -0,0 +1,19 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Minimal E2E runner for v3.0:
# - Happy path: Ray + SFTPGo + API + 3 workloads
# - Leaves retention validation as a manual follow-up (adjust thresholds).
ADMIN_TOKEN="${MVP_INTERNAL_TOKEN:-my-dev-token}"
USER_ID="${USER_ID:-alice}"
echo "[case] HP-1: run_all_v30_api.sh (Ray + SFTPGo + API + 3 workloads)"
MVP_INTERNAL_TOKEN="${ADMIN_TOKEN}" USER_ID="${USER_ID}" RESET_DB=1 RESET_SFTPGO=0 "${SCRIPT_DIR}/run_all_v30_api.sh"
echo "[case] NOTE: retention validation (C2) is manual:"
echo " - set data.retention.jobs_trash_after_days / jobs_purge_after_days to small values in configs/dev_v30.yaml"
echo " - restart API server and wait for janitor"