v3.0 development and verification complete; meets the WebUI and minimal data-management requirements. The task templates are still fairly basic and should be extended before the release is opened for general use.

This commit is contained in:
yuyr 2025-12-30 17:29:50 +08:00
parent 7b89d60b3b
commit 6d3fefc7a6
30 changed files with 3764 additions and 16 deletions

Three binary image files added (previews not shown; sizes 76 KiB, 52 KiB, 65 KiB).
15
specs/mvp/v3.0/README.md Normal file
View File

@ -0,0 +1,15 @@
# MVP v3.0 Design — WebUI + User Data Upload/Download (SFTPGo) → First Releasable Version
This directory is based on:
- `specs/mvp/mvp_roadmap_v2.md` (overall roadmap)
- `specs/mvp/image/roadmap_v3.0.png` (v3.0 iteration diagram)
- the already-delivered v2.5 (User Mgmt + Stateless Ray Node Pool)
Building on v2.5, the goal is to complete the **user data loop** (upload → visible to training → artifact download) plus a minimal usable **WebUI**, forming a releasable v3.0.
Documents:
- `specs/mvp/v3.0/v3.0_design.md`: overall architecture and key mechanisms (WebUI, SFTPGo, data/permission model, task flow).
- `specs/mvp/v3.0/v3.0_api.md`: v3.0 API extension design (UI, data, SFTPGo administration, permission constraints).
- `specs/mvp/v3.0/v3.0_acceptance.md`: deployment/upgrade/acceptance procedure and verifiable criteria (including fault injection and the regression checklist).
- `specs/mvp/v3.0/v3.0_dev_plan.md`: TDD-driven engineering plan (milestone breakdown, test layering, E2E acceptance).
- `specs/mvp/v3.0/v3.0_progress.md`: implementation progress log (a record is appended after each milestone).

View File

@ -0,0 +1,55 @@
# MVP v3.0 — Deployment and Acceptance Procedure (Draft)
## 0) Environment prerequisites
- Ray cluster: same as v2.5, head + stateless workers (auto join)
- Shared storage: mounted at `/private` inside containers (dev/prod aligned)
- API server: code mounted from the host into the head container and started inside it
- New: SFTPGo service (containerized deployment recommended)
## 1) Deployment steps (high level)
1) Deploy/upgrade the Ray node image (reuse `argus/argus-ray-node:v2.5` or later)
2) Start the Ray cluster (compose, or containers created by the platform)
3) Start/configure SFTPGo (mount `/private`)
4) Start the API server (inside the head container)
5) Start the WebUI (hosted by the API server)
## 2) Acceptance cases (must pass)
### A. Users and credentials
1) admin creates user `alice` and issues an API token
2) The system creates `alice` in SFTPGo (home=/private/users/alice)
3) `alice` logs in to the WebUI with the token (or calls `/api/v2/me` successfully)
### B. Upload data loop (core)
1) `alice` uploads a dataset via SFTP to `/private/users/alice/datasets/...`
2) `alice` submits a task via WebUI/API whose TaskSpec references that path
3) A Ray worker reads the data; the task reaches RUNNING and eventually SUCCEEDED
### C. Artifact download loop
1) After training completes, artifacts land in `/private/users/alice/jobs/<submission_id>/...`
2) `alice` downloads checkpoints/logs via SFTP successfully
3) (New) `alice` moves weights that need long-term retention from `jobs/<submission_id>/...` to `models/` and confirms they persist after the move
### C2. Jobs trash and automatic cleanup (moved to trash after 3 days, permanently deleted 7 days later)
1) Set `jobs_trash_after_days`/`jobs_purge_after_days` to small values (e.g. minutes, for verification)
2) Training completes and enters a terminal state
3) After the API server's built-in janitor scan interval elapses, confirm `jobs/<submission_id>` is moved to `trash/jobs/<submission_id>`
4) Within the trash window, move a file from `trash/jobs/<submission_id>` to `models/` and confirm the move succeeds
5) After `jobs_purge_after_days` elapses, confirm `trash/jobs/<submission_id>` is permanently deleted
6) Confirm files moved to `models/` are not deleted
### D. Security isolation (must pass)
1) `bob` cannot query `alice`'s tasks via the API (404)
2) `bob` cannot submit a TaskSpec referencing `/private/users/alice/...` (400/403)
3) `bob` cannot access `/private/users/alice/...` via SFTP (chroot is effective)
## 3) Fault injection (recommended to pass)
1) kill the worker watchdog or raylet → the worker recovers automatically and rejoins the cluster
2) restart the head container → head rewrites `head.json`, workers reconnect automatically
3) restart SFTPGo → the Ray cluster is unaffected; users can reconnect to upload/download
## 4) Regression checklist (same as v2.5)
- Task queue and retry (INSUFFICIENT_RESOURCES → PENDING_RESOURCES → retry)
- PPO/GRPO/SFT workloads all run to completion
- head runs no training (drivers are forced onto workers)

109
specs/mvp/v3.0/v3.0_api.md Normal file
View File

@ -0,0 +1,109 @@
# MVP v3.0 — API Extension Design (based on v2.5)
The v3.0 principle: **reuse the v2.5 API wherever possible** and only add the minimal endpoints needed for the data loop and WebUI support.
## 1) Authentication and authorization
Carried over from v2.5:
- Header: `Authorization: Bearer <token>`
- admin token: from `MVP_INTERNAL_TOKEN`
- regular user tokens: issued by admin and persisted in SQLite
Authorization rules:
- Non-admin users can only access their own tasks and their own data space (`/private/users/<user_id>/...`).
- Cross-user access returns 404 (existence is not leaked).
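For illustration, a minimal authenticated client call could look like the following sketch (the base URL and the token value are placeholders, not fixed values):
```python
# Minimal sketch of an authenticated API call (host, port and token are placeholders).
import json
import urllib.request

API_BASE = "http://127.0.0.1:8080"      # assumed dev address of the API server
TOKEN = "mvp_u_alice_xxxxxxxx"          # placeholder user token issued by admin

req = urllib.request.Request(
    f"{API_BASE}/api/v2/me",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
with urllib.request.urlopen(req) as resp:
    me = json.load(resp)
print(me["user_id"], me["paths"]["datasets"])
```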
## 2) Users and SFTPGo integration (admin endpoints)
### 2.1 Create user (reused from v2.5)
`POST /api/v2/users`
- v3.0 behavior: on success, **optionally** create the corresponding SFTPGo user
- v3.0 enables the integration by default: create the SFTPGo user + generate a one-time password (password auth)
- v3.0 keeps only this option (scheme A); no external authentication/SSO integration (left to later versions)
- `data.sftpgo.admin_api_base` is recommended to look like `http://argus-sftpgo:8080/api/v2` (including the `/api/v2` prefix)
### 2.2 Issue token (reused from v2.5)
`POST /api/v2/users/{user_id}/tokens`
### 2.3 Disable user (reused from v2.5)
`POST /api/v2/users/{user_id}:disable`
- v3.0 behavior: also disable the SFTPGo user (optional)
### 2.4 SFTP credential management (new; admin or user self-service)
(Whether v3.0 needs user self-service or admin-only operation is for you to confirm.)
#### Reset SFTP password (admin)
`POST /api/v2/users/{user_id}/sftp:reset_password`
- Returns: a one-time password (returned only once; the server does not store the plaintext)
> v3.0 only implements the password scheme; SSH public keys are an optional later enhancement (out of scope for v3.0).
## 3) User self-service info (new)
### 3.1 Get current user info
`GET /api/v2/me`
- Example response:
```json
{
"user_id": "alice",
"is_admin": false,
"paths": {
"home": "/private/users/alice",
"datasets": "/private/users/alice/datasets",
"models": "/private/users/alice/models",
"code": "/private/users/alice/code",
"jobs": "/private/users/alice/jobs",
"trash_jobs": "/private/users/alice/trash/jobs"
},
"retention": {
"jobs_trash_after_days": 3,
"jobs_purge_after_days": 7
},
"sftp": {
"host": "h1.example.internal",
"port": 2022,
"username": "alice"
}
}
```
### 3.2 Jobs retention hints (new)
To support the WebUI display and manage user expectations, the following can be returned from `/api/v2/me` or a dedicated endpoint:
- `jobs_trash_after_days`: default 3
- `jobs_purge_after_days`: default 7
- `jobs_root`: `/private/users/<me>/jobs`
- `trash_jobs_root`: `/private/users/<me>/trash/jobs`
- `recommendations`: remind users to move artifacts worth keeping long-term into `models/` or `datasets/`
## 4) Data browsing/download (optional; minimal in v3.0)
Note: SFTP remains the primary upload/download channel.
If the WebUI needs quick browse/view, read-only endpoints can be implemented (avoiding large-file upload/resume complexity).
### 4.1 List a directory
`GET /api/v2/files?path=/private/users/alice`
- Authorization: `path` must be under `/private/common/` or `/private/users/<me>/`
- Returns a file list (name/type/size/mtime)
### 4.2 Download a file (small files)
`GET /api/v2/files:download?path=/private/users/alice/jobs/.../logs/...`
- Returns: a streamed download
- Large files should still go through SFTP
## 5) TaskSpec path validation upgrade (key for v3.0)
v2.5: only `/private/common/...` is allowed.
v3.0: allow `/private/common/...` and `/private/users/<me>/...`.
Applies to at least these fields:
- `train_file` / `val_file`
- `code_path`: still `/private/common/...` only (v3.0 does not execute user code)
- local model path fields (if introduced): allow `/private/users/<me>/models/...`
## 6) WebUI routes (new)
Hosted by the API server:
- `GET /ui`: main page
- `GET /ui/login`: token login page
- static assets: `/ui/static/...`
All WebUI operations call the same-origin API (no extra CORS).

View File

@ -0,0 +1,358 @@
# MVP v3.0 Detailed Design (based on v2.5)
## 0. Executive summary (what v3.0 delivers)
v3.0 = v2.5 + **WebUI** + **user data upload/download (SFTPGo)**, forming the first externally releasable version:
- Users can upload data/models/code (at least data) via **SFTP**; it lands on GPFS (`/private` inside containers) and is visible to Ray workers.
- Users can submit training tasks via the API/WebUI; tasks read the data the user uploaded.
- Users can download training artifacts (checkpoints/logs etc.); the minimal loop works end to end.
## 1. Scope and principles
### 1.1 Preconditions inherited from v2.5 (no regressions)
- **Stateless Ray Node Pool**: the head writes `head.json`; the worker watchdog joins/self-heals automatically.
- **User Management**: token auth, task visibility isolation (cross-user access returns 404 without leaking existence).
- **Job artifact isolation**: Ray job directories land in `/private/users/<user_id>/jobs/<ray_submission_id>/...`.
- **API server short-term run mode**: code lives on the host, is mounted into the head container, and is started inside it (status quo).
### 1.2 New goals for v3.0
1) **Data Management (SFTPGo)**
   - Provide a user upload/download entry point (SFTP first).
   - Data lands on GPFS (NFS/GPFS in dev, GPFS in production); training jobs read it directly inside worker containers.
2) **WebUI**
   - Users can visually create tasks, view the queue/status/logs, and see the data path conventions plus their own SFTP info.
   - The goal is "usable, not fancy": support the core workflow.
3) **Permission loop**
   - Users may only reference data under their own directory (`/private/users/<user_id>/...`) or the common directory (`/private/common/...`).
   - Prevent tasks from reading other users' file paths.
### 1.3 Explicitly out of scope for v3.0 (deferred to v3.5)
- No custom reward functions / custom verl code / multiple coexisting verl versions (roadmap v3.5).
- No complex serving or unified training-and-inference (roadmap v3.5).
- No IB network/topology optimization (roadmap v3.5).
- No system-level observability platform (roadmap v4.0).
## 2. Architecture overview
See `roadmap_v3.0.png` for the v3.0 control plane and data plane:
### 2.1 Control plane
- **API Server (FastAPI)**
  - keeps reusing the v2.5 task queue/scheduling/retry and user management
  - new: data management (SFTPGo integration) + WebUI
- **WebUI**
  - logs in through the API with a token
  - provides task/log/data entry points (does not run training itself)
- **Ray Head (stateful node)**
  - still inside the head container (or a dedicated node)
  - the job server/dashboard provides job submit/status/logs
### 2.2 Data plane
- **GPFS (mounted at `/private` inside containers)**
  - holds the two root trees: common and users
- **Ray worker nodes (stateless)**
  - connect to the head automatically and run the training
  - read data under `/private/users/<user>/...`
### 2.3 New component: SFTPGo (Data Management)
- Runs as an independent service (containerized preferred); the storage backend is **filesystem** (the GPFS mount path).
- Each user's home directory points to `/private/users/<user_id>` (or a subdirectory of it).
## 3. Storage and directory conventions (unified for v3.0)
### 3.1 Directory layout
Everything is rooted at `/private` inside containers (dev/prod aligned):
- `/private/common/`: shared resources
  - `hf/`: HF cache
  - `datasets/`: shared datasets (optional)
  - `code/`: shared code (e.g. the shared verl repo snapshot)
  - `db/`: SQLite (queue, users, tokens)
  - `logs/`: API/supervisor/watchdog logs
- `/private/users/<user_id>/`: user space (the focus of v3.0)
  - `datasets/`: user-uploaded datasets (recommended)
  - `models/`: models the user saved/uploaded (allowed; also where job artifacts get moved for long-term retention)
  - `code/`: user-uploaded code (**not executed** in v3.0; storage/download only)
  - `jobs/`: training artifacts (already delivered in v2.5)
  - `tmp/`: temporary files (optional)
### 3.2 Jobs retention (two stages: move to trash after 3 days, purge 7 days later)
v3.0 introduces a **two-stage retention policy for the jobs directory**:
- Stage 1 (soft delete): **3 days** after a job finishes, its directory is **moved from `jobs/` into the user's trash directory**.
- Stage 2 (hard delete): **7 days** after entering the trash directory, it is **permanently deleted**.
Directory conventions (suggested):
- jobs root: `/private/users/<user_id>/jobs/<ray_submission_id>/...`
- trash: `/private/users/<user_id>/trash/jobs/<ray_submission_id>/...`
Timing rules:
- The clock starts when the job enters a terminal state (SUCCEEDED/FAILED/CANCELED).
- The "3 days" govern the move from `jobs/` into `trash/jobs/`;
- the "7 days" govern permanent deletion from `trash/jobs/` (so a window of at most 10 days in total).
How users keep important artifacts (no keep flag needed):
- within the 3-day window, **move/copy** files worth keeping from `jobs/<submission_id>/...` into `models/` (e.g. weights) or `datasets/` (e.g. evaluation outputs);
- even after the directory has been moved to trash, the user can still move files out of `trash/jobs/<submission_id>/...` into `models/` / `datasets/` within the 7-day window;
- the janitor only manages `jobs/` and `trash/jobs/`; it never touches `models/` or `datasets/`.
We call this cleanup program the **janitor**:
- Definition: a background cleanup executor that periodically scans for finished-and-expired job directories and removes them.
- v3.0 target: implement exactly the "move to trash after 3 days + delete after 7 more days" product rule (no keep/extend-retention flag).
Implementation recommendation (per your preference; a minimal sketch follows this list):
- Run the **janitor as a background thread inside the API server**:
  - Pros: it naturally has access to SQLite (task state, end time, user_id, ray_submission_id) and can write cleanup results back into the events table for auditing.
  - Simpler deployment: no extra cronjob or standalone service.
- Deletion/move should **operate directly on the GPFS/NFS filesystem** (the API server runs in the head container with `/private` mounted):
  - Stage 1: `os.rename` (atomic within one filesystem) moves `jobs/<sid>` to `trash/jobs/<sid>`.
    - If the move ever crosses filesystems (which should not happen), fall back to copy+delete.
    - Validate the path prefix strictly before moving (must be under `.../users/<u>/jobs/`).
  - Stage 2: recursively delete `trash/jobs/<sid>` (e.g. `shutil.rmtree`), again with a prefix check (must be under `.../users/<u>/trash/jobs/`).
- Why not rely on the SFTPGo API: SFTPGo is only the user-facing access layer (SFTP/Web); the directories live on the same filesystem, so direct filesystem operations are simpler and do not depend on SFTPGo being online.
- If you strongly prefer deleting through the SFTPGo API:
  - it can be an optional/supplementary mechanism (e.g. for unified auditing or future quota/policy integration), but it should not be the only one (an SFTPGo outage must not block cleanup).
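A minimal sketch of the two stages under these assumptions (helper and parameter names are illustrative; the real janitor lands in M4):
```python
# Illustrative sketch of the two janitor stages (not the final implementation).
import os
import shutil

def _assert_under(path: str, required_prefix: str) -> None:
    # Strict prefix check before any destructive operation.
    real = os.path.realpath(path)
    if not real.startswith(os.path.realpath(required_prefix).rstrip("/") + "/"):
        raise ValueError(f"refusing to touch path outside {required_prefix}: {path}")

def move_job_to_trash(user_root: str, user_id: str, sid: str) -> None:
    src = f"{user_root}/{user_id}/jobs/{sid}"
    dst = f"{user_root}/{user_id}/trash/jobs/{sid}"
    _assert_under(src, f"{user_root}/{user_id}/jobs/")
    if not os.path.isdir(src):
        return  # already moved or never existed: idempotent
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    os.rename(src, dst)  # atomic on the same filesystem

def purge_trashed_job(user_root: str, user_id: str, sid: str) -> None:
    path = f"{user_root}/{user_id}/trash/jobs/{sid}"
    _assert_under(path, f"{user_root}/{user_id}/trash/jobs/")
    shutil.rmtree(path, ignore_errors=True)  # best-effort recursive delete
```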
### 3.3 Users moving/organizing files inside SFTPGo (confirmed)
Users may move/rename/organize files in SFTPGo (e.g. move weights from `jobs/` to `models/`):
- Prerequisite: the SFTPGo user permissions allow `rename/mkdir/remove` etc. within the home directory (writable by default in v3.0).
- Behavior: users can move files from `jobs/` into `models/` or `datasets/` to keep weights, evaluation outputs, etc. long-term.
- Relation to retention: once a file has been moved out of `jobs/`, the jobs cleanup logic no longer deletes it.
### 3.4 Path permission rules (validated on the API side)
The v2.5 constraint was "only `/private/common/...`".
v3.0 upgrades it to:
- Allowed:
  - `/private/common/...`
  - `/private/users/<current_user_id>/...`
- Forbidden:
  - any other absolute path (e.g. `/private/users/other/...`, `/etc/...`)
Apply the rule to at least these TaskSpec fields (a validation sketch follows this list):
- `train_file` / `val_file`
- `code_path`: still `/private/common/...` only (v3.0 does not execute user code)
- local model path field: allow `/private/users/<me>/models/...` (confirmed: allowed in v3.0)
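A sketch of the resulting check, mirroring the submit-time validation added in this release (`shared_root` is assumed to default to `/private`):
```python
# Sketch of the v3.0 TaskSpec path policy (mirrors the submit-time checks in app.py).
def check_paths(user_id: str, code_path: str, train_file: str | None,
                val_file: str | None, model_id: str, shared_root: str = "/private") -> None:
    root = shared_root.rstrip("/")
    common = f"{root}/common/"
    mine = f"{root}/users/{user_id}/"
    if not code_path.startswith(common):
        raise ValueError(f"code_path must start with {common}")
    for name, value in (("train_file", train_file), ("val_file", val_file)):
        if value is None:
            continue
        if not (value.startswith(f"{common}datasets/") or value.startswith(f"{mine}datasets/")):
            raise ValueError(f"{name} must be under common/ or the user's own datasets/")
    if model_id.startswith(f"{root}/"):
        # Local model path: only common/models or the user's own models directory.
        if not (model_id.startswith(f"{common}models/") or model_id.startswith(f"{mine}models/")):
            raise ValueError("local model_id must be under a models/ directory")
    # Otherwise model_id is treated as a HuggingFace repo id (e.g. "Qwen/...").
```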
## 4. SFTPGo design (Data Management)
### 4.1 Runtime form
Run SFTPGo in a container (decoupled from Ray/API), mounting the same `/private`:
- the `sftpgo` container mounts `../../shared:/private`
- exposed ports:
  - SFTP (2022 suggested)
  - WebAdmin/API (8081 suggested; intranet/admin access only)
#### 4.1.1 Image source (off-the-shelf Docker image)
SFTPGo ships ready-to-use Docker images (no need to build our own):
- v3.0 should preferably use the official/upstream `sftpgo` image as the runtime base
- we do not need to customize SFTPGo code in v3.0; we only need to:
  - mount GPFS/NFS correctly (`/private` inside the container)
  - configure the admin account (used by the API server to create/disable users and reset passwords)
  - configure per-user home/chroot
> Note: the exact image name/tag can differ between environments (official/registry policies change). When deploying, run `docker search sftpgo` on `argus@h1` or pin a version from your internal registry; the v3.0 design only requires "use an off-the-shelf image" and does not depend on a specific tag.
#### 4.1.2 docker-compose service draft (illustrative)
An **illustrative** snippet (final image name/tag and ports depend on your environment):
```yaml
services:
  sftpgo:
    image: sftpgo/sftpgo:latest   # example: use an off-the-shelf image
    container_name: argus-sftpgo
    ports:
      - "2022:2022"   # SFTP
      - "8081:8080"   # WebAdmin/API (intranet/admin access only recommended)
    volumes:
      - ../../shared:/private
      - ../../shared/common/sftpgo:/var/lib/sftpgo   # persist SFTPGo metadata (optional/recommended)
    environment:
      # admin account/password (illustrative; exact variable names per the image docs)
      SFTPGO_ADMIN_USERNAME: "admin"
      SFTPGO_ADMIN_PASSWORD: "${SFTPGO_ADMIN_PASSWORD}"
```
Integration points with v3.0:
- the API server uses `data.sftpgo.admin_api_base` plus the admin credentials to create users
- every user's home/chroot points to `/private/users/<user_id>`
### 4.2 User isolation
Each SFTPGo user's home dir is bound to:
- `/private/users/<user_id>` (chroot): the user can only read/write their own directory.
### 4.3 User creation and credential management (two options; start with A)
**Option A (recommended for v3.0): the API server creates the SFTPGo user**
- After a successful v2.5 `POST /api/v2/users`:
  - the API server calls the SFTPGo admin API to create a user with the same name
  - sets home dir = `/private/users/<user_id>`
  - sets permissions (writable by default; read-only can be made configurable)
- Authentication:
  - v3.0 minimum viable: username + password (confirmed: password first in v3.0; the API generates a one-time password and the user is asked to change it on first login)
  - or: SSH public key (the WebUI lets users upload a public key; the API writes it into SFTPGo)
**Option B (stronger but more complex): SFTPGo external authentication**
- SFTPGo delegates authentication to the API server (token/SSO); SFTP also uses internal tokens.
- High complexity; skip it in v3.0 (defer to v3.5 or later). A minimal admin-client sketch for option A follows.
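A minimal sketch of the option-A admin client. It assumes the upstream SFTPGo REST API (`GET /api/v2/token` with basic auth, `POST /api/v2/users`); field names follow the upstream OpenAPI and should be verified against the deployed version:
```python
# Minimal sketch of an option-A admin client against the SFTPGo REST API.
# Endpoint and field names follow the upstream OpenAPI (verify against the deployed version).
import base64
import json
import urllib.request

class SFTPGoAdmin:
    def __init__(self, api_base: str, admin_user: str, admin_password: str) -> None:
        self.api_base = api_base.rstrip("/")          # e.g. http://argus-sftpgo:8080/api/v2
        self._basic = base64.b64encode(f"{admin_user}:{admin_password}".encode()).decode()

    def _token(self) -> str:
        req = urllib.request.Request(
            f"{self.api_base}/token",
            headers={"Authorization": f"Basic {self._basic}"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["access_token"]

    def create_user(self, username: str, password: str, home_dir: str) -> None:
        payload = {
            "username": username,
            "password": password,
            "home_dir": home_dir,                     # e.g. /private/users/<user_id>
            "status": 1,                              # 1 = enabled
            "permissions": {"/": ["*"]},              # full access within the chrooted home
        }
        req = urllib.request.Request(
            f"{self.api_base}/users",
            data=json.dumps(payload).encode(),
            method="POST",
            headers={
                "Authorization": f"Bearer {self._token()}",
                "Content-Type": "application/json",
            },
        )
        urllib.request.urlopen(req).close()
```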
### 4.4 Upload/download experience
Users upload via SFTP:
- `datasets/...` (training data)
- `models/...` (local models, optional)
Users download:
- `jobs/<ray_submission_id>/...` (checkpoints/logs)
The WebUI/docs explain how to reference these paths in a TaskSpec.
## 5. WebUI design (minimum viable)
### 5.1 Target pages
The v3.0 WebUI uses **multiple sub-pages plus a side navigation bar** rather than cramming everything into one page:
- Rationale: information density stays manageable, it can grow later (v3.5+), and no single page turns into a giant form/list.
- The implementation stays lightweight: server-side rendering (or static HTML + a little JS); no heavy frontend tooling.
Suggested information architecture (IA):
1) **Login page** (`/ui/login`)
   - the user pastes a token (issued by admin); the browser stores it (localStorage/sessionStorage)
   - provides "log out / clear token"
2) **Task list page** (`/ui/tasks`)
   - default list: the most recent N tasks (sorted by created_at descending)
   - filters: workload, state (QUEUED/RUNNING/SUCCEEDED/FAILED/CANCELED), time range
   - quick actions: open details, cancel task
3) **New task page** (`/ui/tasks/new`)
   - two modes (either is acceptable):
     - **direct YAML submission**: upload/paste a TaskSpec YAML (least development effort)
     - **form-generated YAML**: pick a workload, fill in the core fields (train/val/model/nnodes/gpus), preview the generated YAML, then submit
   - after submission, redirect to the task detail page
4) **Task detail page** (`/ui/tasks/{task_id}`)
   - header: task_id, workload, state, created_at, updated_at, error_summary
   - attempt card: latest attempt_no, ray_submission_id, ray_status, start/end
   - actions: cancel (if not terminal), refresh status, copy paths/IDs
   - links to the log page and artifact hints (SFTP paths)
5) **Task log page** (`/ui/tasks/{task_id}/logs`)
   - default tail=2000 (options 200/1000/5000)
   - an "auto refresh (every 3–5 s)" toggle (simple polling is fine)
6) **Data page** (`/ui/data`)
   - shows SFTP connection info (host/port/username)
   - shows the user directory conventions:
     - home: `/private/users/<user_id>`
     - datasets: `/private/users/<user_id>/datasets`
     - models: `/private/users/<user_id>/models`
     - jobs: `/private/users/<user_id>/jobs`
     - trash/jobs: `/private/users/<user_id>/trash/jobs`
   - states the retention policy clearly: jobs move to trash 3 days after completion and are deleted from trash after 7 more days; move important files into `models/` or `datasets/`
7) **(Admin only) user management page** (`/ui/admin/users`, optional but valuable)
   - create users, disable users, issue tokens, reset SFTP passwords (option A)
### 5.2 Page organization and navigation (suggested)
Sidebar navigation (regular users):
- Tasks (list)
- New Task
- Data (SFTP/directory notes)
Admins additionally see:
- Admin / Users
### 5.3 Rough wireframes
A rough sketch (not the final UI; it only conveys information structure and layout):
```
┌──────────────────────────────────────────────────────────────────────┐
│ Argus MVP v3.0 [user: alice] │
├───────────────┬──────────────────────────────────────────────────────┤
│ Side Nav │ /ui/tasks │
│ │ │
│ • Tasks │ [Filter] workload=all state=all [Search task_id] │
│ • New Task │ │
│ • Data │ Task List │
│ • Admin(*) │ ┌────────────────────────────────────────────────┐ │
│ │ │ task_id workload state ... │ │
│ │ │ mvp2-alice-ppo-... ppo RUNNING ... │ │
│ │ │ mvp2-alice-sft-... sft SUCCEEDED... │ │
│ │ └────────────────────────────────────────────────┘ │
│ │ [View] [Cancel] │
└───────────────┴──────────────────────────────────────────────────────┘
```
Task detail page (sketch):
```
┌──────────────────────────────────────────────────────────────────────┐
│ /ui/tasks/{task_id} │
├──────────────────────────────────────────────────────────────────────┤
│ task_id: mvp2-alice-ppo-... state: RUNNING workload: ppo │
│ created_at: ... updated_at: ... │
│ error_summary: (empty) │
│ │
│ latest_attempt: a01 ray_submission_id: ...--a01 ray_status: RUNNING │
│ [Open Logs] [Cancel Task] [Refresh] │
│ │
│ Artifacts (SFTP paths): │
│ jobs/: /private/users/alice/jobs/<ray_submission_id>/ │
│ trash/: /private/users/alice/trash/jobs/<ray_submission_id>/ │
│ tip: move important files to /private/users/alice/models/ │
└──────────────────────────────────────────────────────────────────────┘
```
### 5.4 Technical choices (recommendation: no Node build)
To keep deployment simple, the v3.0 WebUI should be built with "server-side rendering + a little JS/HTMX" or "plain static HTML + fetch":
- the API server serves the static assets (FastAPI StaticFiles)
- pages call the same-origin API, avoiding CORS and a complex frontend build chain. A minimal hosting sketch follows.
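A minimal hosting sketch under these choices (route names follow the design above; the assets directory name is an assumption):
```python
# Minimal sketch of hosting the WebUI from the API server (same origin, no Node build).
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from fastapi.staticfiles import StaticFiles

app = FastAPI()
app.mount("/ui/static", StaticFiles(directory="ui_static"), name="ui-static")  # assumed assets dir

@app.get("/ui/login", response_class=HTMLResponse)
def ui_login() -> str:
    # The page stores the pasted token in localStorage and attaches it to fetch() calls.
    return """<html><body>
      <input id="token" placeholder="paste your API token"/>
      <button onclick="localStorage.setItem('mvp_token', document.getElementById('token').value)">
        Save token
      </button>
    </body></html>"""
```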
## 6. API extension design (overview)
v3.0 keeps `/api/v2/...` unchanged and adds incrementally:
- SFTPGo admin integration endpoints:
  - create/disable users with SFTPGo sync
  - reset SFTP password / update SSH key
- user data endpoints (optional, minimal):
  - `/api/v2/me`: returns user_id and SFTP info (host/port/home)
  - `/api/v2/files`: browse/download only (uploads still go through SFTP)
Details in `specs/mvp/v3.0/v3.0_api.md`.
## 7. Configuration and deployment (new in v3.0)
Extend `configs/dev.yaml` with a `data` section (illustrative):
```yaml
data:
  shared_root: "/private"              # usually identical to ray.shared_root
  user_root: "/private/users"          # root of the user space
  allow_common_prefix: "/private/common/"
  allow_user_prefix_template: "/private/users/{user_id}/"
  sftpgo:
    enabled: true
    host: "127.0.0.1"
    sftp_port: 2022
    admin_api_base: "http://127.0.0.1:8081/api/v2"
    admin_user: "admin"
    admin_password_env: "SFTPGO_ADMIN_PASSWORD"   # readable only inside the head container
  retention:
    jobs_trash_after_days: 3
    jobs_purge_after_days: 7
    trash_root_template: "/private/users/{user_id}/trash/jobs"
    janitor_interval_s: 3600   # scan once per hour (configurable)
```
## 8. Risks and mitigations
1) **Path escape / unauthorized reads**
   - path prefixes must be validated when tasks are submitted through the API
   - SFTPGo must chroot users to their home directory
2) **Large-file upload stability**
   - prefer SFTP (better resumability/reliability)
3) **Lifecycle of user tokens vs. SFTP credentials**
   - tokens follow the v2.5 SQLite flow
   - SFTP credentials stay separate (password/SSH key) with a reset flow
4) **GPFS/NFS permissions**
   - `/private/users/<user>` must be writable by SFTPGo and readable by workers
## 9. Confirmed decisions (from your feedback)
1) Users may upload and train on custom datasets: allowed (`/private/users/<u>/datasets/...`).
2) Users may upload and train from local model paths: allowed (`/private/users/<u>/models/...`).
3) v3.0 does not execute user-defined code (no `PYTHONPATH` injection as an executable code path).
4) SFTPGo authentication: password first in v3.0.
5) WebUI: keep to the minimal necessary feature set (token-paste login first).
## 10. Open questions (now resolved)
Confirmed: the jobs cleanup executor in v3.0 is the **janitor background thread built into the API server**.

View File

@ -0,0 +1,232 @@
# MVP v3.0 Development Plan (TDD-driven)
This is the **engineering development plan** for v3.0. It emphasizes writing tests before implementation (TDD) and splits each milestone into a **small, independently verifiable loop**.
Inputs:
- Roadmap: `specs/mvp/mvp_roadmap_v2.md`
- v3.0 design: `specs/mvp/v3.0/v3.0_design.md`
- v3.0 API: `specs/mvp/v3.0/v3.0_api.md`
- v3.0 acceptance: `specs/mvp/v3.0/v3.0_acceptance.md`
- Baseline: v2.5 (task queue + user mgmt + stateless ray pool + single-image node supervisor)
Confirmed constraints for v3.0:
- user dataset paths allowed: `/private/users/<me>/datasets/...`
- user local model paths allowed: `/private/users/<me>/models/...`
- **no execution of user-defined code** (no user code injected into PYTHONPATH; `code_path` still only allows `/private/common/...`)
- SFTPGo uses the **password** scheme first (option A: the API creates/manages SFTPGo users)
- jobs retention: **move to the trash (trash/jobs) after 3 days, permanently delete after another 7 days**; no keep/extend-retention flag
- janitor: a **background thread inside the API server**; moves/deletes operate **directly on the filesystem** (no dependency on the SFTPGo API)
---
## 0. TDD conventions (apply to every feature)
### 0.1 Test layers
1) **Unit tests (fast)**
   - pure Python logic: path policy, SFTPGo client, retention computation, file move/delete policy (with temp dirs).
   - no real Ray, no docker, no network.
2) **Component tests (medium)**
   - FastAPI routes (including the WebUI routes): `fastapi.testclient.TestClient`
   - mock/stub the SFTPGo client and the ray client
3) **End-to-end (slow)**
   - on `argus@h1` via docker compose + scripts
   - the Ray cluster comes up automatically (head + 2 workers)
   - the SFTPGo service is available
   - upload data → submit training → download artifacts → jobs trash/cleanup
### 0.2 Code and test conventions
- test directory: `src/mvp/py/tests/`
- a new feature first gets test cases that fail against the unimplemented code (red)
- the minimal implementation turns them green (green)
- then refactor (refactor)
- coverage: keep the current threshold (>= 90%)
---
## 1. Milestone breakdown (v3.0 = 5 verifiable loops)
### M1: TaskSpec path policy upgrade (allow user datasets/models; code_path remains common-only)
**Goal**
- Upgrade the submit-time path validation from v2.5's "only `/private/common/`" to:
  - `train_file` / `val_file`: allow `/private/common/...` or `/private/users/<me>/...`
  - local model paths: allow `/private/users/<me>/models/...` (without changing the YAML structure; see the implementation note)
  - `code_path`: still `/private/common/...` only
- Block unauthorized paths (`/private/users/other/...`) and anything outside `/private/...`.
**Implementation note (do not extend TaskSpec)**
- keep the `model_id` field unchanged:
  - if `model_id` starts with `/private/`, treat it as a local model path
  - otherwise treat it as a HuggingFace repo id (`Qwen/...`)
**TDD cases (write tests first)**
- unit tests:
  - `test_paths_allow_common_and_own_user_prefix()`
  - `test_paths_reject_other_user_prefix()`
  - `test_model_id_local_path_allowed_only_under_users_models()`
  - `test_code_path_still_common_only()`
- API tests:
  - `test_submit_accepts_user_datasets_paths()`
  - `test_submit_rejects_cross_user_paths_404_or_400()` (returns 400/403 per convention)
**Acceptance**
- The class-D security isolation cases in `v3.0_acceptance.md` are covered by the API tests; a pytest sketch follows.
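A sketch of two of the M1 tests; `validate_task_paths` is a hypothetical helper standing in for the real submit-time validation, and the dataset/model values are placeholders:
```python
# Sketch of two M1 path-policy tests; validate_task_paths and its module are hypothetical.
import pytest

from argus_sketch.path_policy import validate_task_paths  # hypothetical module


def test_paths_allow_common_and_own_user_prefix():
    validate_task_paths(
        user_id="alice",
        code_path="/private/common/code/verl",
        train_file="/private/users/alice/datasets/gsm8k/train.parquet",
        val_file="/private/common/datasets/gsm8k/val.parquet",
        model_id="Qwen/placeholder-model",
    )  # must not raise


def test_paths_reject_other_user_prefix():
    with pytest.raises(ValueError):
        validate_task_paths(
            user_id="bob",
            code_path="/private/common/code/verl",
            train_file="/private/users/alice/datasets/gsm8k/train.parquet",
            val_file=None,
            model_id="Qwen/placeholder-model",
        )
```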
---
### M2: SFTPGo integration (option A: user sync + password)
**Goal**
- Introduce data management (SFTPGo):
  - when admin creates a user, also create the SFTPGo user (home=/private/users/<user_id>, chroot)
  - in password mode, generate a one-time password (reset/create) and return it to admin (plaintext returned only once)
- Provide user self-service info:
  - `GET /api/v2/me` returns SFTP connection info, the directory conventions, and retention hints.
**Implementation notes**
- add `SFTPGoAdminClient` (synchronous calls):
  - use `urllib` or `httpx` (prefer `urllib` to minimize dependencies; do not hard-wire `requests`)
  - supports create user / disable user / reset password (minimal set)
- on startup, the API server validates the config (when enabled, the admin password env var must be present)
- create the user directory structure on the filesystem at the same time:
  - `/private/users/<u>/{datasets,models,code,jobs,trash/jobs}` (minimal set); a directory-bootstrap sketch follows.
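A minimal sketch of that directory bootstrap (it mirrors the `_ensure_user_dirs` helper added in this commit):
```python
# Sketch of the per-user directory bootstrap (mirrors _ensure_user_dirs in app.py).
import os

def ensure_user_dirs(user_root: str, user_id: str) -> None:
    home = f"{user_root.rstrip('/')}/{user_id}"
    for rel in ("datasets", "models", "code", "jobs", "trash/jobs"):
        os.makedirs(f"{home}/{rel}", exist_ok=True)  # idempotent: safe to call on every create
```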
**TDD cases (write tests first)**
- unit tests:
  - `test_sftpgo_client_builds_correct_requests()` (no real network; mock urlopen)
  - `test_user_dirs_created_on_user_create()` (tmp dir; assert the directories exist)
- API tests:
  - `test_create_user_calls_sftpgo_client()` (stub client; assert the call arguments)
  - `test_me_returns_sftp_info_and_paths()` (including trash/jobs and the TTL fields)
**Acceptance**
- Covers class A (users/credentials) and the prerequisites of class B (upload loop) in `v3.0_acceptance.md`.
---
### M3: WebUI (minimum viable, multi-page + sidebar)
**Goal**
- WebUI hosted by the API server (same origin, no extra CORS):
  - `/ui/login`: token-paste login (localStorage)
  - `/ui/tasks`: task list + minimal filters
  - `/ui/tasks/new`: YAML submission (priority) + optional form-generated YAML
  - `/ui/tasks/{task_id}`: detail page
  - `/ui/tasks/{task_id}/logs`: log tail + optional auto-refresh
  - `/ui/data`: SFTP info + directory/retention hints
  - (optional) `/ui/admin/users`: admin user management (strongly recommended if time allows)
**Implementation notes**
- no Node build for now:
  - HTML templates via simple string assembly or Jinja2 (if jinja2 is introduced, add the dependency and tests)
  - pages call `/api/v2/...` via fetch and reuse the token header
**TDD cases (write tests first)**
- component tests (TestClient):
  - `test_ui_routes_render_200()`
  - `test_ui_contains_sidebar_links()` (simple assertion that the navigation links appear in the page)
  - `test_ui_tasks_detail_shows_ids()` (contains task_id, state, ray_submission_id)
**Acceptance**
- The WebUI covers: login → create task → view task → view logs → see the data page hints.
---
### M4: Jobs retention janitor (move to trash after 3 days, purge after 7)
**Goal**
- a janitor background thread inside the API server:
  - periodically scans terminal tasks in the DB
  - when due, performs:
    - move: `/private/users/<u>/jobs/<sid>` → `/private/users/<u>/trash/jobs/<sid>`
    - purge: recursively delete `/private/users/<u>/trash/jobs/<sid>`
  - strict path validation throughout; never delete outside the allowed roots
  - cleanup actions are recorded in the DB events table (audit)
**Implementation notes (data and state)**
- needs a stable time anchor and idempotency:
  - use attempts.end_time of the latest attempt as the job end time
  - add fields to the tasks table (or a new table) to record:
    - `trashed_at` (time of the first successful move)
    - `purged_at` (time of successful deletion)
    - `trash_path` (optional)
  - idempotent: repeated runs must not fail (a missing directory counts as already processed)
**TDD cases (write tests first; a tmpdir-based sketch follows this list)**
- unit tests (build jobs/trash directories in a tmpdir):
  - `test_janitor_moves_job_to_trash_after_threshold()`
  - `test_janitor_purges_trash_after_threshold()`
  - `test_janitor_never_touches_models_or_datasets()`
  - `test_janitor_path_escape_rejected()` (malicious paths must not be deletable)
- API/component tests:
  - `test_me_includes_retention_fields()` (jobs_trash_after_days/jobs_purge_after_days)
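A sketch of one such test using pytest's `tmp_path`; `move_expired_jobs_to_trash` is a hypothetical wrapper around the janitor's first stage:
```python
# Sketch of an M4 unit test; move_expired_jobs_to_trash and its module are hypothetical.
from argus_sketch.janitor import move_expired_jobs_to_trash  # hypothetical module


def test_janitor_moves_job_to_trash_after_threshold(tmp_path):
    jobs = tmp_path / "users" / "alice" / "jobs" / "sid-001"
    jobs.mkdir(parents=True)
    (jobs / "ckpt.bin").write_text("weights")

    # Assume the attempt ended long enough ago that the 3-day threshold has passed.
    move_expired_jobs_to_trash(
        user_root=str(tmp_path / "users"),
        user_id="alice",
        ray_submission_id="sid-001",
    )

    trashed = tmp_path / "users" / "alice" / "trash" / "jobs" / "sid-001"
    assert not jobs.exists()
    assert (trashed / "ckpt.bin").read_text() == "weights"
```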
**Acceptance**
- The C2 case in `v3.0_acceptance.md` can be verified by lowering the thresholds to minute level.
---
### M5: End-to-end (h1) — SFTP upload → training → artifact download → trash/cleanup
**Goal**
- Land a one-click script (or runbook) on `argus@h1` that runs:
  1) `docker compose up -d` brings up Ray (head + 2 workers) + SFTPGo
  2) admin creates user alice (SFTPGo user + password created in sync)
  3) alice uploads via SFTP:
     - a dataset to `/private/users/alice/datasets/...`
     - (optional) a local model to `/private/users/alice/models/...`
  4) alice submits a task via API/WebUI referencing those paths
  5) after the task succeeds:
     - download logs/checkpoints from `jobs/<sid>`
     - move the weights into `models/` and verify they are not cleaned up
  6) shrink the retention config and verify jobs → trash → purge
**Deliverables**
- new scripts (naming examples):
  - `scripts/run_all_v30_api.sh`
  - `scripts/run_e2e_v30_cases.sh`
- a `sftpgo` service added to `docker-compose.yaml` (or a `docker-compose.v30.yaml` overlay)
**Acceptance**
- All MUST cases in `v3.0_acceptance.md` pass.
---
## 2. Risks and test focus
1) **Permissions and path escape**
   - the path policy must cover train/val/model_id(local)/output dirs (jobs/trash)
   - every delete/move must pass a prefix check
2) **Concurrency and races**
   - the janitor handles terminal tasks only, so it never cleans a directory that is still being written to
   - moves use same-filesystem `os.replace` (atomic)
3) **SFTPGo availability**
   - SFTPGo being offline must not affect training or core API functionality (apart from the user-creation sync)
   - the janitor does not depend on SFTPGo (direct filesystem access)
---
## 3. Deliverables (code/config/scripts/docs)
### 3.1 Code
- path policy (v3.0)
- SFTPGoAdminClient + create/disable/reset-password user sync
- `/api/v2/me` extension (SFTP/paths/retention)
- WebUI routes and static assets
- janitor (trash + purge): background thread + DB records
### 3.2 Config
- add the `data.sftpgo` and `data.retention` sections to `configs/dev.yaml` (see the design doc)
### 3.3 scripts / compose
- add `sftpgo` to compose (or a new overlay compose file)
- v3.0 e2e scripts (upload/download/cleanup verification)
### 3.4 Docs
- update `specs/mvp/v3.0/*` and `src/mvp/README.md` (how to run, path conventions, SFTP usage, retention explanation)

View File

@ -0,0 +1,154 @@
# MVP v3.0 Progress Log (milestone log)
This document records milestone completion while implementing v3.0 per `specs/mvp/v3.0/v3.0_dev_plan.md`.
Convention: after each milestone, append one record containing the **date**, **work completed**, **files touched**, **verification method/result**, and **TODOs/risks**.
---
## M1: Path policy + tests (done)
- Date: 2025-12-30
- Scope: upgrade the API submit-time path validation to the v3.0 path policy (without extending the TaskSpec YAML structure).
- Completed:
  - `code_path`: still only `/private/common/...` (v3.0 does not execute user code).
  - `train_file`/`val_file`: allow `/private/common/datasets/...` and `/private/users/<me>/datasets/...`.
  - `model_id`: if it starts with `/private/`, treat it as a local path and allow only:
    - `/private/common/models/...`
    - `/private/users/<me>/models/...`
    otherwise it is still treated as a HuggingFace repo id (`Qwen/...`).
  - Reject cross-user paths (e.g. `bob` submitting `/private/users/alice/datasets/...`).
  - Reject local model paths outside `models/` (e.g. pointing into `jobs/`).
- Files touched:
  - `src/mvp/py/argus/service/app.py`
  - `src/mvp/py/tests/test_users.py`
- Verification and result:
  - local unit tests: `.venv/bin/python -m pytest -q`
  - result: all passing (`54 passed`); the coverage threshold stays at `>= 90%`.
- TODOs/risks:
  - the "local model path" semantics of `model_id=/private/...` must be called out in the user docs/WebUI to avoid misuse.
  - M2/M3 must propagate this path policy into the UI forms/hints so users do not enter wrong paths.
---
## M2: SFTPGo integration (option A: user sync + password) (done)
- Date: 2025-12-30
- Scope: minimal SFTPGo (data management) integration + user self-service `/api/v2/me` + user directory structure on disk.
- Completed:
  - new `data` config section:
    - `data.user_root`: user data root (default `/private/users`)
    - `data.sftpgo`: optional SFTPGo integration (enabled/host/sftp_port/admin_api_base/admin_user/admin_password_env)
    - `data.retention`: jobs expiry policy (move to trash after 3 days, purge after 7; the janitor lands in M4)
  - new `SFTPGoAdminClient` (`urllib` implementation, no `requests`):
    - `create_user` / `disable_user` / `reset_password` (minimal set)
  - API server enhancements:
    - `POST /api/v2/users`: creates the DB user and the directory structure (`datasets/models/code/jobs/trash/jobs`)
    - when `data.sftpgo.enabled=true`, user creation also calls the SFTPGo admin API and returns a one-time password (plaintext returned once; not stored server-side)
    - `POST /api/v2/users/{user_id}:disable`: disables the user (the SFTPGo disable is best-effort)
    - `POST /api/v2/users/{user_id}/sftp:reset_password`: admin resets the one-time password (only allowed when SFTPGo is enabled)
    - `GET /api/v2/me`: returns the current user's directory conventions, retention hints and (optionally) SFTP connection info
  - updated `src/mvp/configs/dev.yaml` with the v3.0 `data.*` settings (sftpgo disabled by default)
- Files touched:
  - `src/mvp/py/argus/service/config.py`
  - `src/mvp/py/argus/service/sftpgo.py`
  - `src/mvp/py/argus/service/app.py`
  - `src/mvp/py/tests/test_sftpgo.py`
  - `src/mvp/py/tests/test_users.py`
  - `src/mvp/py/tests/test_app.py`
  - `src/mvp/py/tests/test_service_config.py`
  - `src/mvp/configs/dev.yaml`
  - `specs/mvp/v3.0/v3.0_api.md`
- Verification and result:
  - local unit tests: `.venv/bin/python -m pytest -q`
  - result: all passing (`62 passed`), coverage `90.11%` (threshold `>= 90%`).
- TODOs/risks:
  - M2 only covers the API-side integration plus unit tests; no end-to-end verification against a real SFTPGo container yet (planned for M5).
  - directory creation depends on filesystem permissions: in production the API/head container must be able to write `/private/users`.
---
## M3: WebUI (minimum viable, multi-page + sidebar) (done)
- Date: 2025-12-30
- Scope: a minimal WebUI hosted by the API server (same origin, no Node build) for login, submitting/viewing tasks and logs, and viewing data info.
- Completed:
  - new UI routes (HTML + a little JS):
    - `/ui` (redirects to tasks)
    - `/ui/login` (paste a token; stored in browser localStorage under the key `mvp_token`)
    - `/ui/tasks`: task queue list (calls `/api/v2/queue`)
    - `/ui/tasks/new`: submit a TaskSpec YAML (POST `/api/v2/tasks`)
    - `/ui/tasks/{task_id}`: task details (GET `/api/v2/tasks/{task_id}`, supports cancel)
    - `/ui/tasks/{task_id}/logs`: log view (GET `/api/v2/tasks/{task_id}/logs`, optional auto-refresh)
    - `/ui/data`: shows the paths/SFTP/retention info returned by `/api/v2/me`
  - unified sidebar navigation (Tasks / New Task / Data / Login).
  - the UI keeps no server-side session; every API call carries `Authorization: Bearer <token>`, injected by the browser from localStorage.
- Files touched:
  - `src/mvp/py/argus/service/ui.py`
  - `src/mvp/py/argus/service/app.py`
  - `src/mvp/py/tests/test_ui.py`
- Verification and result:
  - local unit tests: `.venv/bin/python -m pytest -q`
  - result: all passing (`65 passed`), coverage `90.53%` (threshold `>= 90%`).
- TODOs/risks:
  - the WebUI is currently a skeleton driven by the API; no complex interaction or large-file download; upload/download stays on SFTP (by design).
  - the Starlette TestClient emits an `allow_redirects` deprecation warning (harmless; clean up later).
---
## M4: Jobs retention janitor (move to trash after 3 days, purge after 7) (done)
- Date: 2025-12-30
- Scope: a background thread inside the API server applies the retention policy to the job directories of finished attempts (direct filesystem access, no SFTPGo dependency).
- Completed:
  - new `JobsJanitor`:
    - the TTL is computed from `attempts.end_time` (measured from job completion)
    - `>= 3 days && < 7 days`: move the directory from `.../jobs/<ray_submission_id>` to `.../trash/jobs/<ray_submission_id>`
    - `>= 7 days`: ensure the directory has entered the trash, then delete it (`shutil.rmtree`)
    - missing directories and failed moves/deletes are handled best-effort and never break the main service
  - DB enhancement: new query `list_ended_attempts_before()` used by the janitor to find candidate attempts.
  - the API server starts the janitor thread at startup (controlled by `data.retention.janitor_interval_s`; a value <= 0 disables it).
- Files touched:
  - `src/mvp/py/argus/service/janitor.py`
  - `src/mvp/py/argus/service/db.py`
  - `src/mvp/py/argus/service/app.py`
  - `src/mvp/py/tests/test_janitor.py`
- Verification and result:
  - local unit tests: `.venv/bin/python -m pytest -q`
  - result: all passing (`75 passed`), coverage `90.72%` (threshold `>= 90%`).
- TODOs/risks:
  - M4 covers only logic and unit tests; real `/private/users/...` permissions and behavior on `argus@h1` are verified in M5 (end-to-end).
---
## M5: End-to-end (h1) — SFTPGo compose + v3.0 E2E scripts (done: scripts/config delivered)
- Date: 2025-12-30
- Scope: deliver the compose service, config and one-click scripts needed for the h1 end-to-end run (the actual run/acceptance is executed by you on `argus@h1`).
- Completed:
  - SFTPGo integrated into `docker compose`:
    - new `argus-sftpgo` service (SFTP 2022; Admin API/UI 8080 → host 8081 to avoid the MVP API on 8080)
    - mounts the same `../../shared:/private` and persists metadata under `../../shared/common/sftpgo`
  - SFTPGoAdminClient implemented against the upstream OpenAPI:
    - `GET /api/v2/token` (BasicAuth) to obtain an admin token
    - `POST /api/v2/users` to create users (with `permissions: {"/":["*"]}`)
    - `PUT /api/v2/users/{username}` to disable users / reset passwords
  - new v3.0 dev config: `configs/dev_v30.yaml` (enables `data.sftpgo` with `admin_api_base=http://argus-sftpgo:8080/api/v2`)
  - new v3.0 one-click scripts:
    - `scripts/run_all_v30_api.sh`: brings up Ray + SFTPGo, starts the API, creates a user and submits PPO/GRPO/SFT (referencing user dataset paths)
    - `scripts/run_e2e_v30_cases.sh`: minimal E2E runner (HP-1)
  - API startup script enhancement: `scripts/60_start_api.sh` passes `SFTPGO_ADMIN_PASSWORD` through to the API process inside the head container.
- Files touched:
  - `src/mvp/docker-compose.yaml`
  - `src/mvp/configs/dev_v30.yaml`
  - `src/mvp/scripts/run_all_v30_api.sh`
  - `src/mvp/scripts/run_e2e_v30_cases.sh`
  - `src/mvp/scripts/60_start_api.sh`
  - `src/mvp/py/argus/service/sftpgo.py`
  - `src/mvp/py/tests/test_sftpgo.py`
  - `src/mvp/README.md`
  - `specs/mvp/v3.0/v3.0_api.md`
- Verification and result:
  - local unit tests: `.venv/bin/python -m pytest -q`
  - result: all passing (`75 passed`), coverage `90.35%` (threshold `>= 90%`).
- TODOs/risks:
  - you still need to run `scripts/run_all_v30_api.sh` on `argus@h1` to complete the real SFTP upload/download and retention acceptance (per `v3.0_acceptance.md`).

View File

@ -0,0 +1,166 @@
# MVP v3.0 Iteration Summary (Ray + SFTPGo + API + WebUI)
This document summarizes the features, architecture, how to run, acceptance points and known limitations finally delivered in the v3.0 iteration, for review, handover and further iteration.
More detailed documents:
- `specs/mvp/v3.0/v3.0_design.md`
- `specs/mvp/v3.0/v3.0_api.md`
- `specs/mvp/v3.0/v3.0_dev_plan.md`
- `specs/mvp/v3.0/v3.0_acceptance.md`
- `specs/mvp/v3.0/v3.0_progress.md`
---
## 1. Goal and scope
As the first "releasable" minimal loop, v3.0 mainly adds:
- **WebUI**: a minimal usable interface (login, task submission and viewing, data entry point, admin entry point).
- **User management**: a token-based user system (admin and regular users) with user creation and token issuance.
- **Data management entry point (SFTPGo)**: users upload/download their own data via SFTP/WebClient; the shared data/cache directory (common) is exposed read-only for reuse.
- **Training loop kept intact**: jobs are still submitted to the cluster via Ray Job (PPO/GRPO/SFT workloads all verified).
Explicitly out of scope (the iteration stays minimal):
- no user-defined training code (the TaskSpec `code_path` keeps using the verl snapshot under common).
- no sophisticated queueing optimization / multi-cluster / multi-tenant isolation policies (isolation is mainly at the per-user jobs directory level).
---
## 2. System architecture (final form)
Core components:
- **Ray cluster (containers)**
  - `argus-ray-head`: head node (no GPU, runs no training), provides the Ray Dashboard and Job Server.
  - `argus-ray-worker-0/1`: worker nodes (with GPUs) that run the training tasks.
  - workers join the cluster as "stateless + watchdog auto-connect to head".
- **API server (runs inside the head container)**
  - reads the YAML config (dev/prod), maintains the task queue (sqlite) and periodically schedules submissions to Ray.
  - also hosts the WebUI (`/ui`).
- **SFTPGo (container)**
  - provides SFTP (port `2022`) and a Web Client/Admin (port `8081`, mapped to container 8080).
  - user home is `/private/users/<user>`, writable by default.
  - additionally exposes read-only `/common/*` shares (see section 4).
- **Shared storage (NFS/GPFS etc., mounted at `/private` inside containers)**
  - `/private/common`: shared caches (hf, datasets, models, db, logs, ...).
  - `/private/users/<user>`: per-user directories (jobs/datasets/models/code/trash, ...).
---
## 3. Tasks and scheduling (Task / Ray Job)
### 3.1 Task (platform concept)
- Users submit a TaskSpec (YAML) to the API; the platform assigns a `task_id` (human-readable, includes the username).
- `task_id` maps to the internal state machine and retry logic; each underlying Ray Job submission produces an attempt and a `ray_submission_id`.
### 3.2 Ray Job (Ray concept)
- The driver that actually runs training executes as a Ray Job on cluster workers (so the head carries no training).
- The head avoids being scheduled onto by `--num-cpus=0` / custom resources and similar measures.
### 3.3 Handling VERL's resource pre-check
- When creating resource pools, VERL fails fast if resources are insufficient (e.g. "not enough available GPUs" errors out immediately).
- v3.0 keeps the v2.x strategy: the server recognizes the failure reason and retries/backs off per policy (see the scheduler implementation and the v2.5/3.0 docs).
---
## 4. Data management (SFTPGo) and the read-only common directory
### 4.1 User directory (read-write)
- Users access their home `/private/users/<user>` via SFTP/WebClient.
- Directory structure (at least): `datasets/ models/ code/ jobs/ trash/ common/`.
### 4.2 Read-only common (option A: Virtual Folder)
This iteration uses SFTPGo Virtual Folders plus per-path permission overrides so users can read the shared directories but not write them.
Exposed as:
- `/common/datasets` (read-only)
  - **mapped_path points to the real directory `/private/datasets`** (avoiding the WebClient "permission denied / escape" problems caused by the many symlinks under `/private/common/datasets`)
- `/common/hf` (read-only)
  - mapped_path points to `/private/hf`
Note:
- `/private/common/datasets` contains symlinks (e.g. `gsm8k -> /private/datasets/gsm8k`); if the virtual folder maps to the symlink root, SFTPGo treats the symlink jump as escaping the root and the WebClient reports "permission denied" on entry, so the virtual folders map directly to the real directory roots. A hedged example of such a user payload follows.
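For reference, a hedged sketch of what the reconciled SFTPGo user object can look like; field names follow the upstream SFTPGo REST API and should be verified against the deployed version:
```python
# Hedged sketch of the reconciled SFTPGo user object (field names follow the upstream
# SFTPGo REST API; verify against the deployed version before relying on them).
sftpgo_user = {
    "username": "alice",
    "home_dir": "/private/users/alice",
    "status": 1,                                   # enabled
    "permissions": {
        "/": ["*"],                                # full access inside the chrooted home
        "/common": ["list", "download"],           # read-only overrides for the shared mounts
        "/common/datasets": ["list", "download"],
        "/common/hf": ["list", "download"],
    },
    "virtual_folders": [
        {   # mapped to the real directory root to avoid the symlink-escape issue above
            "name": "common-datasets",
            "mapped_path": "/private/datasets",
            "virtual_path": "/common/datasets",
        },
        {
            "name": "common-hf",
            "mapped_path": "/private/hf",
            "virtual_path": "/common/hf",
        },
    ],
}
```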
---
## 5. WebUI (minimal)
Entry points:
- `/ui/login`: paste a token (stored in browser `localStorage`)
- `/ui/tasks`: task list (Running/Pending/Completed; Completed supports pagination)
- `/ui/tasks/new`: submit a task (one-click fill-in for the PPO/GRPO/SFT samples)
- `/ui/data`: shows the current username, supports resetting and copying the SFTPGo password, links to the SFTPGo WebClient, and gives hints for clients such as FileZilla
- `/ui/admin`: admin entry point (create users, issue tokens, user list)
- the navigation bar links to the Ray Dashboard (`:8265` on the current IP)
About admin page authorization:
- the admin page itself is reachable, but its data requests must carry an admin token; otherwise the page shows 401/403/errors (satisfying "content is visible only after an admin token is provided").
---
## 6. API (new/strengthened in v3.0)
Core endpoints (selection):
- Authentication:
  - Bearer token: `MVP_INTERNAL_TOKEN` (admin) or a user token (issued by admin)
- User management (admin):
  - `POST /api/v2/users` creates a user (and initializes the user directories)
  - `GET /api/v2/users` lists users (including the latest token and created/updated timestamps)
  - `POST /api/v2/users/{user_id}/tokens` issues a user token
- Tasks:
  - `POST /api/v2/tasks` submits a TaskSpec (YAML)
  - `GET /api/v2/tasks` lists tasks (supports states/limit/offset, used for Completed pagination)
  - `GET /api/v2/tasks/{task_id}`, `POST /api/v2/tasks/{task_id}:cancel`, `GET /api/v2/tasks/{task_id}/logs`
  - `GET /api/v2/queue` (running/pending overview)
- Data/SFTP:
  - `GET /api/v2/me` returns the user's path info and SFTP connection info, and best-effort reconciles the SFTPGo user configuration
  - `POST /api/v2/me/sftp:reset_password` lets the user reset their own SFTPGo password (plaintext returned once)
Security trade-off (currently intranet/dev first):
- to let the Admin WebUI "view and copy tokens", the database persists `token_plain` (the plaintext token).
- this is generally not recommended in production; later this can become "reset/re-issue only" without echoing the plaintext, or echo it only once.
---
## 7. Persistence and cleanup
- task queue: sqlite (WAL mode)
- SFTPGo: its own sqlite db (persisted via the mounted metadata directory)
- jobs directory cleanup (server-side janitor):
  - 3 days after a job ends, its directory moves to the trash directory
  - the trash copy is kept for another 7 days and then deleted
---
## 8. How to run and scripts
Development/acceptance scripts:
- `src/mvp/scripts/run_all_v30_api.sh`: end-to-end bring-up (Ray + SFTPGo + API), PPO/GRPO/SFT submission via the API, waiting for completion and verifying
- other scripts start/stop the API, prepare data and models, probe service readiness, etc. (see the scripts directory and the README)
Typical end-to-end (example parameters):
- `MVP_INTERNAL_TOKEN=my-dev-token`
- `SFTPGO_ADMIN_PASSWORD=my-dev-sftpgo-admin`
- `RESET_DB/RESET_SFTPGO` are supported for resetting test environments
---
## 9. Verification results (completed)
End-to-end verification was completed in the `argus@h1` environment:
- Ray cluster available (head + 2 workers)
- API server + WebUI available
- SFTPGo (admin + regular user) available
- PPO/GRPO/SFT tasks submitted back-to-back via the API all complete (SUCCEEDED)
- users can log in to the SFTPGo WebClient/SFTP, access their own directory, and read the read-only content under `/common/datasets` and `/common/hf`
Local unit tests also pass:
- pytest fully green
- coverage threshold >= 90%
---
## 10. Known limitations & future improvements
- the WebUI is a minimal version; interaction and permission messaging are still "engineering-grade" rather than product-grade (better error messages, search/filtering, aggregated task details can come later).
- persisting plaintext tokens is only acceptable for intranet/dev; production should switch to one-time display or support revocation/rotation.
- the SFTPGo virtual folders still carry legacy mappings (e.g. a leftover `/common/models`); a one-off cleanup/migration can be added to the upgrade script later.

View File

@ -16,3 +16,11 @@
- CLI submission flow: `scripts/run_all_cli.sh`
- API submission flow: `scripts/run_all_api.sh`
- v2.5 (stateless workers + per-user jobs isolation) E2E: `scripts/run_all_v25_api.sh`
- v3.0 (WebUI + SFTPGo + user datasets/models + jobs retention) E2E: `scripts/run_all_v30_api.sh`
v3.0 access points (dev/h1):
- WebUI: `http://127.0.0.1:8080/ui`
- Ray Dashboard: `http://127.0.0.1:8265`
- SFTPGo:
  - SFTP: `127.0.0.1:2022`
  - Admin API/UI: `http://127.0.0.1:8081` (container port 8080, mapped to host 8081 to avoid clashing with the API server)

View File

@ -32,3 +32,24 @@ service:
retry_interval_s: 60
max_running_tasks: 1
# v3.0: user data management (filesystem + SFTPGo)
data:
# All user writable data is placed under this root:
# /private/users/<user_id>/{datasets,models,code,jobs,trash/jobs}
user_root: "/private/users"
# SFTPGo is optional in dev; when enabled, admin endpoints will call SFTPGo admin API.
# Admin password is provided by env var `data.sftpgo.admin_password_env`.
sftpgo:
enabled: false
host: "" # shown to users via GET /api/v2/me
sftp_port: 2022
admin_api_base: "" # e.g. http://argus-sftpgo:8080
admin_user: "admin"
admin_password_env: "SFTPGO_ADMIN_PASSWORD"
# Jobs retention policy (v3.0 janitor): move to trash after 3d, purge after 7d.
retention:
jobs_trash_after_days: 3
jobs_purge_after_days: 7
janitor_interval_s: 3600

View File

@ -0,0 +1,51 @@
ray:
# Ray Job server address (as seen from inside the head container)
address: "http://127.0.0.1:8265"
# Shared root path (uniformly /private inside containers, aligned with production)
shared_root: "/private"
# Force the driver onto workers (the head runs no training)
entrypoint_num_cpus: 1
entrypoint_resources:
worker_node: 1
# Runtime env shared by all jobs
runtime_env:
env_vars:
HF_ENDPOINT: "https://hf-mirror.com"
PYTHONUNBUFFERED: "1"
# v3.0 does not yet support executing user code
user_code_path: "/private/user/code"
service:
api:
host: "0.0.0.0"
port: 8080
auth:
token_env: "MVP_INTERNAL_TOKEN"
sqlite:
db_path: "/private/common/db/mvp.sqlite3"
scheduler:
tick_s: 5
retry_interval_s: 60
max_running_tasks: 1
data:
user_root: "/private/users"
sftpgo:
enabled: true
# Returned to users by GET /api/v2/me. For h1 E2E, usually connect to the host IP.
host: "127.0.0.1"
sftp_port: 2022
# Admin API base should include /api/v2 (SFTPGo OpenAPI server base).
# From head container, access SFTPGo by service name on the compose network.
admin_api_base: "http://argus-sftpgo:8080/api/v2"
admin_user: "admin"
admin_password_env: "SFTPGO_ADMIN_PASSWORD"
retention:
jobs_trash_after_days: 3
jobs_purge_after_days: 7
janitor_interval_s: 3600

View File

@ -38,6 +38,36 @@ services:
HF_ENDPOINT: "https://hf-mirror.com"
PYTHONUNBUFFERED: "1"
# v3.0: Data management service (SFTPGo).
# - SFTP: 2022
# - Admin API/UI: 8080 (mapped to host 8081 to avoid collision with MVP API server on 8080)
#
# NOTE: This is for dev / h1 E2E. In prod you may use an internal mirror image/tag and different ports.
sftpgo:
image: drakkan/sftpgo:latest
container_name: argus-sftpgo
user: "0:0"
ports:
- "2022:2022"
- "8081:8080"
volumes:
- ../../shared:/private
- ../../shared/common/sftpgo:/var/lib/sftpgo
networks:
- argus-ray-net
environment:
# Create a default admin on first start (used by API server to manage users).
# Override on host as needed:
# export SFTPGO_ADMIN_PASSWORD=...
SFTPGO_DATA_PROVIDER__CREATE_DEFAULT_ADMIN: "true"
# Persist the sqlite DB under the mounted metadata dir; otherwise it defaults to a relative path.
SFTPGO_DATA_PROVIDER__NAME: "/var/lib/sftpgo/sftpgo.db"
SFTPGO_DEFAULT_ADMIN_USERNAME: "admin"
SFTPGO_DEFAULT_ADMIN_PASSWORD: "${SFTPGO_ADMIN_PASSWORD:-my-dev-sftpgo-admin}"
# Explicitly pin default ports via env-var schema (double-underscore for nesting).
SFTPGO_HTTPD__BINDINGS__0__PORT: "8080"
SFTPGO_SFTPD__BINDINGS__0__PORT: "2022"
ray_worker_0:
image: argus/argus-ray-node:v2.5
container_name: argus-ray-worker-0

View File

@ -1,6 +1,7 @@
from __future__ import annotations
import os
import secrets
import threading
from typing import Any
@ -12,7 +13,10 @@ from argus.ray.models import JobSpec, RayConfig
from .config import V2Config
from .db import Db
from .janitor import JobsJanitor
from .scheduler import Scheduler
from .sftpgo import SFTPGoAdminClient, SFTPGoError
from .ui import register_ui_routes
def _utc_now_iso() -> str:
@ -38,11 +42,48 @@ def create_app(config_path: str) -> FastAPI:
db.init()
scheduler = Scheduler(db=db, ray_cfg=ray_cfg, v2_cfg=v2_cfg)
janitor = JobsJanitor(
db=db,
user_root=v2_cfg.data.user_root,
trash_after_days=v2_cfg.data.retention.jobs_trash_after_days,
purge_after_days=v2_cfg.data.retention.jobs_purge_after_days,
interval_s=v2_cfg.data.retention.janitor_interval_s,
)
stop_flag = threading.Event()
tool = scheduler.tool
app = FastAPI(title="mvp-v2", version="2.0")
def _user_home(user_id: str) -> str:
base = v2_cfg.data.user_root.rstrip("/")
return f"{base}/{user_id}"
def _ensure_user_dirs(user_id: str) -> None:
home = _user_home(user_id)
# Note: create a physical /common directory under each user's home so that
# SFTPGo WebClient shows a "common" entry at the root. The actual shared
# content is exposed via SFTPGo virtual folders under /common/{datasets,models}.
for rel in ("datasets", "models", "code", "jobs", "trash/jobs", "common"):
os.makedirs(f"{home}/{rel}", exist_ok=True)
def _sftpgo_enabled() -> bool:
return bool(v2_cfg.data.sftpgo.enabled)
def _sftpgo_client() -> SFTPGoAdminClient:
cfg = v2_cfg.data.sftpgo
if not cfg.admin_api_base:
raise HTTPException(status_code=500, detail="sftpgo enabled but data.sftpgo.admin_api_base is empty")
pw = os.environ.get(cfg.admin_password_env, "")
if not pw:
raise HTTPException(status_code=500, detail=f"missing env: {cfg.admin_password_env}")
shared_root = ray_cfg.shared_root.rstrip("/")
return SFTPGoAdminClient(
admin_api_base=str(cfg.admin_api_base),
admin_user=str(cfg.admin_user),
admin_password=pw,
common_root=f"{shared_root}/common",
)
def _auth(req: Request) -> dict[str, Any]:
token_env = v2_cfg.auth.token_env
admin_token = os.environ.get(token_env, "")
@ -74,6 +115,9 @@ def create_app(config_path: str) -> FastAPI:
def _startup() -> None:
t = threading.Thread(target=scheduler.run_forever, args=(stop_flag,), daemon=True)
t.start()
if int(janitor.interval_s) > 0:
tj = threading.Thread(target=janitor.run_forever, args=(stop_flag,), daemon=True)
tj.start()
@app.on_event("shutdown") @app.on_event("shutdown")
def _shutdown() -> None: def _shutdown() -> None:
@ -93,7 +137,42 @@ def create_app(config_path: str) -> FastAPI:
row = db.create_user(user_id=user_id, display_name=str(display_name) if display_name is not None else None) row = db.create_user(user_id=user_id, display_name=str(display_name) if display_name is not None else None)
except Exception as e: except Exception as e:
raise HTTPException(status_code=409, detail=f"user create failed: {e!r}") raise HTTPException(status_code=409, detail=f"user create failed: {e!r}")
return {"user_id": row.get("user_id", user_id), "state": row.get("state", "ACTIVE")}
# v3.0: create user dir structure (datasets/models/code/jobs/trash).
try:
_ensure_user_dirs(user_id)
except Exception as e:
raise HTTPException(status_code=500, detail=f"failed to create user dirs: {e!r}")
out: dict[str, Any] = {"user_id": row.get("user_id", user_id), "state": row.get("state", "ACTIVE")}
# v3.0: scheme A (password) SFTPGo integration.
if _sftpgo_enabled():
pw = secrets.token_urlsafe(12)
try:
_sftpgo_client().create_user(username=user_id, password=pw, home_dir=_user_home(user_id))
except SFTPGoError as e:
# Make create user idempotent for retries:
# If the user already exists in SFTPGo, reset password and enable it.
if "http error: 409" in str(e):
try:
_sftpgo_client().reset_password(username=user_id, new_password=pw, home_dir=_user_home(user_id))
_sftpgo_client().enable_user(username=user_id, home_dir=_user_home(user_id))
except Exception as e2:
raise HTTPException(status_code=502, detail=f"sftpgo upsert user failed: {e2!r}")
else:
raise HTTPException(status_code=502, detail=f"sftpgo create user failed: {e!r}")
except HTTPException:
raise
except Exception as e:
raise HTTPException(status_code=502, detail=f"sftpgo create user failed: {e!r}")
out["sftp"] = {
"username": user_id,
"password": pw, # one-time return to admin; do not persist plaintext
"host": v2_cfg.data.sftpgo.host,
"port": int(v2_cfg.data.sftpgo.sftp_port),
}
return out
@app.post("/api/v2/users/{user_id}/tokens") @app.post("/api/v2/users/{user_id}/tokens")
async def issue_token(user_id: str, req: Request) -> dict[str, Any]: async def issue_token(user_id: str, req: Request) -> dict[str, Any]:
@ -104,6 +183,12 @@ def create_app(config_path: str) -> FastAPI:
token = db.issue_token(user_id=user_id) token = db.issue_token(user_id=user_id)
return {"user_id": user_id, "token": token} return {"user_id": user_id, "token": token}
@app.get("/api/v2/users")
async def list_users(req: Request, limit: int = 200) -> dict[str, Any]:
_require_admin(req)
lim = max(1, min(int(limit), 1000))
return {"users": db.list_users(limit=lim), "limit": lim}
@app.post("/api/v2/users/{user_id}:disable") @app.post("/api/v2/users/{user_id}:disable")
async def disable_user(user_id: str, req: Request) -> dict[str, Any]: async def disable_user(user_id: str, req: Request) -> dict[str, Any]:
_require_admin(req) _require_admin(req)
@ -111,8 +196,122 @@ def create_app(config_path: str) -> FastAPI:
if not u: if not u:
raise HTTPException(status_code=404, detail="user not found") raise HTTPException(status_code=404, detail="user not found")
db.disable_user(user_id=user_id) db.disable_user(user_id=user_id)
if _sftpgo_enabled():
try:
_sftpgo_client().disable_user(username=user_id, home_dir=_user_home(user_id))
except Exception:
# Best-effort: DB state is source of truth for API auth; SFTPGo sync can lag.
pass
return {"user_id": user_id, "state": "DISABLED"} return {"user_id": user_id, "state": "DISABLED"}
@app.post("/api/v2/users/{user_id}/sftp:reset_password")
async def reset_sftp_password(user_id: str, req: Request) -> dict[str, Any]:
_require_admin(req)
u = db.get_user(user_id)
if not u:
raise HTTPException(status_code=404, detail="user not found")
if not _sftpgo_enabled():
raise HTTPException(status_code=400, detail="sftpgo is not enabled")
pw = secrets.token_urlsafe(12)
try:
_sftpgo_client().reset_password(username=user_id, new_password=pw, home_dir=_user_home(user_id))
except HTTPException:
raise
except Exception as e:
raise HTTPException(status_code=502, detail=f"sftpgo reset password failed: {e!r}")
return {"user_id": user_id, "password": pw}
@app.post("/api/v2/me/sftp:reset_password")
async def reset_my_sftp_password(req: Request) -> dict[str, Any]:
"""
v3.0 WebUI: allow a user to rotate their own SFTPGo password and copy it.
Note: SFTPGo does not allow reading the current password, so this endpoint returns a new one-time password.
"""
subject = _auth(req)
user_id = str(subject["user_id"])
u = db.get_user(user_id)
if not u:
raise HTTPException(status_code=404, detail="user not found")
if not _sftpgo_enabled():
raise HTTPException(status_code=400, detail="sftpgo is not enabled")
pw = secrets.token_urlsafe(12)
try:
_sftpgo_client().reset_password(username=user_id, new_password=pw, home_dir=_user_home(user_id))
_sftpgo_client().enable_user(username=user_id, home_dir=_user_home(user_id))
except Exception as e:
raise HTTPException(status_code=502, detail=f"sftpgo reset password failed: {e!r}")
return {"user_id": user_id, "password": pw}
@app.get("/api/v2/me")
async def me(req: Request) -> dict[str, Any]:
subject = _auth(req)
user_id = str(subject["user_id"])
try:
_ensure_user_dirs(user_id)
except Exception:
# Best-effort: user may still be able to use API even if FS init fails.
pass
# Best-effort: reconcile SFTPGo user to include /common read-only mounts.
# This makes the virtual folders visible without requiring a password reset.
if _sftpgo_enabled() and not subject.get("is_admin"):
try:
_sftpgo_client().enable_user(username=user_id, home_dir=_user_home(user_id))
except Exception:
pass
home = _user_home(user_id)
out: dict[str, Any] = {
"user_id": user_id,
"is_admin": bool(subject.get("is_admin")),
"paths": {
"home": home,
"datasets": f"{home}/datasets",
"models": f"{home}/models",
"code": f"{home}/code",
"jobs": f"{home}/jobs",
"trash_jobs": f"{home}/trash/jobs",
},
"retention": {
"jobs_trash_after_days": int(v2_cfg.data.retention.jobs_trash_after_days),
"jobs_purge_after_days": int(v2_cfg.data.retention.jobs_purge_after_days),
},
}
if _sftpgo_enabled():
out["sftp"] = {
"host": v2_cfg.data.sftpgo.host,
"port": int(v2_cfg.data.sftpgo.sftp_port),
"username": user_id,
}
return out
@app.get("/api/v2/tasks")
async def list_tasks(req: Request, limit: int = 200, offset: int = 0, states: str | None = None) -> dict[str, Any]:
subject = _auth(req)
lim = max(1, min(int(limit), 1000))
off = max(0, int(offset))
state_list: list[str] | None = None
if states:
raw = [s.strip() for s in str(states).split(",") if s.strip()]
# Keep it strict to avoid typos silently returning empty results.
allowed = {
"QUEUED",
"PENDING_RESOURCES",
"SUBMITTING",
"SUBMITTED",
"RUNNING",
"SUCCEEDED",
"FAILED",
"CANCELED",
}
for s in raw:
if s not in allowed:
raise HTTPException(status_code=400, detail=f"invalid state: {s}")
state_list = raw or None
if subject.get("is_admin"):
tasks = db.list_tasks(user_id=None, states=state_list, limit=lim, offset=off)
else:
tasks = db.list_tasks(user_id=str(subject["user_id"]), states=state_list, limit=lim, offset=off)
return {"tasks": tasks, "limit": lim, "offset": off, "has_more": bool(len(tasks) == lim)}
@app.post("/api/v2/tasks") @app.post("/api/v2/tasks")
async def submit_task(req: Request) -> dict[str, Any]: async def submit_task(req: Request) -> dict[str, Any]:
subject = _auth(req) subject = _auth(req)
@ -126,13 +325,40 @@ def create_app(config_path: str) -> FastAPI:
except Exception as e: except Exception as e:
raise HTTPException(status_code=400, detail=f"invalid jobspec: {e!r}") raise HTTPException(status_code=400, detail=f"invalid jobspec: {e!r}")
# v2.5 constraint: training inputs must come from /private/common (dev/prod统一)。 # v3.0 path policy:
common_prefix = ray_cfg.shared_root.rstrip("/") + "/common/" # - code_path: only allow /private/common/...
for k, v in (("code_path", spec.code_path), ("train_file", spec.train_file), ("val_file", spec.val_file)): # - train/val: allow /private/common/datasets/... OR /private/users/<me>/datasets/...
# - model_id: if it looks like a local path (/private/...), allow only models dirs:
# /private/common/models/... OR /private/users/<me>/models/...
root = ray_cfg.shared_root.rstrip("/")
common_prefix = f"{root}/common/"
user_prefix = f"{root}/users/{str(subject['user_id']).strip()}/"
common_datasets_prefix = f"{common_prefix}datasets/"
user_datasets_prefix = f"{user_prefix}datasets/"
common_models_prefix = f"{common_prefix}models/"
user_models_prefix = f"{user_prefix}models/"
if not str(spec.code_path).startswith(common_prefix):
raise HTTPException(status_code=400, detail=f"code_path must start with {common_prefix}")
for k, v in (("train_file", spec.train_file), ("val_file", spec.val_file)):
if v is None:
continue
if not str(v).startswith(common_prefix): sv = str(v)
raise HTTPException(status_code=400, detail=f"{k} must start with {common_prefix}") if not (sv.startswith(common_datasets_prefix) or sv.startswith(user_datasets_prefix)):
raise HTTPException(
status_code=400,
detail=f"{k} must start with {common_datasets_prefix} or {user_datasets_prefix}",
)
model_id = str(spec.model_id)
if model_id.startswith(f"{root}/"):
if not (model_id.startswith(common_models_prefix) or model_id.startswith(user_models_prefix)):
raise HTTPException(
status_code=400,
detail=f"model_id local path must start with {common_models_prefix} or {user_models_prefix}",
)
task_id = new_task_id(spec.workload, user_id=str(subject["user_id"]))
db.create_task_v25(
@ -257,4 +483,7 @@ def create_app(config_path: str) -> FastAPI:
return db.list_queue()
return db.list_queue(user_id=str(subject["user_id"]))
# v3.0: minimal WebUI (no server-side session; token stored in browser localStorage).
register_ui_routes(app)
return app

View File

@ -27,12 +27,37 @@ class V2SchedulerConfig:
max_running_tasks: int = 1
@dataclass(frozen=True)
class V2RetentionConfig:
jobs_trash_after_days: int = 3
jobs_purge_after_days: int = 7
janitor_interval_s: int = 3600
@dataclass(frozen=True)
class V2SFTPGoConfig:
enabled: bool = False
host: str = ""
sftp_port: int = 2022
admin_api_base: str = ""
admin_user: str = "admin"
admin_password_env: str = "SFTPGO_ADMIN_PASSWORD"
@dataclass(frozen=True)
class V2DataConfig:
user_root: str
sftpgo: V2SFTPGoConfig
retention: V2RetentionConfig
@dataclass(frozen=True)
class V2Config:
api: V2ApiConfig
auth: V2AuthConfig
sqlite: V2SqliteConfig
scheduler: V2SchedulerConfig
data: V2DataConfig
@staticmethod
def from_root_dict(root: dict[str, Any]) -> "V2Config":
@ -58,9 +83,19 @@ class V2Config:
else:
shared_root = str(root.get("shared_root") or "/private")
data = root.get("data") or {}
if not isinstance(data, dict):
raise ValueError("config.data must be a mapping")
sftpgo = data.get("sftpgo") or {}
retention = data.get("retention") or {}
if not isinstance(sftpgo, dict) or not isinstance(retention, dict):
raise ValueError("config.data.{sftpgo,retention} must be mappings")
default_db_path = f"{shared_root}/common/db/mvp.sqlite3"
db_path = str(sqlite.get("db_path") or default_db_path)
user_root = str(data.get("user_root") or f"{shared_root}/users")
return V2Config(
api=V2ApiConfig(
host=str(api.get("host") or "0.0.0.0"),
@ -73,4 +108,20 @@ class V2Config:
retry_interval_s=int(scheduler.get("retry_interval_s") or 60),
max_running_tasks=int(scheduler.get("max_running_tasks") or 1),
),
data=V2DataConfig(
user_root=user_root,
sftpgo=V2SFTPGoConfig(
enabled=bool(sftpgo.get("enabled") or False),
host=str(sftpgo.get("host") or ""),
sftp_port=int(sftpgo.get("sftp_port") or 2022),
admin_api_base=str(sftpgo.get("admin_api_base") or ""),
admin_user=str(sftpgo.get("admin_user") or "admin"),
admin_password_env=str(sftpgo.get("admin_password_env") or "SFTPGO_ADMIN_PASSWORD"),
),
retention=V2RetentionConfig(
jobs_trash_after_days=int(retention.get("jobs_trash_after_days") or 3),
jobs_purge_after_days=int(retention.get("jobs_purge_after_days") or 7),
janitor_interval_s=int(retention.get("janitor_interval_s") or 3600),
),
),
)

View File

@ -40,7 +40,8 @@ class Db:
user_id TEXT PRIMARY KEY, user_id TEXT PRIMARY KEY,
display_name TEXT, display_name TEXT,
state TEXT NOT NULL, state TEXT NOT NULL,
created_at TEXT NOT NULL created_at TEXT NOT NULL,
updated_at TEXT NOT NULL
) )
""" """
) )
@ -49,6 +50,7 @@ class Db:
CREATE TABLE IF NOT EXISTS api_tokens ( CREATE TABLE IF NOT EXISTS api_tokens (
token_hash TEXT PRIMARY KEY, token_hash TEXT PRIMARY KEY,
user_id TEXT NOT NULL, user_id TEXT NOT NULL,
token_plain TEXT NOT NULL,
created_at TEXT NOT NULL, created_at TEXT NOT NULL,
last_used_at TEXT, last_used_at TEXT,
FOREIGN KEY (user_id) REFERENCES users(user_id) ON DELETE CASCADE FOREIGN KEY (user_id) REFERENCES users(user_id) ON DELETE CASCADE
@ -78,6 +80,15 @@ class Db:
conn.execute("ALTER TABLE tasks ADD COLUMN user_id TEXT") conn.execute("ALTER TABLE tasks ADD COLUMN user_id TEXT")
except sqlite3.OperationalError: except sqlite3.OperationalError:
pass pass
# Best-effort: add missing columns (forward compatibility).
try:
conn.execute("ALTER TABLE users ADD COLUMN updated_at TEXT")
except sqlite3.OperationalError:
pass
try:
conn.execute("ALTER TABLE api_tokens ADD COLUMN token_plain TEXT")
except sqlite3.OperationalError:
pass
conn.execute( conn.execute(
""" """
CREATE TABLE IF NOT EXISTS attempts ( CREATE TABLE IF NOT EXISTS attempts (
@ -168,10 +179,10 @@ class Db:
with self.tx() as conn: with self.tx() as conn:
conn.execute( conn.execute(
""" """
INSERT INTO users (user_id, display_name, state, created_at) INSERT INTO users (user_id, display_name, state, created_at, updated_at)
VALUES (?, ?, 'ACTIVE', ?) VALUES (?, ?, 'ACTIVE', ?, ?)
""", """,
(user_id, display_name, now), (user_id, display_name, now, now),
) )
conn.execute( conn.execute(
"INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'USER_CREATED', ?)", "INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'USER_CREATED', ?)",
@ -183,7 +194,7 @@ class Db:
def disable_user(self, *, user_id: str) -> None: def disable_user(self, *, user_id: str) -> None:
now = _utc_now_iso() now = _utc_now_iso()
with self.tx() as conn: with self.tx() as conn:
conn.execute("UPDATE users SET state = 'DISABLED' WHERE user_id = ?", (user_id,)) conn.execute("UPDATE users SET state = 'DISABLED', updated_at = ? WHERE user_id = ?", (now, user_id))
conn.execute( conn.execute(
"INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'USER_DISABLED', ?)", "INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'USER_DISABLED', ?)",
(now, user_id), (now, user_id),
@ -195,15 +206,18 @@ class Db:
return dict(row) if row else None return dict(row) if row else None
def issue_token(self, *, user_id: str) -> str: def issue_token(self, *, user_id: str) -> str:
# Returns plaintext token once; stores hash only. # Returns plaintext token once.
# Note: For v3.0 WebUI admin convenience we persist the plaintext token so admins
# can re-copy it later. This is acceptable only for internal/dev deployments.
now = _utc_now_iso() now = _utc_now_iso()
token = f"mvp_u_{user_id}_{secrets.token_urlsafe(18)}" token = f"mvp_u_{user_id}_{secrets.token_urlsafe(18)}"
token_hash = self._hash_token(token) token_hash = self._hash_token(token)
with self.tx() as conn: with self.tx() as conn:
conn.execute( conn.execute(
"INSERT INTO api_tokens (token_hash, user_id, created_at) VALUES (?, ?, ?)", "INSERT INTO api_tokens (token_hash, user_id, token_plain, created_at) VALUES (?, ?, ?, ?)",
(token_hash, user_id, now), (token_hash, user_id, token, now),
) )
conn.execute("UPDATE users SET updated_at = ? WHERE user_id = ?", (now, user_id))
conn.execute( conn.execute(
"INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'TOKEN_ISSUED', ?)", "INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'TOKEN_ISSUED', ?)",
(now, user_id), (now, user_id),
@ -226,8 +240,48 @@ class Db:
return None return None
now = _utc_now_iso() now = _utc_now_iso()
conn.execute("UPDATE api_tokens SET last_used_at = ? WHERE token_hash = ?", (now, token_hash)) conn.execute("UPDATE api_tokens SET last_used_at = ? WHERE token_hash = ?", (now, token_hash))
conn.execute("UPDATE users SET updated_at = ? WHERE user_id = ?", (now, str(row["user_id"])))
return {"user_id": row["user_id"], "state": row["state"]} return {"user_id": row["user_id"], "state": row["state"]}
def list_users(self, limit: int = 200) -> list[dict[str, Any]]:
with self._connect() as conn:
rows = conn.execute(
"""
SELECT
u.user_id,
u.display_name,
u.state,
u.created_at,
u.updated_at,
(
SELECT t.token_plain
FROM api_tokens t
WHERE t.user_id = u.user_id
ORDER BY t.created_at DESC
LIMIT 1
) AS token,
(
SELECT t.created_at
FROM api_tokens t
WHERE t.user_id = u.user_id
ORDER BY t.created_at DESC
LIMIT 1
) AS token_created_at,
(
SELECT t.last_used_at
FROM api_tokens t
WHERE t.user_id = u.user_id
ORDER BY t.created_at DESC
LIMIT 1
) AS token_last_used_at
FROM users u
ORDER BY u.created_at DESC
LIMIT ?
""",
(int(limit),),
).fetchall()
return [dict(r) for r in rows]
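# Shape of each returned row (illustrative values; the token fields come from the
# latest-issued API token, or are NULL when the user has none):
#   {"user_id": "alice", "display_name": None, "state": "ACTIVE",
#    "created_at": "2025-01-01T00:00:00Z", "updated_at": "2025-01-02T00:00:00Z",
#    "token": "mvp_u_alice_...", "token_created_at": "...", "token_last_used_at": None}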
def get_task(self, task_id: str) -> dict[str, Any] | None:
with self._connect() as conn:
row = conn.execute("SELECT * FROM tasks WHERE task_id = ?", (task_id,)).fetchone()
@ -269,6 +323,40 @@ class Db:
running = conn.execute(running_sql, tuple(params)).fetchall()
return {"pending": [dict(r) for r in pending], "running": [dict(r) for r in running]}
def list_tasks(
self,
*,
user_id: str | None = None,
states: list[str] | None = None,
limit: int = 200,
offset: int = 0,
) -> list[dict[str, Any]]:
"""
Returns recent tasks (including terminal tasks).
"""
with self._connect() as conn:
params: list[Any] = []
where_clauses: list[str] = []
if user_id is not None:
where_clauses.append("user_id = ?")
params.append(user_id)
if states:
placeholders = ",".join(["?"] * len(states))
where_clauses.append(f"state IN ({placeholders})")
params.extend(states)
where_sql = f" WHERE {' AND '.join(where_clauses)}" if where_clauses else ""
sql = (
"SELECT task_id, user_id, workload, state, nnodes, n_gpus_per_node, latest_attempt_no, created_at, updated_at, error_summary "
"FROM tasks"
f"{where_sql} "
"ORDER BY created_at DESC "
"LIMIT ? OFFSET ?"
)
params.append(int(limit))
params.append(max(0, int(offset)))
rows = conn.execute(sql, tuple(params)).fetchall()
return [dict(r) for r in rows]
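# Illustrative usage (an assumption about the handler, which is not shown in this file):
# the GET /api/v2/tasks endpoint is expected to call something like
#   db.list_tasks(user_id=caller_id, states=["SUCCEEDED", "FAILED", "CANCELED"], limit=25, offset=0)
# and report pagination via a "has_more" flag, which the WebUI pager and the API
# tests further below rely on.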
def count_running(self) -> int:
with self._connect() as conn:
row = conn.execute(
@ -383,3 +471,25 @@ class Db:
"INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (?, ?, 'ATTEMPT_UPDATE', ?)", "INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (?, ?, 'ATTEMPT_UPDATE', ?)",
(task_id, now, None), (task_id, now, None),
) )
def list_ended_attempts_before(self, *, end_time_le: str, limit: int = 2000) -> list[dict[str, Any]]:
"""
Used by the jobs janitor:
- Only considers tasks with a non-null user_id (v2.5+).
- Returns attempts that have end_time <= end_time_le.
"""
with self._connect() as conn:
rows = conn.execute(
"""
SELECT t.task_id, t.user_id, a.ray_submission_id, a.end_time
FROM attempts a
JOIN tasks t ON t.task_id = a.task_id
WHERE t.user_id IS NOT NULL
AND a.end_time IS NOT NULL
AND a.end_time <= ?
ORDER BY a.end_time ASC
LIMIT ?
""",
(str(end_time_le), int(limit)),
).fetchall()
return [dict(r) for r in rows]
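# Example row (illustrative values, in the same shape the janitor tests use below):
#   {"task_id": "t1", "user_id": "alice", "ray_submission_id": "sid-a01",
#    "end_time": "2025-01-06T00:00:00Z"}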

View File

@ -0,0 +1,105 @@
from __future__ import annotations
import os
import shutil
import time
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from .db import Db
def _parse_iso_z(ts: str) -> datetime:
# Stored as "YYYY-MM-DDTHH:MM:SSZ" (naive UTC). Parse into aware UTC.
return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)
def _iso_z(dt: datetime) -> str:
return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")
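# Round-trip example (illustrative):
#   _parse_iso_z("2025-01-10T00:00:00Z") == datetime(2025, 1, 10, tzinfo=timezone.utc)
#   _iso_z(datetime(2025, 1, 10, tzinfo=timezone.utc)) == "2025-01-10T00:00:00Z"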
@dataclass
class JobsJanitor:
db: Db
user_root: str
trash_after_days: int = 3
purge_after_days: int = 7
interval_s: int = 3600
def __post_init__(self) -> None:
if int(self.trash_after_days) < 0 or int(self.purge_after_days) < 0:
raise ValueError("retention days must be non-negative")
if int(self.purge_after_days) and int(self.purge_after_days) < int(self.trash_after_days):
raise ValueError("purge_after_days must be >= trash_after_days")
def _job_dir(self, *, user_id: str, ray_submission_id: str) -> str:
base = self.user_root.rstrip("/")
return f"{base}/{user_id}/jobs/{ray_submission_id}"
def _trash_dir(self, *, user_id: str, ray_submission_id: str) -> str:
base = self.user_root.rstrip("/")
return f"{base}/{user_id}/trash/jobs/{ray_submission_id}"
def tick_once(self, *, now: datetime | None = None, limit: int = 2000) -> None:
if int(self.trash_after_days) <= 0 and int(self.purge_after_days) <= 0:
return
now_dt = now or datetime.now(timezone.utc)
move_cutoff = now_dt - timedelta(days=int(self.trash_after_days))
move_cutoff_iso = _iso_z(move_cutoff)
rows = self.db.list_ended_attempts_before(end_time_le=move_cutoff_iso, limit=int(limit))
for r in rows:
user_id = str(r.get("user_id") or "").strip()
sid = str(r.get("ray_submission_id") or "").strip()
end_time = str(r.get("end_time") or "").strip()
if not user_id or not sid or not end_time:
continue
try:
ended_at = _parse_iso_z(end_time)
except Exception:
continue
age_days = (now_dt - ended_at).total_seconds() / 86400.0
src = self._job_dir(user_id=user_id, ray_submission_id=sid)
dst = self._trash_dir(user_id=user_id, ray_submission_id=sid)
dst_parent = os.path.dirname(dst)
# Between trash and purge: ensure in trash.
if age_days >= float(self.trash_after_days) and age_days < float(self.purge_after_days or 10**9):
if os.path.exists(src) and not os.path.exists(dst):
os.makedirs(dst_parent, exist_ok=True)
try:
shutil.move(src, dst)
except Exception:
pass
continue
# Purge: move to trash (if still in jobs) then delete from trash.
if int(self.purge_after_days) > 0 and age_days >= float(self.purge_after_days):
if os.path.exists(src) and not os.path.exists(dst):
os.makedirs(dst_parent, exist_ok=True)
try:
shutil.move(src, dst)
except Exception:
pass
if os.path.exists(dst):
try:
shutil.rmtree(dst)
except Exception:
pass
def run_forever(self, stop_flag: object) -> None:
try:
is_set = stop_flag.is_set # type: ignore[attr-defined]
except Exception:
raise ValueError("stop_flag must be a threading.Event-like object with is_set()")
while not is_set():
try:
self.tick_once()
except Exception:
pass
time.sleep(max(1, int(self.interval_s)))
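# Usage sketch (illustrative, not part of this commit): the janitor is meant to run on
# a background daemon thread with a threading.Event as its stop flag; the API server is
# assumed to wire it up roughly like this. The user_root and retention values below are
# placeholders matching the defaults above.
import threading


def start_jobs_janitor(db: Db, user_root: str = "/private/users") -> threading.Event:
    stop = threading.Event()
    janitor = JobsJanitor(db=db, user_root=user_root, trash_after_days=3, purge_after_days=7, interval_s=3600)
    threading.Thread(target=janitor.run_forever, args=(stop,), daemon=True).start()
    return stop  # caller invokes stop.set() on shutdown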

View File

@ -0,0 +1,221 @@
from __future__ import annotations
import base64
import json
from dataclasses import dataclass
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen
class SFTPGoError(RuntimeError):
pass
@dataclass(frozen=True)
class SFTPGoAdminClient:
"""
Minimal SFTPGo admin API client (v3.0).
SFTPGo OpenAPI documents the admin token flow:
- GET /api/v2/token with HTTP BasicAuth -> returns {"access_token": "..."}.
- Use Authorization: Bearer <token> for admin endpoints like POST /api/v2/users.
See upstream OpenAPI for details:
https://raw.githubusercontent.com/drakkan/sftpgo/main/openapi/openapi.yaml
"""
admin_api_base: str
admin_user: str
admin_password: str
common_root: str = "/private/common"
def _url(self, path: str) -> str:
base = self.admin_api_base.rstrip("/")
p = path if path.startswith("/") else f"/{path}"
return f"{base}{p}"
def _basic_auth_header(self) -> str:
raw = f"{self.admin_user}:{self.admin_password}".encode("utf-8")
return "Basic " + base64.b64encode(raw).decode("ascii")
def _get_json(self, url: str, headers: dict[str, str]) -> dict:
req = Request(url, headers=headers, method="GET")
try:
with urlopen(req, timeout=10) as resp:
return json.loads(resp.read().decode("utf-8"))
except HTTPError as e:
raise SFTPGoError(f"sftpgo http error: {e.code} {e.reason}") from e
except URLError as e:
raise SFTPGoError(f"sftpgo connection error: {e!r}") from e
def _post_json(self, url: str, payload: dict, headers: dict[str, str]) -> None:
data = json.dumps(payload).encode("utf-8")
req = Request(url, data=data, headers=headers, method="POST")
try:
with urlopen(req, timeout=10) as resp:
_ = resp.read()
except HTTPError as e:
raise SFTPGoError(f"sftpgo http error: {e.code} {e.reason}") from e
except URLError as e:
raise SFTPGoError(f"sftpgo connection error: {e!r}") from e
def _put_json(self, url: str, payload: dict, headers: dict[str, str]) -> None:
data = json.dumps(payload).encode("utf-8")
req = Request(url, data=data, headers=headers, method="PUT")
try:
with urlopen(req, timeout=10) as resp:
_ = resp.read()
except HTTPError as e:
raise SFTPGoError(f"sftpgo http error: {e.code} {e.reason}") from e
except URLError as e:
raise SFTPGoError(f"sftpgo connection error: {e!r}") from e
def _admin_token(self) -> str:
url = self._url("/token")
obj = self._get_json(url, headers={"Authorization": self._basic_auth_header()})
tok = str(obj.get("access_token") or "").strip()
if not tok:
raise SFTPGoError("sftpgo token response missing access_token")
return tok
def _auth_headers(self, tok: str) -> dict[str, str]:
return {"Authorization": f"Bearer {tok}", "Content-Type": "application/json"}
def _ensure_folder(self, *, tok: str, name: str, mapped_path: str) -> None:
url = self._url("/folders")
try:
self._post_json(url, {"name": name, "mapped_path": mapped_path}, headers=self._auth_headers(tok))
except SFTPGoError as e:
# Idempotent + self-healing:
# If it already exists, update mapped_path to the desired value.
if "409" in str(e):
self._put_json(
self._url(f"/folders/{name}"),
{"name": name, "mapped_path": mapped_path},
headers=self._auth_headers(tok),
)
return
raise
def _ensure_common_folders(self, *, tok: str) -> None:
# Important: do NOT map datasets to /private/common/datasets because that path is
# a symlink farm (e.g. gsm8k -> /private/datasets/gsm8k) and SFTPGo WebClient
# will treat symlink traversal as escaping the virtual folder root, resulting
# in "permission denied". Map directly to the real datasets root.
common = self.common_root.rstrip("/")
if common.endswith("/common"):
shared_root = common[: -len("/common")] or "/private"
else:
shared_root = "/private"
self._ensure_folder(tok=tok, name="common_datasets", mapped_path=f"{shared_root}/datasets")
# Expose HF cache read-only so users can inspect downloaded models/datasets.
self._ensure_folder(tok=tok, name="common_hf", mapped_path=f"{shared_root}/hf")
def _get_user(self, *, tok: str, username: str) -> dict:
url = self._url(f"/users/{username}")
return self._get_json(url, headers={"Authorization": f"Bearer {tok}"})
def _put_user(self, *, tok: str, username: str, payload: dict) -> None:
url = self._url(f"/users/{username}")
self._put_json(url, payload, headers=self._auth_headers(tok))
def _apply_common_readonly(self, user_payload: dict) -> dict:
# Path-based permissions: make /common/* read-only while keeping home writeable.
perms = dict(user_payload.get("permissions") or {"/": ["*"]})
# Ensure /common is visible as a directory and can be traversed.
perms["/common"] = ["list"]
perms["/common/datasets"] = ["list", "download"]
perms["/common/hf"] = ["list", "download"]
user_payload["permissions"] = perms
desired_vf = [
{"name": "common_datasets", "virtual_path": "/common/datasets"},
{"name": "common_hf", "virtual_path": "/common/hf"},
]
existing = user_payload.get("virtual_folders") or []
if not isinstance(existing, list):
existing = []
seen = {(vf.get("name"), vf.get("virtual_path")) for vf in existing if isinstance(vf, dict)}
merged = list(existing)
for vf in desired_vf:
key = (vf["name"], vf["virtual_path"])
if key not in seen:
merged.append(vf)
user_payload["virtual_folders"] = merged
return user_payload
def create_user(self, *, username: str, password: str, home_dir: str) -> None:
tok = self._admin_token()
self._ensure_common_folders(tok=tok)
url = self._url("/users")
payload: dict = {
"status": 1,
"username": username,
"password": password,
"home_dir": home_dir,
"permissions": {
"/": ["*"],
"/common": ["list"],
"/common/datasets": ["list", "download"],
"/common/hf": ["list", "download"],
},
"virtual_folders": [
{"name": "common_datasets", "virtual_path": "/common/datasets"},
{"name": "common_hf", "virtual_path": "/common/hf"},
],
}
self._post_json(
url,
payload,
headers=self._auth_headers(tok),
)
def disable_user(self, *, username: str, home_dir: str) -> None:
tok = self._admin_token()
self._ensure_common_folders(tok=tok)
cur = self._get_user(tok=tok, username=username)
payload: dict = {
"username": username,
"status": 0,
"home_dir": home_dir,
"uid": cur.get("uid", 0),
"gid": cur.get("gid", 0),
"permissions": cur.get("permissions") or {"/": ["*"]},
"virtual_folders": cur.get("virtual_folders") or [],
}
self._apply_common_readonly(payload)
self._put_user(tok=tok, username=username, payload=payload)
def enable_user(self, *, username: str, home_dir: str) -> None:
tok = self._admin_token()
self._ensure_common_folders(tok=tok)
cur = self._get_user(tok=tok, username=username)
payload: dict = {
"username": username,
"status": 1,
"home_dir": home_dir,
"uid": cur.get("uid", 0),
"gid": cur.get("gid", 0),
"permissions": cur.get("permissions") or {"/": ["*"]},
"virtual_folders": cur.get("virtual_folders") or [],
}
self._apply_common_readonly(payload)
self._put_user(tok=tok, username=username, payload=payload)
def reset_password(self, *, username: str, new_password: str, home_dir: str) -> None:
tok = self._admin_token()
self._ensure_common_folders(tok=tok)
cur = self._get_user(tok=tok, username=username)
payload: dict = {
"username": username,
"status": 1,
"home_dir": home_dir,
"uid": cur.get("uid", 0),
"gid": cur.get("gid", 0),
"permissions": cur.get("permissions") or {"/": ["*"]},
"virtual_folders": cur.get("virtual_folders") or [],
"password": new_password,
}
self._apply_common_readonly(payload)
self._put_user(tok=tok, username=username, payload=payload)
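# Usage sketch (illustrative, not part of this commit): provisioning one user through the
# client above. The base URL is expected to include the /api/v2 prefix (as in the tests
# below); the admin credentials and the one-time password are placeholders.
def _example_provision() -> None:
    client = SFTPGoAdminClient(
        admin_api_base="http://argus-sftpgo:8080/api/v2",
        admin_user="admin",
        admin_password="change-me",
        common_root="/private/common",
    )
    client.create_user(username="alice", password="one-time-password", home_dir="/private/users/alice")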

View File

@ -0,0 +1,629 @@
from __future__ import annotations
import html
import json
from fastapi import FastAPI
from fastapi.responses import HTMLResponse, RedirectResponse
_BASE_CSS = """
:root { --bg:#0b1020; --panel:#111a33; --muted:#95a3c6; --fg:#e8eeff; --accent:#7aa2ff; --danger:#ff6b6b; --ok:#3ddc97; }
* { box-sizing: border-box; }
body { margin:0; font-family: ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Helvetica, Arial; background:var(--bg); color:var(--fg); }
a { color:var(--accent); text-decoration:none; }
.layout { display:flex; min-height:100vh; }
.nav { width: 240px; padding:16px; background: linear-gradient(180deg, #0e1630, #0b1020); border-right: 1px solid rgba(255,255,255,0.06); }
.brand { font-weight: 700; letter-spacing: .2px; margin-bottom: 12px; }
.nav a { display:block; padding:10px 10px; border-radius:10px; color: var(--fg); opacity: .9; }
.nav a.active { background: rgba(122,162,255,0.14); border: 1px solid rgba(122,162,255,0.22); }
.nav a:hover { background: rgba(255,255,255,0.06); }
.main { flex:1; padding: 20px 24px; }
.card { background: rgba(255,255,255,0.04); border: 1px solid rgba(255,255,255,0.08); border-radius: 14px; padding: 16px; }
.row { display:flex; gap: 12px; align-items:center; flex-wrap: wrap; }
.muted { color: var(--muted); }
.btn { border: 1px solid rgba(255,255,255,0.16); background: rgba(255,255,255,0.06); color: var(--fg); padding: 10px 12px; border-radius: 10px; cursor: pointer; }
.btn:hover { background: rgba(255,255,255,0.10); }
.btn.danger { border-color: rgba(255,107,107,0.35); background: rgba(255,107,107,0.10); }
.pill { display:inline-block; padding: 2px 10px; border-radius: 999px; border: 1px solid rgba(255,255,255,0.16); font-size: 12px; }
.pill.ok { border-color: rgba(61,220,151,0.35); background: rgba(61,220,151,0.12); }
.pill.bad { border-color: rgba(255,107,107,0.35); background: rgba(255,107,107,0.12); }
textarea, input { width: 100%; color: var(--fg); background: rgba(255,255,255,0.06); border: 1px solid rgba(255,255,255,0.12); border-radius: 12px; padding: 10px 12px; outline: none; }
button:disabled { opacity: .45; cursor: not-allowed; }
pre { white-space: pre-wrap; word-break: break-word; }
table { width:100%; border-collapse: collapse; }
th, td { padding: 10px 8px; border-bottom: 1px solid rgba(255,255,255,0.08); text-align:left; }
""".strip()
_BASE_JS = """
function mvpTokenGet() {
return (localStorage.getItem("mvp_token") || "").trim();
}
function mvpTokenSet(v) {
localStorage.setItem("mvp_token", (v || "").trim());
}
function mvpSftpPasswordGet() {
return (localStorage.getItem("mvp_sftp_password") || "").trim();
}
function mvpSftpPasswordSet(v) {
localStorage.setItem("mvp_sftp_password", (v || "").trim());
}
async function apiFetch(path, opts) {
opts = opts || {};
opts.headers = opts.headers || {};
const tok = mvpTokenGet();
if (tok) opts.headers["Authorization"] = "Bearer " + tok;
return fetch(path, opts);
}
async function apiJson(path, opts) {
const resp = await apiFetch(path, opts);
const text = await resp.text();
if (!resp.ok) {
const err = new Error("HTTP " + resp.status);
err.status = resp.status;
err.body = text;
throw err;
}
return JSON.parse(text);
}
function fmtJson(obj) {
try { return JSON.stringify(obj, null, 2); } catch (e) { return String(obj); }
}
function curOriginWithPort(port) {
const proto = window.location.protocol;
const host = window.location.hostname;
return proto + "//" + host + ":" + port;
}
async function copyText(v) {
if (!v) return false;
try {
await navigator.clipboard.writeText(v);
return true;
} catch (e) {
// Fallback for non-secure contexts (http) or older browsers.
try {
const ta = document.createElement("textarea");
ta.value = v;
ta.style.position = "fixed";
ta.style.opacity = "0";
document.body.appendChild(ta);
ta.focus();
ta.select();
const ok = document.execCommand("copy");
document.body.removeChild(ta);
return ok;
} catch (e2) {
return false;
}
}
}
document.addEventListener("DOMContentLoaded", () => {
const el = document.getElementById("nav-ray-dashboard");
if (el) el.href = curOriginWithPort(8265);
});
""".strip()
def _nav(active: str) -> str:
links = [
("login", "/ui/login", "Login"),
("tasks", "/ui/tasks", "Tasks"),
("new", "/ui/tasks/new", "New Task"),
("data", "/ui/data", "Data"),
("admin", "/ui/admin", "Admin"),
("ray", "#", "Ray Dashboard"),
]
items = []
for key, href, label in links:
cls = "active" if key == active else ""
extra = ""
if key == "ray":
extra = ' id="nav-ray-dashboard" target="_blank" rel="noopener"'
items.append(f'<a class="{cls}" href="{href}"{extra}>{html.escape(label)}</a>')
return "\n".join(items)
def _page(title: str, active: str, body: str, script: str = "") -> str:
return f"""<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>{html.escape(title)}</title>
<style>{_BASE_CSS}</style>
</head>
<body>
<div class="layout">
<nav class="nav">
<div class="brand">Argus MVP</div>
{_nav(active)}
<div style="margin-top:12px" class="muted">
Token stored in browser localStorage as <code>mvp_token</code>.
</div>
</nav>
<main class="main">
{body}
</main>
</div>
<script>{_BASE_JS}</script>
<script>{script}</script>
</body>
</html>"""
def register_ui_routes(app: FastAPI) -> None:
@app.get("/ui")
async def ui_root() -> RedirectResponse:
return RedirectResponse(url="/ui/tasks")
@app.get("/ui/login")
async def ui_login() -> HTMLResponse:
body = """
<h1>Login</h1>
<div class="card">
<div class="muted">Paste your API token (without the <code>Bearer</code> prefix).</div>
<div style="height:10px"></div>
<input id="tok" placeholder="token..." />
<div style="height:10px"></div>
<div class="row">
<button class="btn" id="save">Save</button>
<button class="btn" id="clear">Clear</button>
<a href="/ui/tasks" class="btn" style="display:inline-block">Go to Tasks</a>
</div>
<div style="height:10px"></div>
<div id="msg" class="muted"></div>
</div>
""".strip()
script = """
const tokEl = document.getElementById("tok");
const msg = document.getElementById("msg");
tokEl.value = mvpTokenGet();
document.getElementById("save").onclick = () => { mvpTokenSet(tokEl.value); msg.textContent = "Saved."; };
document.getElementById("clear").onclick = () => { mvpTokenSet(""); tokEl.value = ""; msg.textContent = "Cleared."; };
""".strip()
return HTMLResponse(content=_page("Login", "login", body, script))
@app.get("/ui/tasks")
async def ui_tasks() -> HTMLResponse:
body = """
<h1>Tasks</h1>
<div class="card">
<div class="row">
<button class="btn" id="refresh">Refresh</button>
<a class="btn" href="/ui/tasks/new" style="display:inline-block">New Task</a>
</div>
<div style="height:10px"></div>
<div id="out" class="muted">Loading...</div>
</div>
""".strip()
script = """
const out = document.getElementById("out");
async function refresh() {
out.textContent = "Loading...";
try {
const q = await apiJson("/api/v2/queue");
const completedLimit = 25;
const completedOffset = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0;
const done = await apiJson("/api/v2/tasks?limit=" + completedLimit + "&offset=" + completedOffset + "&states=SUCCEEDED,FAILED,CANCELED");
function pill(state) {
const s = String(state || "");
if (s === "SUCCEEDED") return `<span class="pill ok">${s}</span>`;
if (s === "FAILED") return `<span class="pill bad">${s}</span>`;
if (s === "CANCELED") return `<span class="pill">${s}</span>`;
if (s === "RUNNING") return `<span class="pill ok">${s}</span>`;
if (s === "QUEUED" || s === "PENDING_RESOURCES" || s === "SUBMITTING" || s === "SUBMITTED") return `<span class="pill">${s}</span>`;
return `<span class="pill">${s}</span>`;
}
function row(t) {
const id = t.task_id;
return `<tr>
<td><a href="/ui/tasks/${id}">${id}</a></td>
<td>${t.workload}</td>
<td>${pill(t.state)}</td>
<td>${t.nnodes} x ${t.n_gpus_per_node} GPU</td>
<td>${t.updated_at || ""}</td>
</tr>`;
}
const running = (q.running || []).map(row).join("");
const pending = (q.pending || []).map(row).join("");
const doneRows = (done.tasks || []).map(row).join("");
const pageNo = Math.floor(completedOffset / completedLimit) + 1;
const prevDisabled = completedOffset <= 0;
const nextDisabled = !done.has_more;
out.innerHTML = `
<div class="muted">Tip: configure token in <a href="/ui/login">Login</a>.</div>
<div style="height:10px"></div>
<h3>Running</h3>
<table><thead><tr><th>Task</th><th>Workload</th><th>State</th><th>Resources</th><th>Updated</th></tr></thead><tbody>${running || "<tr><td colspan=5 class=muted>(none)</td></tr>"}</tbody></table>
<div style="height:12px"></div>
<h3>Pending</h3>
<table><thead><tr><th>Task</th><th>Workload</th><th>State</th><th>Resources</th><th>Updated</th></tr></thead><tbody>${pending || "<tr><td colspan=5 class=muted>(none)</td></tr>"}</tbody></table>
<div style="height:12px"></div>
<h3>Completed</h3>
<div class="row" style="justify-content: space-between; margin-bottom: 8px;">
<div class="muted">Page ${pageNo}</div>
<div class="row">
<button class="btn" id="done-prev" ${prevDisabled ? "disabled" : ""}>Prev</button>
<button class="btn" id="done-next" ${nextDisabled ? "disabled" : ""}>Next</button>
</div>
</div>
<table><thead><tr><th>Task</th><th>Workload</th><th>State</th><th>Resources</th><th>Updated</th></tr></thead><tbody>${doneRows || "<tr><td colspan=5 class=muted>(none)</td></tr>"}</tbody></table>
`;
const prevBtn = document.getElementById("done-prev");
const nextBtn = document.getElementById("done-next");
if (prevBtn) prevBtn.onclick = () => {
const cur = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0;
const next = Math.max(0, cur - completedLimit);
localStorage.setItem("mvp_completed_offset", String(next));
refresh();
};
if (nextBtn) nextBtn.onclick = () => {
const cur = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0;
const next = cur + completedLimit;
localStorage.setItem("mvp_completed_offset", String(next));
refresh();
};
} catch (e) {
out.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
}
}
document.getElementById("refresh").onclick = refresh;
refresh();
""".strip()
return HTMLResponse(content=_page("Tasks", "tasks", body, script))
@app.get("/ui/tasks/new")
async def ui_new_task() -> HTMLResponse:
ppo = """# PPO TaskSpec (YAML)
workload: ppo
nnodes: 2
n_gpus_per_node: 4
code_path: /private/common/code/verl/verl_repo
train_file: /private/common/datasets/gsm8k/train.parquet
val_file: /private/common/datasets/gsm8k/test.parquet
model_id: Qwen/Qwen2.5-0.5B-Instruct
""".strip()
grpo = """# GRPO TaskSpec (YAML)
workload: grpo
nnodes: 2
n_gpus_per_node: 4
code_path: /private/common/code/verl/verl_repo
train_file: /private/common/datasets/gsm8k/train.parquet
val_file: /private/common/datasets/gsm8k/test.parquet
model_id: Qwen/Qwen2.5-0.5B-Instruct
""".strip()
sft = """# SFT TaskSpec (YAML)
workload: sft
nnodes: 1
n_gpus_per_node: 1
code_path: /private/common/code/verl/verl_repo
train_file: /private/common/datasets/gsm8k_sft/train.parquet
val_file: /private/common/datasets/gsm8k_sft/test.parquet
model_id: Qwen/Qwen2.5-0.5B-Instruct
""".strip()
body = f"""
<h1>New Task</h1>
<div class="card">
<div class="muted">Paste TaskSpec YAML and submit to API server. Note: <code>code_path</code> is required (v3.0 does not execute user code; use the common snapshot).</div>
<div style="height:10px"></div>
<div class="row">
<button class="btn" id="tpl-ppo">PPO example</button>
<button class="btn" id="tpl-grpo">GRPO example</button>
<button class="btn" id="tpl-sft">SFT example</button>
</div>
<div style="height:10px"></div>
<textarea id="yaml" rows="16">{html.escape(ppo)}</textarea>
<div style="height:10px"></div>
<div class="row">
<button class="btn" id="submit">Submit</button>
<a class="btn" href="/ui/tasks" style="display:inline-block">Back</a>
</div>
<div style="height:10px"></div>
<pre id="msg" class="muted"></pre>
</div>
""".strip()
tpl_ppo = json.dumps(ppo)
tpl_grpo = json.dumps(grpo)
tpl_sft = json.dumps(sft)
script = (
"""
const msg = document.getElementById("msg");
const yamlEl = document.getElementById("yaml");
const TPL_PPO = __TPL_PPO__;
const TPL_GRPO = __TPL_GRPO__;
const TPL_SFT = __TPL_SFT__;
document.getElementById("tpl-ppo").onclick = () => { yamlEl.value = TPL_PPO; msg.textContent = ""; };
document.getElementById("tpl-grpo").onclick = () => { yamlEl.value = TPL_GRPO; msg.textContent = ""; };
document.getElementById("tpl-sft").onclick = () => { yamlEl.value = TPL_SFT; msg.textContent = ""; };
document.getElementById("submit").onclick = async () => {
msg.textContent = "Submitting...";
const body = yamlEl.value;
const resp = await apiFetch("/api/v2/tasks", { method: "POST", headers: {"Content-Type":"text/plain"}, body });
const text = await resp.text();
if (!resp.ok) { msg.textContent = "Error: " + resp.status + "\\n" + text; return; }
const obj = JSON.parse(text);
msg.textContent = "OK: " + fmtJson(obj);
if (obj.task_id) window.location.href = "/ui/tasks/" + obj.task_id;
};
""".strip()
.replace("__TPL_PPO__", tpl_ppo)
.replace("__TPL_GRPO__", tpl_grpo)
.replace("__TPL_SFT__", tpl_sft)
)
return HTMLResponse(content=_page("New Task", "new", body, script))
@app.get("/ui/tasks/{task_id}")
async def ui_task_detail(task_id: str) -> HTMLResponse:
safe_id = html.escape(task_id)
body = f"""
<h1>Task: <code>{safe_id}</code></h1>
<div class="card">
<div class="row">
<a class="btn" href="/ui/tasks/{safe_id}/logs" style="display:inline-block">Logs</a>
<button class="btn" id="refresh">Refresh</button>
<button class="btn danger" id="cancel">Cancel</button>
<a class="btn" href="/ui/tasks" style="display:inline-block">Back</a>
</div>
<div style="height:10px"></div>
<pre id="out" class="muted">Loading...</pre>
</div>
""".strip()
script = f"""
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const out = document.getElementById("out");
async function refresh() {{
out.textContent = "Loading...";
const resp = await apiFetch("/api/v2/tasks/{task_id}");
const text = await resp.text();
if (!resp.ok) {{ out.textContent = "Error: " + resp.status + "\\n" + text; return; }}
out.textContent = fmtJson(JSON.parse(text));
}}
document.getElementById("refresh").onclick = refresh;
document.getElementById("cancel").onclick = async () => {{
if (!confirm("Cancel this task?")) return;
const resp = await apiFetch("/api/v2/tasks/{task_id}:cancel", {{ method: "POST" }});
const text = await resp.text();
out.textContent = (resp.ok ? "Canceled.\\n" : "Error: " + resp.status + "\\n") + text;
setTimeout(refresh, 800);
}};
refresh();
""".strip()
return HTMLResponse(content=_page(f"Task {task_id}", "tasks", body, script))
@app.get("/ui/tasks/{task_id}/logs")
async def ui_task_logs(task_id: str) -> HTMLResponse:
safe_id = html.escape(task_id)
body = f"""
<h1>Logs: <code>{safe_id}</code></h1>
<div class="card">
<div class="row">
<button class="btn" id="refresh">Refresh</button>
<label class="muted">Auto refresh <input type="checkbox" id="auto" /></label>
<a class="btn" href="/ui/tasks/{safe_id}" style="display:inline-block">Back</a>
</div>
<div style="height:10px"></div>
<pre id="out" class="muted">Loading...</pre>
</div>
""".strip()
script = f"""
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const out = document.getElementById("out");
let timer = null;
async function refresh() {{
const resp = await apiFetch("/api/v2/tasks/{task_id}/logs?tail=4000");
const text = await resp.text();
out.textContent = resp.ok ? text : ("Error: " + resp.status + "\\n" + text);
}}
document.getElementById("refresh").onclick = refresh;
document.getElementById("auto").onchange = (e) => {{
if (e.target.checked) {{
timer = setInterval(refresh, 2000);
}} else {{
if (timer) clearInterval(timer);
timer = null;
}}
}};
refresh();
""".strip()
return HTMLResponse(content=_page(f"Logs {task_id}", "tasks", body, script))
@app.get("/ui/data")
async def ui_data() -> HTMLResponse:
body = """
<h1>Data</h1>
<div class="card">
<div class="muted">User files live under your home directory. Keep long-term artifacts in <code>models/</code> or <code>datasets/</code>.</div>
<div style="height:14px"></div>
<div class="row">
<div style="flex:1; min-width:260px">
<div class="muted">Username</div>
<div style="height:6px"></div>
<div class="row" style="gap:8px">
<input id="u" readonly />
<button class="btn" id="copy-u">Copy</button>
</div>
</div>
<div style="flex:1; min-width:260px">
<div class="muted">SFTPGo password</div>
<div style="height:6px"></div>
<div class="row" style="gap:8px">
<input id="p" placeholder="Click Reset to generate..." />
<button class="btn" id="copy-p">Copy</button>
<button class="btn" id="reset-p">Reset</button>
</div>
</div>
</div>
<div style="height:12px"></div>
<div class="row">
<a class="btn" id="sftp-web" target="_blank" rel="noopener" href="#">Open SFTPGo Web Client (:8081)</a>
</div>
<div style="height:12px"></div>
<div class="muted">
You can also use an SFTP client (e.g. FileZilla) with the same username/password.
Host: <code id="sftp-host"></code>, Port: <code id="sftp-port"></code>.
</div>
<div style="height:14px"></div>
<pre id="out" class="muted">Loading...</pre>
</div>
""".strip()
script = """
const out = document.getElementById("out");
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const u = document.getElementById("u");
const p = document.getElementById("p");
const sftpWeb = document.getElementById("sftp-web");
const sftpHost = document.getElementById("sftp-host");
const sftpPort = document.getElementById("sftp-port");
document.getElementById("copy-u").onclick = async () => { await copyText(u.value || ""); };
document.getElementById("copy-p").onclick = async () => { await copyText(p.value || ""); };
async function refresh() {
const resp = await apiFetch("/api/v2/me");
const text = await resp.text();
if (!resp.ok) { out.textContent = "Error: " + resp.status + "\\n" + text; return; }
const obj = JSON.parse(text);
u.value = (obj.user_id || "");
const cached = mvpSftpPasswordGet();
if (cached) p.value = cached;
const host = curOriginWithPort(8081);
sftpWeb.href = host + "/web/client";
sftpHost.textContent = (obj.sftp && obj.sftp.host) ? obj.sftp.host : window.location.hostname;
sftpPort.textContent = (obj.sftp && obj.sftp.port) ? String(obj.sftp.port) : "2022";
out.textContent = fmtJson(obj);
}
document.getElementById("reset-p").onclick = async () => {
p.value = "";
const resp = await apiFetch("/api/v2/me/sftp:reset_password", { method: "POST" });
const text = await resp.text();
if (!resp.ok) { out.textContent = "Error: " + resp.status + "\\n" + text; return; }
const obj = JSON.parse(text);
p.value = obj.password || "";
mvpSftpPasswordSet(p.value);
out.textContent = "SFTPGo password rotated.\\n\\n" + fmtJson(obj);
};
refresh();
""".strip()
return HTMLResponse(content=_page("Data", "data", body, script))
@app.get("/ui/admin")
async def ui_admin() -> HTMLResponse:
body = """
<h1>Admin</h1>
<div class="card">
<div class="muted">
This page requires the <code>admin</code> token (set it in <a href="/ui/login">Login</a>).
</div>
<div style="height:14px"></div>
<h3>Create user</h3>
<div class="row">
<input id="new-user-id" placeholder="user_id (e.g. alice)" style="max-width:320px" />
<input id="new-display-name" placeholder="display_name (optional)" style="max-width:320px" />
<button class="btn" id="create-user">Create</button>
</div>
<div style="height:10px"></div>
<pre id="create-msg" class="muted"></pre>
<div style="height:14px"></div>
<div class="row">
<button class="btn" id="refresh">Refresh</button>
</div>
<div style="height:10px"></div>
<div id="out" class="muted">Loading...</div>
</div>
""".strip()
script = """
const out = document.getElementById("out");
const createMsg = document.getElementById("create-msg");
const userIdEl = document.getElementById("new-user-id");
const displayNameEl = document.getElementById("new-display-name");
function esc(s) {
s = String(s || "");
return s.replaceAll("&","&amp;").replaceAll("<","&lt;").replaceAll(">","&gt;");
}
async function refresh() {
out.textContent = "Loading...";
try {
const obj = await apiJson("/api/v2/users?limit=200");
const users = (obj.users || []);
function row(u) {
const uid = u.user_id;
const tok = u.token || "";
const tokShort = tok ? (tok.length > 18 ? (tok.slice(0, 18) + "…") : tok) : "";
const created = u.created_at || "";
const updated = u.updated_at || "";
const tCreated = u.token_created_at || "";
const tUsed = u.token_last_used_at || "";
return `<tr>
<td><code>${esc(uid)}</code></td>
<td class="muted">${esc(created)}</td>
<td class="muted">${esc(updated)}</td>
<td>
<div class="row" style="gap:8px">
<code title="${esc(tok)}">${esc(tokShort)}</code>
<button class="btn" data-copy="${esc(tok)}">Copy</button>
<button class="btn" data-issue="${esc(uid)}">Issue token</button>
</div>
<div class="muted" style="margin-top:6px">token_created_at: ${esc(tCreated)}; last_used_at: ${esc(tUsed)}</div>
</td>
</tr>`;
}
out.innerHTML = `
<table>
<thead><tr><th>User</th><th>Created</th><th>Updated</th><th>Token</th></tr></thead>
<tbody>${users.map(row).join("") || "<tr><td colspan=4 class=muted>(none)</td></tr>"}</tbody>
</table>
`;
for (const btn of out.querySelectorAll("button[data-copy]")) {
btn.onclick = async () => { await copyText(btn.getAttribute("data-copy") || ""); };
}
for (const btn of out.querySelectorAll("button[data-issue]")) {
btn.onclick = async () => {
const uid = btn.getAttribute("data-issue");
if (!uid) return;
try {
const r = await apiJson("/api/v2/users/" + encodeURIComponent(uid) + "/tokens", { method: "POST" });
createMsg.textContent = "Issued token:\\n" + fmtJson(r);
await refresh();
} catch (e) {
createMsg.textContent = "Error issuing token: " + (e.status || "") + "\\n" + (e.body || String(e));
}
};
}
} catch (e) {
out.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
}
}
document.getElementById("refresh").onclick = refresh;
document.getElementById("create-user").onclick = async () => {
createMsg.textContent = "Creating...";
const user_id = (userIdEl.value || "").trim();
const display_name = (displayNameEl.value || "").trim();
if (!user_id) { createMsg.textContent = "user_id is required"; return; }
const payload = { user_id: user_id };
if (display_name) payload.display_name = display_name;
try {
const r = await apiJson("/api/v2/users", { method: "POST", headers: {"Content-Type":"application/json"}, body: JSON.stringify(payload) });
createMsg.textContent = "Created:\\n" + fmtJson(r);
userIdEl.value = "";
displayNameEl.value = "";
await refresh();
} catch (e) {
createMsg.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
}
};
refresh();
""".strip()
return HTMLResponse(content=_page("Admin", "admin", body, script))

View File

@ -16,6 +16,11 @@ def _write_config(tmp_path: Path) -> Path:
"entrypoint_resources": {"worker_node": 1}, "entrypoint_resources": {"worker_node": 1},
"runtime_env": {"env_vars": {}}, "runtime_env": {"env_vars": {}},
}, },
"data": {
# Avoid touching real /private in tests. Keep ray.shared_root as /private
# so existing path validation tests remain unchanged.
"user_root": str(tmp_path / "users"),
},
"service": { "service": {
"api": {"host": "127.0.0.1", "port": 0}, "api": {"host": "127.0.0.1", "port": 0},
"auth": {"token_env": "MVP_INTERNAL_TOKEN"}, "auth": {"token_env": "MVP_INTERNAL_TOKEN"},
@ -95,6 +100,17 @@ def test_task_submit_get_cancel_logs_queue(tmp_path: Path, monkeypatch):
assert r3.status_code == 200
assert "pending" in r3.json()
r3b = c.get("/api/v2/tasks?limit=10", headers=headers)
assert r3b.status_code == 200
assert any(t.get("task_id") == "tid1" for t in r3b.json().get("tasks", []))
r3c = c.get("/api/v2/tasks?limit=10&offset=0&states=QUEUED", headers=headers)
assert r3c.status_code == 200
assert all(t.get("state") == "QUEUED" for t in r3c.json().get("tasks", []))
r3d = c.get("/api/v2/tasks?states=NOPE", headers=headers)
assert r3d.status_code == 400
r4 = c.post("/api/v2/tasks/tid1:cancel", headers=headers)
assert r4.status_code == 200
assert r4.json()["state"] == "CANCELED"
@ -118,6 +134,14 @@ def test_task_submit_get_cancel_logs_queue(tmp_path: Path, monkeypatch):
db.create_attempt(task_id="tid2", attempt_no=1, ray_submission_id="sid2")
db.set_task_state(task_id="tid2", state="RUNNING", latest_attempt_no=1)
r6 = c.get("/api/v2/tasks?limit=1&offset=0&states=RUNNING", headers=headers)
assert r6.status_code == 200
assert any(t.get("task_id") == "tid2" for t in r6.json().get("tasks", []))
r7 = c.get("/api/v2/tasks?limit=1&offset=1&states=RUNNING", headers=headers)
assert r7.status_code == 200
assert "has_more" in r7.json()
r5 = c.get("/api/v2/tasks/tid2/logs?tail=1", headers=headers)
assert r5.status_code == 200
assert r5.text.strip() == "c"
@ -163,3 +187,102 @@ def test_submit_rejects_invalid_jobspec(tmp_path: Path, monkeypatch):
with TestClient(app) as c:
r = c.post("/api/v2/tasks", headers={"authorization": "Bearer token1"}, data="workload: nope\n")
assert r.status_code == 400
def test_me_sftp_reset_password_disabled_returns_400(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg_path = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "token1")
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
# seed user + token
from argus.service.config import V2Config
from argus.service.db import Db
root = yaml.safe_load(cfg_path.read_text(encoding="utf-8"))
v2_cfg = V2Config.from_root_dict(root)
db = Db(v2_cfg.sqlite.db_path)
db.init()
db.create_user(user_id="u1", display_name=None)
token = db.issue_token(user_id="u1")
with TestClient(app) as c:
r = c.post("/api/v2/me/sftp:reset_password", headers={"authorization": f"Bearer {token}"})
assert r.status_code == 400
def test_me_sftp_reset_password_enabled_returns_password(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg = yaml.safe_load(_write_config(tmp_path).read_text(encoding="utf-8"))
cfg["data"]["sftpgo"] = {
"enabled": True,
"host": "127.0.0.1",
"sftp_port": 2022,
"admin_api_base": "http://127.0.0.1:8081",
"admin_user": "admin",
"admin_password_env": "SFTPGO_ADMIN_PASSWORD",
}
cfg_path = tmp_path / "cfg_sftp.yaml"
cfg_path.write_text(yaml.safe_dump(cfg), encoding="utf-8")
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "token1")
monkeypatch.setenv("SFTPGO_ADMIN_PASSWORD", "pw1")
class _FakeSFTPGo:
def __init__(self, **kwargs):
self.reset = []
self.enabled = []
def reset_password(self, username: str, new_password: str, home_dir: str):
assert username
assert new_password
assert home_dir
self.reset.append((username, home_dir))
def enable_user(self, username: str, home_dir: str):
self.enabled.append((username, home_dir))
fake_client = _FakeSFTPGo()
class _FakeSFTPGoFactory:
def __call__(self, **kwargs):
return fake_client
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
monkeypatch.setattr(app_mod, "SFTPGoAdminClient", _FakeSFTPGoFactory())
app = app_mod.create_app(str(cfg_path))
# seed user in DB
from argus.service.db import Db
from argus.service.config import V2Config
v2_cfg = V2Config.from_root_dict(cfg)
db = Db(v2_cfg.sqlite.db_path)
db.init()
db.create_user(user_id="u1", display_name=None)
token = db.issue_token(user_id="u1")
with TestClient(app) as c:
r = c.post("/api/v2/me/sftp:reset_password", headers={"authorization": f"Bearer {token}"})
assert r.status_code == 200
j = r.json()
assert j["user_id"] == "u1"
assert isinstance(j["password"], str) and len(j["password"]) >= 8

View File

@ -0,0 +1,225 @@
from __future__ import annotations
from datetime import datetime, timedelta, timezone
from pathlib import Path
from argus.service.db import Db
from argus.service.janitor import JobsJanitor
def _iso_z(dt: datetime) -> str:
return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")
def _mk_job_dir(user_root: Path, user_id: str, sid: str) -> Path:
p = user_root / user_id / "jobs" / sid
p.mkdir(parents=True, exist_ok=True)
(p / "marker.txt").write_text("x", encoding="utf-8")
return p
def _mk_trash_dir(user_root: Path, user_id: str, sid: str) -> Path:
p = user_root / user_id / "trash" / "jobs" / sid
p.mkdir(parents=True, exist_ok=True)
(p / "marker.txt").write_text("x", encoding="utf-8")
return p
def test_janitor_moves_jobs_to_trash_after_3_days(tmp_path: Path) -> None:
db_path = tmp_path / "mvp.sqlite3"
user_root = tmp_path / "users"
db = Db(str(db_path))
db.init()
task_id = "t1"
user_id = "alice"
sid = "sid-a01"
db.create_task_v25(task_id=task_id, user_id=user_id, workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
ended = now - timedelta(days=4)
db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="SUCCEEDED")
src = _mk_job_dir(user_root, user_id, sid)
jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
jan.tick_once(now=now)
assert not src.exists()
dst = user_root / user_id / "trash" / "jobs" / sid
assert dst.exists()
assert (dst / "marker.txt").read_text(encoding="utf-8") == "x"
def test_janitor_purges_from_trash_after_7_days(tmp_path: Path) -> None:
db_path = tmp_path / "mvp.sqlite3"
user_root = tmp_path / "users"
db = Db(str(db_path))
db.init()
task_id = "t2"
user_id = "alice"
sid = "sid-a01"
db.create_task_v25(task_id=task_id, user_id=user_id, workload="ppo", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
ended = now - timedelta(days=8)
db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="FAILED")
_mk_trash_dir(user_root, user_id, sid)
jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
jan.tick_once(now=now)
dst = user_root / user_id / "trash" / "jobs" / sid
assert not dst.exists()
def test_janitor_does_not_touch_recent_jobs(tmp_path: Path) -> None:
db_path = tmp_path / "mvp.sqlite3"
user_root = tmp_path / "users"
db = Db(str(db_path))
db.init()
task_id = "t3"
user_id = "alice"
sid = "sid-a01"
db.create_task_v25(task_id=task_id, user_id=user_id, workload="grpo", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
ended = now - timedelta(days=1)
db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="FAILED")
src = _mk_job_dir(user_root, user_id, sid)
jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
jan.tick_once(now=now)
assert src.exists()
assert not (user_root / user_id / "trash" / "jobs" / sid).exists()
def test_janitor_skips_tasks_without_user_id(tmp_path: Path) -> None:
db_path = tmp_path / "mvp.sqlite3"
user_root = tmp_path / "users"
db = Db(str(db_path))
db.init()
task_id = "legacy"
sid = "sid-legacy"
db.create_task(task_id=task_id, workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
ended = now - timedelta(days=10)
db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="SUCCEEDED")
# Even if someone created a matching directory under user_root, janitor shouldn't touch it because user_id is NULL.
src = _mk_job_dir(user_root, "alice", sid)
jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
jan.tick_once(now=now)
assert src.exists()
def test_janitor_validates_retention_days(tmp_path: Path) -> None:
db = Db(str(tmp_path / "mvp.sqlite3"))
db.init()
try:
JobsJanitor(db=db, user_root="/tmp", trash_after_days=-1, purge_after_days=7, interval_s=1)
raise AssertionError("expected ValueError")
except ValueError:
pass
try:
JobsJanitor(db=db, user_root="/tmp", trash_after_days=3, purge_after_days=1, interval_s=1)
raise AssertionError("expected ValueError")
except ValueError:
pass
def test_janitor_noop_when_disabled(tmp_path: Path) -> None:
db = Db(str(tmp_path / "mvp.sqlite3"))
db.init()
user_root = tmp_path / "users"
jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=0, purge_after_days=0, interval_s=1)
jan.tick_once(now=datetime(2025, 1, 1, tzinfo=timezone.utc))
def test_janitor_handles_invalid_end_time_and_missing_fields(tmp_path: Path) -> None:
db_path = tmp_path / "mvp.sqlite3"
user_root = tmp_path / "users"
db = Db(str(db_path))
db.init()
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
cutoff_ended = now - timedelta(days=10)
# Missing end_time (empty string) => should be skipped.
db.create_task_v25(task_id="t4", user_id="alice", workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
db.create_attempt(task_id="t4", attempt_no=1, ray_submission_id="sid-empty")
db.update_attempt(task_id="t4", attempt_no=1, end_time="", ray_status="SUCCEEDED")
# Invalid ISO but still lexicographically <= cutoff => should be skipped.
db.create_task_v25(task_id="t5", user_id="alice", workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
db.create_attempt(task_id="t5", attempt_no=1, ray_submission_id="sid-bad")
db.update_attempt(task_id="t5", attempt_no=1, end_time="2025-01-01T00:00:00ZZ", ray_status="FAILED")
_mk_job_dir(user_root, "alice", "sid-empty")
_mk_job_dir(user_root, "alice", "sid-bad")
jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
jan.tick_once(now=cutoff_ended + timedelta(days=10))
assert (user_root / "alice" / "jobs" / "sid-empty").exists()
assert (user_root / "alice" / "jobs" / "sid-bad").exists()
def test_janitor_purge_moves_from_jobs_then_deletes(tmp_path: Path, monkeypatch) -> None:
db_path = tmp_path / "mvp.sqlite3"
user_root = tmp_path / "users"
db = Db(str(db_path))
db.init()
task_id = "t6"
user_id = "alice"
sid = "sid-a01"
db.create_task_v25(task_id=task_id, user_id=user_id, workload="ppo", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
ended = now - timedelta(days=9)
db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="SUCCEEDED")
src = _mk_job_dir(user_root, user_id, sid)
jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
jan.tick_once(now=now)
assert not src.exists()
assert not (user_root / user_id / "trash" / "jobs" / sid).exists()
def test_janitor_run_forever_requires_event_like(tmp_path: Path) -> None:
db = Db(str(tmp_path / "mvp.sqlite3"))
db.init()
jan = JobsJanitor(db=db, user_root=str(tmp_path / "users"), trash_after_days=3, purge_after_days=7, interval_s=1)
try:
jan.run_forever(object())
raise AssertionError("expected ValueError")
except ValueError:
pass
def test_janitor_run_forever_survives_tick_exceptions(tmp_path: Path, monkeypatch) -> None:
db = Db(str(tmp_path / "mvp.sqlite3"))
db.init()
jan = JobsJanitor(db=db, user_root=str(tmp_path / "users"), trash_after_days=3, purge_after_days=7, interval_s=1)
class Flag:
def __init__(self) -> None:
self.n = 0
def is_set(self) -> bool:
self.n += 1
return self.n >= 2
monkeypatch.setattr(jan, "tick_once", lambda **_: (_ for _ in ()).throw(RuntimeError("boom")))
monkeypatch.setattr("argus.service.janitor.time.sleep", lambda *_: None)
jan.run_forever(Flag())

View File

@ -38,3 +38,18 @@ def test_v2_config_requires_mappings():
V2Config.from_root_dict({"service": ["nope"]}) V2Config.from_root_dict({"service": ["nope"]})
with pytest.raises(ValueError, match="config\\.service\\.\\{api,auth,sqlite,scheduler\\} must be mappings"): with pytest.raises(ValueError, match="config\\.service\\.\\{api,auth,sqlite,scheduler\\} must be mappings"):
V2Config.from_root_dict({"service": {"api": [1], "auth": {}, "sqlite": {}, "scheduler": {}}}) V2Config.from_root_dict({"service": {"api": [1], "auth": {}, "sqlite": {}, "scheduler": {}}})
def test_v2_config_requires_data_mappings():
from argus.service.config import V2Config
base = {
"ray": {"shared_root": "/private"},
"service": {"api": {}, "auth": {}, "sqlite": {}, "scheduler": {}},
}
with pytest.raises(ValueError, match="config\\.data must be a mapping"):
V2Config.from_root_dict({**base, "data": ["nope"]})
with pytest.raises(ValueError, match="config\\.data\\.\\{sftpgo,retention\\} must be mappings"):
V2Config.from_root_dict({**base, "data": {"sftpgo": ["x"], "retention": {}}})

View File

@ -0,0 +1,322 @@
from __future__ import annotations
from pathlib import Path
import yaml
from fastapi.testclient import TestClient
def _write_config(tmp_path: Path, *, enabled: bool) -> Path:
cfg = {
"ray": {
"address": "http://127.0.0.1:8265",
"shared_root": "/private",
"entrypoint_resources": {"worker_node": 1},
"runtime_env": {"env_vars": {}},
},
"data": {
"user_root": str(tmp_path / "users"),
"sftpgo": {
"enabled": bool(enabled),
"admin_api_base": "http://127.0.0.1:8081/api/v2",
"admin_user": "admin",
"admin_password_env": "SFTPGO_ADMIN_PASSWORD",
"host": "h1.internal",
"sftp_port": 2022,
},
},
"service": {
"api": {"host": "127.0.0.1", "port": 0},
"auth": {"token_env": "MVP_INTERNAL_TOKEN"},
"sqlite": {"db_path": str(tmp_path / "mvp.sqlite3")},
"scheduler": {"tick_s": 1, "retry_interval_s": 1, "max_running_tasks": 1},
},
}
p = tmp_path / "cfg.yaml"
p.write_text(yaml.safe_dump(cfg), encoding="utf-8")
return p
def test_create_user_calls_sftpgo_when_enabled(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg_path = _write_config(tmp_path, enabled=True)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
monkeypatch.setenv("SFTPGO_ADMIN_PASSWORD", "pw1")
calls = {"create": [], "disable": [], "reset": []}
class _Client:
def __init__(self, **kwargs):
pass
def create_user(self, *, username: str, password: str, home_dir: str) -> None:
calls["create"].append((username, password, home_dir))
def enable_user(self, *, username: str, home_dir: str) -> None:
# Not used in this test, but required by app for idempotent upsert.
return None
def disable_user(self, *, username: str, home_dir: str) -> None:
calls["disable"].append(username)
def reset_password(self, *, username: str, new_password: str, home_dir: str) -> None:
calls["reset"].append((username, new_password))
monkeypatch.setattr(app_mod, "SFTPGoAdminClient", _Client)
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
r = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"})
assert r.status_code == 200
assert calls["create"]
username, password, home_dir = calls["create"][-1]
assert username == "alice"
assert password
assert home_dir.endswith("/users/alice")
r2 = c.post("/api/v2/users/alice:disable", headers=admin_headers)
assert r2.status_code == 200
assert calls["disable"] == ["alice"]
r3 = c.post("/api/v2/users/alice/sftp:reset_password", headers=admin_headers)
assert r3.status_code == 200
body = r3.json()
assert body["user_id"] == "alice"
assert body["password"]
assert calls["reset"] and calls["reset"][-1][0] == "alice"
def test_create_user_upserts_when_sftpgo_user_already_exists(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
from argus.service.sftpgo import SFTPGoError
cfg_path = _write_config(tmp_path, enabled=True)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
monkeypatch.setenv("SFTPGO_ADMIN_PASSWORD", "pw1")
calls = {"create": 0, "reset": [], "enable": []}
class _Client:
def __init__(self, **kwargs):
pass
def create_user(self, *, username: str, password: str, home_dir: str) -> None:
calls["create"] += 1
raise SFTPGoError("sftpgo http error: 409 Conflict")
def reset_password(self, *, username: str, new_password: str, home_dir: str) -> None:
calls["reset"].append((username, new_password))
def enable_user(self, *, username: str, home_dir: str) -> None:
calls["enable"].append(username)
def disable_user(self, *, username: str, home_dir: str) -> None:
return None
monkeypatch.setattr(app_mod, "SFTPGoAdminClient", _Client)
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
r = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "bob"})
assert r.status_code == 200
body = r.json()
assert body["user_id"] == "bob"
assert body.get("sftp", {}).get("password")
assert calls["create"] == 1
assert calls["reset"] and calls["reset"][-1][0] == "bob"
assert calls["enable"] == ["bob"]
def test_sftpgo_enabled_requires_admin_password_env(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg_path = _write_config(tmp_path, enabled=True)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
monkeypatch.delenv("SFTPGO_ADMIN_PASSWORD", raising=False)
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
r = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"})
assert r.status_code == 500
assert "SFTPGO_ADMIN_PASSWORD" in r.text
def test_sftpgo_admin_client_builds_requests(monkeypatch):
import json
from urllib.request import Request
from argus.service.sftpgo import SFTPGoAdminClient
import argus.service.sftpgo as mod
seen: list[Request] = []
class _Resp:
def __init__(self, body: bytes = b"ok"):
self._body = body
def __enter__(self):
return self
def __exit__(self, exc_type, exc, tb):
return False
def read(self):
return self._body
def fake_urlopen(req, timeout=0):
seen.append(req)
if req.full_url.endswith("/token"):
return _Resp(body=b'{"access_token":"t1","expires_at":0}')
# allow folder creates to be idempotent in tests
if req.full_url.endswith("/folders") and req.get_method() == "POST":
return _Resp(body=b'{"message":"created"}')
if req.full_url.endswith("/users/alice") and req.get_method() == "GET":
return _Resp(
body=b'{"username":"alice","status":1,"home_dir":"/private/users/alice","uid":0,"gid":0,"permissions":{"/":["*"]},"virtual_folders":[]}'
)
return _Resp()
monkeypatch.setattr(mod, "urlopen", fake_urlopen)
c = SFTPGoAdminClient(admin_api_base="http://sftpgo.local/api/v2", admin_user="admin", admin_password="pw", common_root="/private/common")
c.create_user(username="alice", password="p1", home_dir="/private/users/alice")
c.disable_user(username="alice", home_dir="/private/users/alice")
c.enable_user(username="alice", home_dir="/private/users/alice")
c.reset_password(username="alice", new_password="p2", home_dir="/private/users/alice")
# Each operation fetches a token, then issues a request.
# For create_user: token + ensure folders (2 POSTs) + create user
# For disable/enable/reset: token + ensure folders (2 POSTs) + GET user + PUT user
assert len(seen) == 1 + 2 + 1 + 3 * (1 + 2 + 1 + 1)
assert seen[0].full_url.endswith("/token")
assert seen[0].headers.get("Authorization", "").startswith("Basic ")
# create_user
assert seen[1].full_url.endswith("/folders")
assert seen[2].full_url.endswith("/folders")
assert seen[3].full_url.endswith("/users")
assert seen[3].headers.get("Authorization", "") == "Bearer t1"
created = json.loads(seen[3].data.decode("utf-8"))
assert created["username"] == "alice"
assert "/common" in created.get("permissions", {})
assert "/common/datasets" in created.get("permissions", {})
assert "/common/hf" in created.get("permissions", {})
# disable_user
assert seen[4].full_url.endswith("/token")
assert seen[5].full_url.endswith("/folders")
assert seen[6].full_url.endswith("/folders")
assert seen[7].full_url.endswith("/users/alice")
assert seen[7].get_method() == "GET"
assert seen[8].full_url.endswith("/users/alice")
assert seen[8].get_method() == "PUT"
# enable_user
assert seen[9].full_url.endswith("/token")
assert seen[10].full_url.endswith("/folders")
assert seen[11].full_url.endswith("/folders")
assert seen[12].full_url.endswith("/users/alice")
assert seen[12].get_method() == "GET"
assert seen[13].full_url.endswith("/users/alice")
assert seen[13].get_method() == "PUT"
# reset_password
assert seen[14].full_url.endswith("/token")
assert seen[15].full_url.endswith("/folders")
assert seen[16].full_url.endswith("/folders")
assert seen[17].full_url.endswith("/users/alice")
assert seen[17].get_method() == "GET"
assert seen[18].full_url.endswith("/users/alice")
assert seen[18].get_method() == "PUT"
def test_sftpgo_admin_client_http_error_raises(monkeypatch):
from urllib.error import HTTPError
from argus.service.sftpgo import SFTPGoAdminClient, SFTPGoError
import argus.service.sftpgo as mod
def fake_urlopen(req, timeout=0):
if req.full_url.endswith("/token"):
class _Resp:
def __enter__(self):
return self
def __exit__(self, exc_type, exc, tb):
return False
def read(self):
return b'{"access_token":"t1","expires_at":0}'
return _Resp()
raise HTTPError(req.full_url, 500, "boom", hdrs=None, fp=None)
monkeypatch.setattr(mod, "urlopen", fake_urlopen)
c = SFTPGoAdminClient(admin_api_base="http://sftpgo.local/api/v2", admin_user="admin", admin_password="pw", common_root="/private/common")
try:
c.create_user(username="alice", password="p1", home_dir="/private/users/alice")
assert False, "expected SFTPGoError"
except SFTPGoError as e:
assert "http error" in str(e)
def test_sftpgo_admin_client_url_error_raises(monkeypatch):
from urllib.error import URLError
from argus.service.sftpgo import SFTPGoAdminClient, SFTPGoError
import argus.service.sftpgo as mod
def fake_urlopen(req, timeout=0):
if req.full_url.endswith("/token"):
class _Resp:
def __enter__(self):
return self
def __exit__(self, exc_type, exc, tb):
return False
def read(self):
return b'{"access_token":"t1","expires_at":0}'
return _Resp()
raise URLError("no route")
monkeypatch.setattr(mod, "urlopen", fake_urlopen)
c = SFTPGoAdminClient(admin_api_base="http://sftpgo.local/api/v2", admin_user="admin", admin_password="pw")
try:
c.disable_user(username="alice", home_dir="/private/users/alice")
assert False, "expected SFTPGoError"
except SFTPGoError as e:
assert "connection error" in str(e)

View File

@ -0,0 +1,78 @@
from __future__ import annotations
from pathlib import Path
from fastapi.testclient import TestClient
from argus.service.app import create_app
def _write_config(tmp_path: Path) -> Path:
p = tmp_path / "cfg.yaml"
p.write_text(
"""
ray:
address: "http://127.0.0.1:8265"
shared_root: "/private"
entrypoint_num_cpus: 1
entrypoint_resources: { worker_node: 1 }
runtime_env: { env_vars: { PYTHONUNBUFFERED: "1" } }
service:
api: { host: "127.0.0.1", port: 8080 }
auth: { token_env: "MVP_INTERNAL_TOKEN" }
sqlite: { db_path: "%(db)s" }
data:
user_root: "%(users)s"
sftpgo: { enabled: false }
retention: { jobs_trash_after_days: 3, jobs_purge_after_days: 7, janitor_interval_s: 3600 }
"""
% {"db": str(tmp_path / "mvp.sqlite3"), "users": str(tmp_path / "users")}
)
return p
def test_ui_routes_render_200(tmp_path, monkeypatch):
cfg = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "admin-token")
app = create_app(str(cfg))
c = TestClient(app)
for path in (
"/ui",
"/ui/login",
"/ui/tasks",
"/ui/tasks/new",
"/ui/data",
"/ui/admin",
"/ui/tasks/any-task-id",
"/ui/tasks/any-task-id/logs",
):
r = c.get(path, allow_redirects=True)
assert r.status_code == 200
assert "<html" in r.text.lower()
def test_ui_contains_sidebar_links(tmp_path, monkeypatch):
cfg = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "admin-token")
app = create_app(str(cfg))
c = TestClient(app)
r = c.get("/ui/tasks")
assert r.status_code == 200
for link in ("/ui/tasks", "/ui/tasks/new", "/ui/data", "/ui/login", "/ui/admin"):
assert link in r.text
assert "Ray Dashboard" in r.text
def test_ui_task_detail_shows_ids(tmp_path, monkeypatch):
cfg = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "admin-token")
app = create_app(str(cfg))
c = TestClient(app)
task_id = "mvp3-ppo-20250101-000000-aaaa"
r = c.get(f"/ui/tasks/{task_id}")
assert r.status_code == 200
assert task_id in r.text
assert f"/ui/tasks/{task_id}/logs" in r.text

View File

@ -14,6 +14,11 @@ def _write_config(tmp_path: Path) -> Path:
"entrypoint_resources": {"worker_node": 1}, "entrypoint_resources": {"worker_node": 1},
"runtime_env": {"env_vars": {}}, "runtime_env": {"env_vars": {}},
}, },
"data": {
# Avoid touching real /private in tests. Keep ray.shared_root as /private
# so existing path validation tests remain unchanged.
"user_root": str(tmp_path / "users"),
},
"service": { "service": {
"api": {"host": "127.0.0.1", "port": 0}, "api": {"host": "127.0.0.1", "port": 0},
"auth": {"token_env": "MVP_INTERNAL_TOKEN"}, "auth": {"token_env": "MVP_INTERNAL_TOKEN"},
@ -59,6 +64,9 @@ def test_admin_create_user_issue_token_and_disabled_rejected(tmp_path: Path, mon
admin_headers = {"authorization": "Bearer adm1"} admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c: with TestClient(app) as c:
# list users requires admin
assert c.get("/api/v2/users", headers={"authorization": "Bearer nope"}).status_code in (401, 403)
r1 = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice", "display_name": "Alice"}) r1 = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice", "display_name": "Alice"})
assert r1.status_code == 200 assert r1.status_code == 200
assert r1.json()["user_id"] == "alice" assert r1.json()["user_id"] == "alice"
@ -68,6 +76,11 @@ def test_admin_create_user_issue_token_and_disabled_rejected(tmp_path: Path, mon
token = r2.json()["token"]
assert token
r2b = c.get("/api/v2/users", headers=admin_headers)
assert r2b.status_code == 200
users = r2b.json()["users"]
assert any(u.get("user_id") == "alice" for u in users)
# non-admin token can access regular endpoints
r3 = c.get("/api/v2/queue", headers={"authorization": f"Bearer {token}"})
assert r3.status_code == 200
@ -177,3 +190,165 @@ def test_submit_rejects_non_common_inputs(tmp_path: Path, monkeypatch):
)
assert r.status_code == 400
assert "code_path must start with /private/common/" in r.text
def test_submit_accepts_user_dataset_paths_and_local_model_paths(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg_path = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
alice_tok = c.post("/api/v2/users/alice/tokens", headers=admin_headers).json()["token"]
alice_headers = {"authorization": f"Bearer {alice_tok}"}
# User dataset paths are allowed.
r1 = c.post(
"/api/v2/tasks",
headers=alice_headers,
data=(
"workload: ppo\n"
"code_path: /private/common/code/verl\n"
"model_id: Qwen/Qwen2.5-0.5B-Instruct\n"
"train_file: /private/users/alice/datasets/t\n"
),
)
assert r1.status_code == 200
# Local model paths under user models/ are allowed (no TaskSpec schema change).
r2 = c.post(
"/api/v2/tasks",
headers=alice_headers,
data=(
"workload: ppo\n"
"code_path: /private/common/code/verl\n"
"model_id: /private/users/alice/models/m1\n"
"train_file: /private/common/datasets/t\n"
),
)
assert r2.status_code == 200
def test_submit_rejects_cross_user_paths_and_bad_local_model_dirs(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg_path = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "bob"}).status_code == 200
alice_tok = c.post("/api/v2/users/alice/tokens", headers=admin_headers).json()["token"]
bob_tok = c.post("/api/v2/users/bob/tokens", headers=admin_headers).json()["token"]
bob_headers = {"authorization": f"Bearer {bob_tok}"}
# Cross-user dataset path should be rejected.
r1 = c.post(
"/api/v2/tasks",
headers=bob_headers,
data=(
"workload: ppo\n"
"code_path: /private/common/code/verl\n"
"model_id: Qwen/Qwen2.5-0.5B-Instruct\n"
"train_file: /private/users/alice/datasets/t\n"
),
)
assert r1.status_code == 400
assert "/private/users/bob/datasets/" in r1.text
# Local model path must be under models/.
r2 = c.post(
"/api/v2/tasks",
headers=bob_headers,
data=(
"workload: ppo\n"
"code_path: /private/common/code/verl\n"
"model_id: /private/users/bob/jobs/j1/checkpoints\n"
"train_file: /private/common/datasets/t\n"
),
)
assert r2.status_code == 400
assert "model_id local path must start with" in r2.text
def test_me_returns_paths_and_retention(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg_path = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
alice_tok = c.post("/api/v2/users/alice/tokens", headers=admin_headers).json()["token"]
r = c.get("/api/v2/me", headers={"authorization": f"Bearer {alice_tok}"})
assert r.status_code == 200
obj = r.json()
assert obj["user_id"] == "alice"
assert obj["paths"]["home"].endswith("/users/alice")
assert obj["paths"]["jobs"].endswith("/users/alice/jobs")
assert obj["paths"]["trash_jobs"].endswith("/users/alice/trash/jobs")
assert obj["retention"]["jobs_trash_after_days"] == 3
assert obj["retention"]["jobs_purge_after_days"] == 7
def test_create_user_creates_user_dirs(tmp_path: Path, monkeypatch):
from argus.service import app as app_mod
cfg_path = _write_config(tmp_path)
monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
class _Scheduler:
def __init__(self, **kwargs):
self.tool = object()
def run_forever(self, stop_flag):
return None
monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
app = app_mod.create_app(str(cfg_path))
admin_headers = {"authorization": "Bearer adm1"}
with TestClient(app) as c:
assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
base = tmp_path / "users" / "alice"
assert (base / "datasets").is_dir()
assert (base / "models").is_dir()
assert (base / "code").is_dir()
assert (base / "jobs").is_dir()
assert (base / "trash" / "jobs").is_dir()

View File

@ -9,4 +9,4 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/lib.sh" source "${SCRIPT_DIR}/lib.sh"
dexec "${HEAD_CONTAINER}" bash -lc "python3 -m pip install -U pip >/dev/null 2>&1 || true" dexec "${HEAD_CONTAINER}" bash -lc "python3 -m pip install -U pip >/dev/null 2>&1 || true"
dexec "${HEAD_CONTAINER}" bash -lc "python3 -m pip install -r /workspace/mvp/py/requirements.txt" dexec "${HEAD_CONTAINER}" bash -lc "python3 -c 'import fastapi,uvicorn,yaml' >/dev/null 2>&1 && echo 'api_deps_ok: skip' || (python3 -m pip install -r /workspace/mvp/py/requirements.txt || echo 'WARN: api deps install failed; continuing with preinstalled deps')"

View File

@ -22,7 +22,12 @@ if [[ -z "${MVP_INTERNAL_TOKEN:-}" ]]; then
exit 1
fi
-docker exec -d -e MVP_INTERNAL_TOKEN="${MVP_INTERNAL_TOKEN}" "${HEAD_CONTAINER}" bash -lc "nohup python3 /workspace/mvp/py/server.py --config '${CONFIG_IN_CONTAINER}' >>'${LOG_PATH}' 2>&1 & echo \$! >'${PID_PATH}'"
+env_args=(-e "MVP_INTERNAL_TOKEN=${MVP_INTERNAL_TOKEN}")
if [[ -n "${SFTPGO_ADMIN_PASSWORD:-}" ]]; then
env_args+=(-e "SFTPGO_ADMIN_PASSWORD=${SFTPGO_ADMIN_PASSWORD}")
fi
docker exec -d "${env_args[@]}" "${HEAD_CONTAINER}" bash -lc "nohup python3 /workspace/mvp/py/server.py --config '${CONFIG_IN_CONTAINER}' >>'${LOG_PATH}' 2>&1 & echo \$! >'${PID_PATH}'"
echo "[host] started; pid stored in ${PID_PATH} (container path)" echo "[host] started; pid stored in ${PID_PATH} (container path)"
echo "[host] logs: ${LOG_PATH} (container path)" echo "[host] logs: ${LOG_PATH} (container path)"

View File

@ -0,0 +1,242 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=lib.sh
source "${SCRIPT_DIR}/lib.sh"
# E2E v3.0:
# - Start Ray (head + stateless workers) + SFTPGo (compose)
# - Start API server with v3.0 config
# - Create user (API triggers SFTPGo user create) and return one-time SFTP password
# - (Optional) verify SFTP login
# - Submit PPO/GRPO/SFT referencing user dataset paths and wait
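#
# Typical invocation (values illustrative; any variable left unset falls back to the defaults below):
#   MVP_INTERNAL_TOKEN=my-dev-token USER_ID=alice RESET_DB=1 RESET_SFTPGO=0 ./run_all_v30_api.sh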
API_ADDR="${API_ADDR:-http://127.0.0.1:8080}"
ADMIN_TOKEN="${MVP_INTERNAL_TOKEN:-}"
USER_ID="${USER_ID:-alice}"
RESET_DB="${RESET_DB:-1}"
RESET_SFTPGO="${RESET_SFTPGO:-0}"
EXPECTED_RAY_NODES="${EXPECTED_RAY_NODES:-3}" # head + 2 workers
CLUSTER_NAME="${CLUSTER_NAME:-argus-ray}"
CONFIG_IN_CONTAINER="${CONFIG_IN_CONTAINER:-/workspace/mvp/configs/dev_v30.yaml}"
SFTPGO_ADMIN_PASSWORD="${SFTPGO_ADMIN_PASSWORD:-my-dev-sftpgo-admin}"
export SFTPGO_ADMIN_PASSWORD
if [[ -z "${ADMIN_TOKEN}" ]]; then
echo "ERROR: MVP_INTERNAL_TOKEN must be set in host env (admin token)" >&2
exit 1
fi
api_curl_admin() {
curl -sS -H "Authorization: Bearer ${ADMIN_TOKEN}" "$@"
}
api_wait_ready() {
local tries="${1:-60}"
for i in $(seq 1 "${tries}"); do
if curl -sS -m 2 "${API_ADDR}/docs" >/dev/null 2>&1; then
echo "[host] api_ready: ${API_ADDR}"
return 0
fi
echo "[host] waiting api... (${i}/${tries})"
sleep 2
done
echo "ERROR: api not ready: ${API_ADDR}" >&2
return 1
}
sftpgo_wait_ready() {
local tries="${1:-60}"
local url="${2:-http://127.0.0.1:8081/api/v2/token}"
for i in $(seq 1 "${tries}"); do
# SFTPGo admin endpoints require auth; readiness means "HTTP reachable and can issue token".
if curl -sS -m 2 -u "admin:${SFTPGO_ADMIN_PASSWORD}" "${url}" >/dev/null 2>&1; then
echo "[host] sftpgo_ready: ${url} (token ok)"
return 0
fi
echo "[host] waiting sftpgo... (${i}/${tries})"
sleep 2
done
echo "ERROR: sftpgo not ready: ${url}" >&2
return 1
}
ray_wait_ready() {
local tries="${1:-60}"
for i in $(seq 1 "${tries}"); do
if curl -sS -m 2 "${RAY_DASHBOARD_ADDR}/api/version" >/dev/null 2>&1; then
echo "[host] ray_dashboard_ready: ${RAY_DASHBOARD_ADDR}"
return 0
fi
echo "[host] waiting ray dashboard... (${i}/${tries})"
sleep 2
done
echo "ERROR: ray dashboard not ready: ${RAY_DASHBOARD_ADDR}" >&2
return 1
}
ray_wait_nodes() {
local want="${1:-3}"
local tries="${2:-60}"
for i in $(seq 1 "${tries}"); do
local out n
out="$(docker exec -i "${HEAD_CONTAINER}" python3 -c "import ray; ray.init(address='auto', ignore_reinit_error=True, log_to_driver=False, logging_level='ERROR'); print(sum(1 for n in ray.nodes() if n.get('Alive')))" 2>/dev/null || true)"
n="$(printf '%s\n' "${out}" | tail -n 1 | tr -cd '0-9' || true)"
if [[ "${n}" =~ ^[0-9]+$ ]]; then
echo "[host] ray_nodes_alive=${n} (want>=${want})"
if [[ "${n}" -ge "${want}" ]]; then
return 0
fi
else
echo "[host] waiting ray nodes... (${i}/${tries})"
fi
sleep 2
done
echo "ERROR: ray nodes not ready (want>=${want})" >&2
docker exec -i "${HEAD_CONTAINER}" bash -lc "ray status || true" >&2 || true
return 1
}
submit_taskspec_inline() {
local token="$1"
local yaml_body="$2"
local resp
resp="$(curl -sS -H "Authorization: Bearer ${token}" -H "Content-Type: application/yaml" --data-binary "${yaml_body}" "${API_ADDR}/api/v2/tasks")"
echo "[host] submit_resp: ${resp}" >&2
printf '%s' "${resp}" | python3 -c 'import sys,json; print(json.load(sys.stdin)["task_id"])'
}
wait_task() {
local token="$1"
local task_id="$2"
while true; do
local body state
body="$(curl -sS -H "Authorization: Bearer ${token}" "${API_ADDR}/api/v2/tasks/${task_id}")"
state="$(printf '%s' "${body}" | python3 -c 'import sys,json; print(json.load(sys.stdin)["state"])')"
echo "[host] task ${task_id}: ${state}"
if [[ "${state}" == "SUCCEEDED" ]]; then
return 0
fi
if [[ "${state}" == "FAILED" || "${state}" == "CANCELED" ]]; then
echo "[host] terminal=${state}; tail logs (best-effort):" >&2
curl -sS -H "Authorization: Bearer ${token}" "${API_ADDR}/api/v2/tasks/${task_id}/logs?tail=200" >&2 || true
return 1
fi
sleep 10
done
}
echo "[host] ===== run_all_v30_api.sh begin ====="
"${SCRIPT_DIR}/00_prereq_check.sh"
"${SCRIPT_DIR}/03_cleanup_v1_legacy.sh"
"${SCRIPT_DIR}/04_cleanup_v2_legacy.sh"
echo "[host] bring down existing containers (best-effort)"
"${SCRIPT_DIR}/02_down.sh" || true
if [[ "${RESET_SFTPGO}" == "1" ]]; then
echo "[host] reset sftpgo metadata dir (best-effort, via helper container)"
SFTPGO_META_DIR="${ROOT_DIR}/../../shared/common/sftpgo"
mkdir -p "${SFTPGO_META_DIR}"
docker run --rm --entrypoint sh -u 0:0 -v "${SFTPGO_META_DIR}:/mnt" drakkan/sftpgo:latest -lc "rm -rf /mnt/* || true"
fi
echo "[host] (re)create containers (Ray + SFTPGo)"
"${SCRIPT_DIR}/01_up.sh"
echo "[host] wait ray head ready"
ray_wait_ready 60
echo "[host] wait sftpgo ready"
sftpgo_wait_ready 60 "http://127.0.0.1:8081/api/v2/token"
echo "[host] render v3.0 config with SFTPGo container IP (work around docker DNS issues)"
SFTPGO_IP="$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' argus-sftpgo)"
RENDERED_CFG_HOST_PATH="/tmp/dev_v30_rendered.yaml"
sed -E "s#^(\\s*admin_api_base:) .*#\\1 \"http://${SFTPGO_IP}:8080/api/v2\"#g" "${ROOT_DIR}/configs/dev_v30.yaml" >"${RENDERED_CFG_HOST_PATH}"
docker cp "${RENDERED_CFG_HOST_PATH}" "${HEAD_CONTAINER}:/tmp/dev_v30_rendered.yaml"
CONFIG_IN_CONTAINER="/tmp/dev_v30_rendered.yaml"
echo "[host] verify head discovery record (supervised in-container)"
HEAD_IP_FILE="${SHARED_ROOT}/ray/discovery/${CLUSTER_NAME}/head.json"
dexec "${HEAD_CONTAINER}" bash -lc "test -f '${HEAD_IP_FILE}' && python3 -c 'import json,sys; print(json.load(open(sys.argv[1]))[\"job_server_url\"])' '${HEAD_IP_FILE}' || true"
echo "[host] wait workers join"
ray_wait_nodes "${EXPECTED_RAY_NODES}" 120
echo "[host] prepare data/model (idempotent; reuse cache)"
"${SCRIPT_DIR}/30_prepare_data_and_model.sh"
echo "[host] install api deps in head container"
"${SCRIPT_DIR}/12_install_api_deps.sh"
echo "[host] stop api (best-effort)"
"${SCRIPT_DIR}/61_stop_api.sh" || true
if [[ "${RESET_DB}" == "1" ]]; then
echo "[host] reset api sqlite db in container (best-effort)"
docker exec -i "${HEAD_CONTAINER}" bash -lc "rm -f /private/common/db/mvp.sqlite3 /private/common/db/mvp.sqlite3-wal /private/common/db/mvp.sqlite3-shm || true"
else
echo "[host] keep existing api sqlite db (RESET_DB=${RESET_DB})"
fi
echo "[host] start api (admin token + sftpgo admin password via env)"
MVP_INTERNAL_TOKEN="${ADMIN_TOKEN}" CONFIG_IN_CONTAINER="${CONFIG_IN_CONTAINER}" SFTPGO_ADMIN_PASSWORD="${SFTPGO_ADMIN_PASSWORD}" "${SCRIPT_DIR}/60_start_api.sh"
api_wait_ready 60
echo "[host] create user (expect SFTP one-time password in response)"
create_resp="$(api_curl_admin -H "Content-Type: application/json" -d "{\"user_id\":\"${USER_ID}\"}" "${API_ADDR}/api/v2/users")"
echo "[host] create_user_resp: ${create_resp}"
USER_SFTP_PASSWORD="$(printf '%s' "${create_resp}" | python3 -c 'import sys,json; o=json.load(sys.stdin); print((o.get("sftp") or {}).get("password") or "")')"
if [[ -z "${USER_SFTP_PASSWORD}" ]]; then
echo "ERROR: expected sftp.password in create user response (is data.sftpgo.enabled=true?)" >&2
exit 1
fi
echo "[host] issue user token"
USER_TOKEN="$(api_curl_admin -X POST "${API_ADDR}/api/v2/users/${USER_ID}/tokens" | python3 -c 'import sys,json; print(json.load(sys.stdin)["token"])')"
echo "[host] user_token_issued: user=${USER_ID}"
echo "[host] (optional) verify SFTP login (best-effort)"
if command -v sshpass >/dev/null 2>&1 && command -v sftp >/dev/null 2>&1; then
tmp_batch="$(mktemp)"
cat >"${tmp_batch}" <<EOF
pwd
ls
EOF
# NOTE: this just proves auth works; dataset upload can be done via SFTP manually later.
sshpass -p "${USER_SFTP_PASSWORD}" sftp -o StrictHostKeyChecking=no -P 2022 "${USER_ID}@127.0.0.1" -b "${tmp_batch}" >/dev/null 2>&1 || true
rm -f "${tmp_batch}" || true
else
echo "[host] skip: sshpass/sftp not found; you can test manually with: sftp -P 2022 ${USER_ID}@<host>"
fi
echo "[host] ensure user dataset paths exist (copy from common if needed; best-effort)"
echo "[host] copy dataset into user datasets path inside head container (avoid host permission issues)"
dexec "${HEAD_CONTAINER}" bash -lc "set -euo pipefail; \
mkdir -p '/private/users/${USER_ID}/datasets/gsm8k' '/private/users/${USER_ID}/datasets/gsm8k_sft'; \
(cp -f /private/common/datasets/gsm8k/train.parquet '/private/users/${USER_ID}/datasets/gsm8k/train.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k/train.parquet '/private/users/${USER_ID}/datasets/gsm8k/train.parquet' 2>/dev/null || true); \
(cp -f /private/common/datasets/gsm8k/test.parquet '/private/users/${USER_ID}/datasets/gsm8k/test.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k/test.parquet '/private/users/${USER_ID}/datasets/gsm8k/test.parquet' 2>/dev/null || true); \
(cp -f /private/common/datasets/gsm8k_sft/train.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/train.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k_sft/train.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/train.parquet' 2>/dev/null || true); \
(cp -f /private/common/datasets/gsm8k_sft/test.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/test.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k_sft/test.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/test.parquet' 2>/dev/null || true)"
echo "[host] submit PPO/GRPO/SFT via API using user dataset paths"
PPO_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: ppo\nnnodes: 2\nn_gpus_per_node: 4\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/test.parquet\nmodel_id: Qwen/Qwen2.5-0.5B-Instruct\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')"
GRPO_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: grpo\nnnodes: 2\nn_gpus_per_node: 4\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/test.parquet\nmodel_id: Qwen/Qwen2.5-0.5B-Instruct\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')"
SFT_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: sft\nnnodes: 1\nn_gpus_per_node: 1\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k_sft/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k_sft/test.parquet\nmodel_id: Qwen/Qwen2.5-0.5B-Instruct\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')"
echo "[host] submitted task ids:"
echo " ppo=${PPO_TASK_ID}"
echo " grpo=${GRPO_TASK_ID}"
echo " sft=${SFT_TASK_ID}"
echo "[host] wait for tasks (in submission order)"
wait_task "${USER_TOKEN}" "${PPO_TASK_ID}"
wait_task "${USER_TOKEN}" "${GRPO_TASK_ID}"
wait_task "${USER_TOKEN}" "${SFT_TASK_ID}"
echo "[host] ===== run_all_v30_api.sh done ====="

View File

@ -0,0 +1,19 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Minimal E2E runner for v3.0:
# - Happy path: Ray + SFTPGo + API + 3 workloads
# - Leaves retention validation as a manual follow-up (adjust thresholds).
ADMIN_TOKEN="${MVP_INTERNAL_TOKEN:-my-dev-token}"
USER_ID="${USER_ID:-alice}"
echo "[case] HP-1: run_all_v30_api.sh (Ray + SFTPGo + API + 3 workloads)"
MVP_INTERNAL_TOKEN="${ADMIN_TOKEN}" USER_ID="${USER_ID}" RESET_DB=1 RESET_SFTPGO=0 "${SCRIPT_DIR}/run_all_v30_api.sh"
echo "[case] NOTE: retention validation (C2) is manual:"
echo " - set data.retention.jobs_trash_after_days / jobs_purge_after_days to small values in configs/dev_v30.yaml"
echo " - restart API server and wait for janitor"