diff --git a/specs/mvp/image/Snipaste_2025-12-30_16-06-37.png b/specs/mvp/image/Snipaste_2025-12-30_16-06-37.png new file mode 100644 index 0000000..fb62750 Binary files /dev/null and b/specs/mvp/image/Snipaste_2025-12-30_16-06-37.png differ diff --git a/specs/mvp/image/Snipaste_2025-12-30_17-02-20.png b/specs/mvp/image/Snipaste_2025-12-30_17-02-20.png new file mode 100644 index 0000000..4005c93 Binary files /dev/null and b/specs/mvp/image/Snipaste_2025-12-30_17-02-20.png differ diff --git a/specs/mvp/image/roadmap_v3.0.png b/specs/mvp/image/roadmap_v3.0.png new file mode 100644 index 0000000..8424345 Binary files /dev/null and b/specs/mvp/image/roadmap_v3.0.png differ diff --git a/specs/mvp/v3.0/README.md b/specs/mvp/v3.0/README.md new file mode 100644 index 0000000..b338179 --- /dev/null +++ b/specs/mvp/v3.0/README.md @@ -0,0 +1,15 @@ +# MVP v3.0(Design)— WebUI + 用户数据上传/下载(SFTPGo)→ 首个可发布版本 + +本目录基于: +- `specs/mvp/mvp_roadmap_v2.md`(总体路线图) +- `specs/mvp/image/roadmap_v3.0.png`(v3.0 迭代图) +- 当前已落地的 v2.5(User Mgmt + Stateless Ray Node Pool) + +目标是在 v2.5 的基础上补齐 **用户数据闭环**(上传→训练可见→产物下载)以及最小可用的 **WebUI**,形成“可发布”的 v3.0 版本。 + +文档: +- `specs/mvp/v3.0/v3.0_design.md`:总体架构与关键机制(WebUI、SFTPGo、数据/权限模型、任务流)。 +- `specs/mvp/v3.0/v3.0_api.md`:v3.0 API 扩展设计(UI、数据、SFTPGo 管理、权限约束)。 +- `specs/mvp/v3.0/v3.0_acceptance.md`:部署/升级/验收流程与可验证标准(含故障注入与回归清单)。 +- `specs/mvp/v3.0/v3.0_dev_plan.md`:TDD 驱动的工程化开发计划(里程碑拆分、测试分层、E2E 验收)。 +- `specs/mvp/v3.0/v3.0_progress.md`:实施进展记录(每个里程碑完成后追加记录)。 diff --git a/specs/mvp/v3.0/v3.0_acceptance.md b/specs/mvp/v3.0/v3.0_acceptance.md new file mode 100644 index 0000000..fe710e5 --- /dev/null +++ b/specs/mvp/v3.0/v3.0_acceptance.md @@ -0,0 +1,55 @@ +# MVP v3.0 — 部署与验收流程(草案) + +## 0) 环境前提 +- Ray 集群:延续 v2.5 的 head + stateless worker(自动 join) +- 共享存储:容器内挂载 `/private`(dev/prod 对齐) +- API server:宿主机代码挂载到 head 容器,在 head 容器内启动 +- 新增:SFTPGo 服务(建议容器化部署) + +## 1) 部署步骤(高层) + +1) 部署/升级 Ray 节点镜像(沿用 v2.5 的 `argus/argus-ray-node:v2.5` 或更高版本) +2) 启动 Ray 集群(compose 或平台创建容器) +3) 
启动/配置 SFTPGo(挂载 `/private`)
+4) 启动 API server(head 容器内)
+5) 启动 WebUI(由 API server 托管)
+
+## 2) 验收用例(必须通过)
+
+### A. 用户与凭据
+1) admin 创建用户 `alice`,签发 API token
+2) 系统联动在 SFTPGo 创建 `alice`(home=/private/users/alice)
+3) `alice` 使用 token 登录 WebUI(或调用 `/api/v2/me` 成功)
+
+### B. 上传数据闭环(核心)
+1) `alice` 通过 SFTP 上传数据集到 `/private/users/alice/datasets/...`
+2) `alice` 通过 WebUI/API 提交任务,TaskSpec 引用该路径
+3) Ray worker 读取该数据,任务 RUNNING 并最终 SUCCEEDED
+
+### C. 下载产物闭环
+1) 训练完成后,产物落到 `/private/users/alice/jobs/<task_id>/...`
+2) `alice` 通过 SFTP 下载 checkpoints/logs 成功
+3) (新增)`alice` 将需要长期保留的权重从 `jobs/<task_id>/...` 移动到 `models/`,确认移动后可长期存在
+
+### C2. Jobs 回收站与自动清理(3 天移入回收站,7 天后永久删除)
+1) 将 `jobs_trash_after_days`/`jobs_purge_after_days` 配置为较小值(例如分钟级,用于验证)
+2) 训练完成进入 terminal 状态
+3) 等待 API server 内置 janitor 扫描周期后,确认对应 `jobs/` 被移动到 `trash/jobs/`
+4) 在回收站窗口内,把某个文件从 `trash/jobs/` 移动到 `models/`,确认移动成功
+5) 等待超过 `jobs_purge_after_days` 后,确认 `trash/jobs/` 被永久删除
+6) 确认已移动到 `models/` 的文件不被删除
+
+### D. 安全隔离(必须)
+1) `bob` 不能通过 API 查询 `alice` 的 task(404)
+2) `bob` 不能提交引用 `/private/users/alice/...` 的 TaskSpec(400/403)
+3) `bob` 通过 SFTP 无法访问 `/private/users/alice/...`(chroot 生效)
+
+## 3) 故障注入(推荐通过)
+1) kill worker watchdog 或 raylet → worker 自动恢复并重新加入集群
+2) 重启 head 容器 → head 重新写 `head.json`,worker 自动重连
+3) SFTPGo 重启 → 不影响 Ray 集群;用户可重新连接上传/下载
+
+## 4) 回归清单(与 v2.5 一致)
+- 任务队列、重试(INSUFFICIENT_RESOURCES → PENDING_RESOURCES → retry)
+- PPO/GRPO/SFT 三种 workload 均可跑通
+- head 不跑训练(driver 强制落 worker)
diff --git a/specs/mvp/v3.0/v3.0_api.md b/specs/mvp/v3.0/v3.0_api.md
new file mode 100644
index 0000000..67f113b
--- /dev/null
+++ b/specs/mvp/v3.0/v3.0_api.md
@@ -0,0 +1,109 @@
+# MVP v3.0 — API 扩展设计(基于 v2.5)
+
+v3.0 的原则是:**尽量复用 v2.5 API**,只增量增加 “数据闭环” 与 “WebUI 支持” 所需的最小接口。
+
+## 1) 认证与权限
+
+沿用 v2.5:
+- Header:`Authorization: Bearer <token>`
+- admin token:来自 `MVP_INTERNAL_TOKEN`
+- 普通用户 token:由 admin 颁发并持久化在 SQLite
+
+权限规则:
+- 非 admin:只能访问自己的 task、自己的数据空间(`/private/users/<user_id>/...`)。
+- 跨用户访问返回 404(不泄露存在性)。
+
+## 2) 用户与 SFTPGo 联动(管理员接口)
+
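本节的联动动作(创建 SFTPGo 用户、设置 home/chroot)由 API server 调用 SFTPGo admin API 完成;SFTPGo 的 admin token 可先通过其 `GET /api/v2/token`(BasicAuth)获取。请求构造可以用下面的 Python 草图表示(仅为示意:`build_sftpgo_user_payload` / `build_create_user_request` 为本文自拟的辅助函数,字段取 SFTPGo 用户对象的最小子集,实际字段以 SFTPGo OpenAPI 为准):

```python
import json
import urllib.request


def build_sftpgo_user_payload(user_id: str, password: str,
                              user_root: str = "/private/users") -> dict:
    # 最小字段集(示意):home_dir 即 chroot 目录,permissions 默认 home 内可写
    return {
        "username": user_id,
        "password": password,          # 一次性密码,明文只返回一次
        "status": 1,                   # 1 = enabled
        "home_dir": f"{user_root}/{user_id}",
        "permissions": {"/": ["*"]},
    }


def build_create_user_request(admin_api_base: str, admin_token: str,
                              payload: dict) -> urllib.request.Request:
    # admin_token 先通过 GET {admin_api_base}/token(BasicAuth)换取
    return urllib.request.Request(
        url=f"{admin_api_base}/users",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

实际实现中该逻辑封装在 `SFTPGoAdminClient`(`urllib` 实现);SFTPGo 不在线时,联动应按 best-effort 处理,不阻塞 API 核心功能。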
+### 2.1 创建用户(复用 v2.5)
+`POST /api/v2/users`
+- v3.0 行为:成功后,**可选**联动创建 SFTPGo 用户
+  - v3.0 默认启用联动:创建 SFTPGo 用户 + 生成一次性密码(password 认证)
+  - v3.0 仅保留该方案(方案 A):不做外部认证/SSO 集成(留到更后续版本)
+  - `data.sftpgo.admin_api_base` 推荐形如:`http://argus-sftpgo:8080/api/v2`(包含 `/api/v2` 前缀)
+
+### 2.2 下发 token(复用 v2.5)
+`POST /api/v2/users/{user_id}/tokens`
+
+### 2.3 禁用用户(复用 v2.5)
+`POST /api/v2/users/{user_id}:disable`
+- v3.0 行为:联动禁用 SFTPGo 用户(可选)
+
+### 2.4 SFTP 凭据管理(新增,管理员或用户自助)
+(具体由你确认 v3.0 需要“用户自助”还是“管理员操作”)
+
+#### 重置 SFTP 密码(管理员)
+`POST /api/v2/users/{user_id}/sftp:reset_password`
+- 返回:一次性密码(只返回一次,服务端不保存明文)
+> v3.0 先只做 password 方案;SSH public key 作为后续版本可选增强(不在 v3.0 范围)。
+
+## 3) 用户自助信息(新增)
+
+### 3.1 获取当前用户信息
+`GET /api/v2/me`
+- 返回示例:
+```json
+{
+  "user_id": "alice",
+  "is_admin": false,
+  "paths": {
+    "home": "/private/users/alice",
+    "datasets": "/private/users/alice/datasets",
+    "models": "/private/users/alice/models",
+    "code": "/private/users/alice/code",
+    "jobs": "/private/users/alice/jobs",
+    "trash_jobs": "/private/users/alice/trash/jobs"
+  },
+  "retention": {
+    "jobs_trash_after_days": 3,
+    "jobs_purge_after_days": 7
+  },
+  "sftp": {
+    "host": "h1.example.internal",
+    "port": 2022,
+    "username": "alice"
+  }
+}
+```
+
+### 3.2 Jobs Retention 提示(新增)
+为了支撑 WebUI 展示与用户预期管理,可在 `/api/v2/me` 或单独接口返回:
+- `jobs_trash_after_days`:默认 3
+- `jobs_purge_after_days`:默认 7
+- `jobs_root`:`/private/users/<user_id>/jobs`
+- `trash_jobs_root`:`/private/users/<user_id>/trash/jobs`
+- `recommendations`:提示用户把需要长期保存的产物移动到 `models/` 或 `datasets/`
+
+## 4) 数据浏览/下载(可选,v3.0 最小化)
+
+说明:上传/下载主通道仍是 SFTP。
+WebUI 如果要提供“快速浏览/查看”,可实现只读接口(避免实现大文件上传/断点等复杂逻辑)。
+
+### 4.1 列目录
+`GET /api/v2/files?path=/private/users/alice`
+- 权限:path 必须在 `/private/common/` 或 `/private/users/<user_id>/` 下
+- 返回:文件列表(name/type/size/mtime)
+
+### 4.2 下载文件(小文件为主)
+`GET /api/v2/files:download?path=/private/users/alice/jobs/.../logs/...`
+- 返回:流式下载
+- 大文件仍建议走 SFTP
+
+## 5) TaskSpec 路径校验升级(v3.0 关键)
+
+v2.5:仅允许 `/private/common/...`
+v3.0:允许 
`/private/common/...` 与 `/private/users//...` + +应用字段(至少): +- `train_file` / `val_file` +- `code_path`:仍仅允许 `/private/common/...`(v3.0 不支持执行用户 code) +- 本地模型路径字段(如果引入):允许 `/private/users//models/...` + +## 6) WebUI 路由(新增) + +由 API server 托管: +- `GET /ui`:主页面 +- `GET /ui/login`:token 登录页 +- 静态资源:`/ui/static/...` + +WebUI 的所有操作均调用同源 API(不额外开 CORS)。 diff --git a/specs/mvp/v3.0/v3.0_design.md b/specs/mvp/v3.0/v3.0_design.md new file mode 100644 index 0000000..4edf84b --- /dev/null +++ b/specs/mvp/v3.0/v3.0_design.md @@ -0,0 +1,358 @@ +# MVP v3.0 详细设计方案(基于 v2.5) + +## 0. 结论摘要(v3.0 要交付什么) + +v3.0 = v2.5 + **WebUI** + **用户数据上传/下载(SFTPGo)**,形成第一个可对外发布的版本: +- 用户可以通过 **SFTP** 上传数据/模型/代码(至少数据),落到 GPFS(容器内 `/private`)并对 Ray worker 可见。 +- 用户可以通过 API/WebUI 提交训练任务,任务读取自己上传的数据。 +- 用户可以下载训练产物(checkpoints/logs 等),最小闭环跑通。 + +## 1. 范围与原则 + +### 1.1 继承 v2.5 的前提(不回退) +- **Stateless Ray Node Pool**:head 写 `head.json`,worker watchdog 自动 join/自愈。 +- **User Management**:token 鉴权、任务可见性隔离(跨用户 404 不泄漏)。 +- **作业产物隔离**:Ray job 目录落到 `/private/users//jobs//...`。 +- **API server 短期运行方式**:代码在宿主机,挂载到 head 容器,在 head 容器内启动(保持现状)。 + +### 1.2 v3.0 新增目标 +1) **Data Management(SFTPGo)** + - 提供用户上传/下载入口(SFTP 为主)。 + - 数据落到 GPFS(dev 环境 NFS/GPFS,生产环境 GPFS),训练 job 在 worker 容器内可直接读取。 +2) **WebUI** + - 用户可视化创建任务、查看队列/状态/日志、查看“数据路径约定”和自己的 SFTP 信息。 + - 目标是 “可用而非豪华”,支持核心工作流。 +3) **权限闭环** + - 用户只能使用自己目录下的数据(`/private/users//...`)或公共目录(`/private/common/...`)。 + - 防止用户提交任务读取其他用户的文件路径。 + +### 1.3 v3.0 明确不做(留给 v3.5) +- 不做 “自定义 reward function / 自定义 verl 代码 / 多版本 verl 共存”(路线图 v3.5)。 +- 不做复杂 Serving/训推一体(路线图 v3.5)。 +- 不做 IB 网络/拓扑优化(路线图 v3.5)。 +- 不做系统级可观测性平台(路线图 v4.0)。 + +## 2. 
架构概览
+
+参考 `roadmap_v3.0.png`,v3.0 的控制面与数据面:
+
+### 2.1 控制面(Control Plane)
+- **API Server(FastAPI)**
+  - v2.5 的任务队列/调度/重试 + 用户管理能力继续复用
+  - 新增:数据管理能力(与 SFTPGo 对接)+ WebUI
+- **WebUI**
+  - 通过 API 使用 token 登录
+  - 提供任务/日志/数据入口(不直接运行训练)
+- **Ray Head(状态节点)**
+  - 仍在 head 容器内(或单独节点)
+  - job server/dashboard 提供 job submit/status/logs 能力
+
+### 2.2 数据面(Data Plane)
+- **GPFS(容器内挂载 `/private`)**
+  - 存放 common 与 users 两大根目录
+- **Ray Worker Node(无状态)**
+  - 自动连接 head,执行训练
+  - 读取 `/private/users/<user_id>/...` 的数据
+
+### 2.3 新增组件:SFTPGo(Data Management)
+- 作为独立服务运行(容器化优先),后端存储使用 **filesystem**(GPFS 挂载路径)。
+- 用户的 home directory 指向 `/private/users/<user_id>`(或其子目录)。
+
+## 3. 存储与目录规范(v3.0 统一约定)
+
+### 3.1 目录层级
+统一以容器内 `/private` 作为根路径(dev/prod 对齐):
+- `/private/common/`:公共资源
+  - `hf/`:HF cache
+  - `datasets/`:公共数据集(可选)
+  - `code/`:公共代码(例如公共 verl repo snapshot)
+  - `db/`:SQLite(队列、用户、token)
+  - `logs/`:API/supervisor/watchdog 日志
+- `/private/users/<user_id>/`:用户空间(v3.0 重点)
+  - `datasets/`:用户上传的数据集(推荐)
+  - `models/`:用户保存/上传的本地模型(允许;也用于“把 job 产物移动到长期保存目录”)
+  - `code/`:用户上传的代码(v3.0 **不支持执行**;仅存放/下载)
+  - `jobs/`:训练任务产物(已在 v2.5 落地)
+  - `tmp/`:临时文件(可选)
+
+### 3.2 Jobs Retention(两段式:3 天移入回收站,7 天后永久删除)
+v3.0 引入 **jobs 目录两段式保留策略**:
+- 第 1 阶段(soft-delete):job 结束后 **3 天**,将该 job 目录从 `jobs/` **移动到用户回收目录**;
+- 第 2 阶段(hard-delete):进入回收目录后再过 **7 天**,从回收目录 **永久删除**。
+
+目录约定(建议):
+- jobs 根目录:`/private/users/<user_id>/jobs/<task_id>/...`
+- 回收目录:`/private/users/<user_id>/trash/jobs/<task_id>/...`
+
+计时规则:
+- 以 job 进入 terminal 状态(SUCCEEDED/FAILED/CANCELED)的结束时间为起点;
+- “3 天”用于从 `jobs/` 移入 `trash/jobs/`;
+- “7 天”用于从 `trash/jobs/` 永久删除(即总共最多 10 天窗口)。
+
+用户保留关键产物的方式(无需 keep 标记):
+- 在 “3 天窗口”内把需要长期保存的文件从 `jobs/<task_id>/...` **移动/复制**到 `models/`(例如权重)或 `datasets/`(例如评估输出数据);
+- 即便已被移动到回收目录,用户仍可在 “7 天窗口”内从 `trash/jobs/<task_id>/...` 把需要的文件移到 `models/` / `datasets/`;
+- janitor 只管理 `jobs/` 与 `trash/jobs/`,不会触碰 `models/` 与 `datasets/`。
+
+这里的“清理程序”我们称为 **janitor**:
+- 定义:一个后台清理执行器,按固定周期扫描“已结束且已过期”的 job 目录并删除
+- v3.0 目标:实现“3 天移入回收站 + 7 天后删除”这一条产品规则(不提供 keep/延长保留标记)
+
+实现建议(按你的偏好):
+- 
**janitor 作为 API server 内置后台线程**运行: + - 优点:天然可访问 SQLite(任务状态、结束时间、user_id、ray_submission_id),并能把清理结果写回 events 表用于审计 + - 部署更简单:不额外引入 cronjob/独立服务 +- 删除/移动动作建议 **直接在 GPFS/NFS 文件系统上操作**(API server 运行在 head 容器,已挂载 `/private`): + - 第 1 阶段:`os.rename`(同文件系统原子移动)把 `jobs/` 移到 `trash/jobs/`; + - 若跨文件系统(理论上不应发生),则降级为 copy+delete; + - 移动前做严格路径前缀校验(必须在 `.../users//jobs/` 下)。 + - 第 2 阶段:对 `trash/jobs/` 执行递归删除(例如 `shutil.rmtree`),同样做路径前缀校验(必须在 `.../users//trash/jobs/` 下)。 + - 为什么不依赖 SFTPGo API:SFTPGo 只是用户访问协议层(SFTP/Web),目录物理就在同一份文件系统;文件系统直连更简单、也不依赖 SFTPGo 在线。 +- 如果你强烈希望“通过 SFTPGo API 删除”: + - 可以作为可选实现/补充(例如用于统一审计或未来接入配额/策略),但不建议作为唯一手段(SFTPGo 停机不应阻塞清理)。 + +### 3.3 用户在 SFTPGo 内移动/整理文件(确认点) +支持用户在 SFTPGo 中进行“移动/重命名/整理”(例如把权重从 `jobs/` 移动到 `models/`): +- 前提:SFTPGo 用户权限允许对其 home 目录进行 `rename/mkdir/remove` 等操作(v3.0 默认可写)。 +- 行为:用户可以把 `jobs/` 下某些文件移动到 `models/` 或 `datasets/`,用于长期保存权重/评估产物等。 +- 与 retention 的关系:只要文件被移动出 `jobs/`,就不会被 jobs 清理逻辑删除。 + +### 3.4 路径权限规则(API 侧校验) +v2.5 约束是 “只允许 `/private/common/...`”。 +v3.0 需要升级为: +- 允许: + - `/private/common/...` + - `/private/users//...` +- 禁止: + - 任何其他绝对路径(例如 `/private/users/other/...`、`/etc/...`) + +并把该规则应用到 TaskSpec 的相关字段(至少): +- `train_file` / `val_file` +- `code_path`:仍仅允许 `/private/common/...`(v3.0 不支持执行用户 code) +- 本地模型路径字段:允许 `/private/users//models/...`(确认:v3.0 允许) + +## 4. 
SFTPGo 方案设计(Data Management) + +### 4.1 运行形态 +推荐用容器运行 SFTPGo(与 Ray/API 解耦),挂载同一份 `/private`: +- `sftpgo` 容器挂载 `../../shared:/private` +- 对外暴露: + - SFTP 端口(建议 2022) + - WebAdmin/API 端口(建议 8081,仅内网或管理员访问) + +#### 4.1.1 镜像来源(现成 Docker 镜像) +SFTPGo 有现成可用的 Docker 镜像(无需自建): +- v3.0 推荐优先使用官方/上游发布的 `sftpgo` 镜像作为运行基座 +- 我们在 v3.0 里不需要定制 SFTPGo 代码,只需要: + - 正确挂载 GPFS/NFS(容器内 `/private`) + - 配置管理员账号(用于 API server 联动创建/禁用用户、重置密码) + - 配置每用户 home/chroot + +> 注意:具体镜像名/tag 在不同环境可能有差异(官方/镜像仓库策略会变动)。落地时建议在 `argus@h1` 上 `docker search sftpgo` 或由你们内部镜像仓库提供固定版本;v3.0 设计只要求“使用现成镜像”,不强依赖某个 tag。 + +#### 4.1.2 docker-compose 服务草案(示意) +下面给出一个**示意**(最终以实际镜像名/tag 与你们端口规划为准): + +```yaml +services: + sftpgo: + image: sftpgo/sftpgo:latest # 示例:使用现成镜像 + container_name: argus-sftpgo + ports: + - "2022:2022" # SFTP + - "8081:8080" # WebAdmin/API(建议仅内网/管理员) + volumes: + - ../../shared:/private + - ../../shared/common/sftpgo:/var/lib/sftpgo # 持久化 SFTPGo 元数据(可选/建议) + environment: + # 管理员账号/密码(示意,具体变量名以镜像文档为准) + SFTPGO_ADMIN_USERNAME: "admin" + SFTPGO_ADMIN_PASSWORD: "${SFTPGO_ADMIN_PASSWORD}" +``` + +与 v3.0 的配合点: +- API server 使用 `data.sftpgo.admin_api_base` + admin 凭据联动创建用户 +- 用户 home/chroot 统一指向 `/private/users/` + +### 4.2 用户隔离 +每个用户在 SFTPGo 中的 home dir 绑定到: +- `/private/users/`(chroot),用户只能读写自己的目录。 + +### 4.3 用户创建与凭据管理(两种实现,建议先做 A) + +**方案 A(v3.0 推荐):API Server 负责“联动创建 SFTPGo 用户”** +- 在 v2.5 的 `POST /api/v2/users` 成功后: + - API server 调用 SFTPGo 管理 API 创建同名用户 + - 设置 home dir = `/private/users/` + - 设置权限(默认可写;是否只读可配置) +- 认证方式: + - v3.0 最小可用:用户名+密码(确认:v3.0 先 password;API 生成一次性密码,用户首次登录后要求改密) + - 或:SSH public key(WebUI 允许上传 public key,API 写入 SFTPGo) + +**方案 B(更强但复杂):SFTPGo 外部认证** +- SFTPGo 把认证委托给 API server(token/SSO),SFTP 也走内部 token。 +- 复杂度高,建议 v3.0 不做,放到 v3.5 或更后。 + +### 4.4 用户上传/下载体验 +用户通过 SFTP 上传: +- `datasets/...`(训练数据) +- `models/...`(本地模型,可选) +下载: +- `jobs//...`(checkpoints/logs) + +WebUI/文档提供 “路径如何写进 TaskSpec” 的指引。 + +## 5. 
WebUI 方案设计(最小可用) + +### 5.1 目标页面 +v3.0 WebUI 采用“**多子页面 + 侧边导航栏**”而不是把所有功能挤到单页: +- 原因:信息密度更可控,后续可扩展(v3.5+)且不会把一个页面做成“巨型表单/巨型列表”。 +- 实现仍保持轻量:服务端渲染(或静态 HTML + 少量 JS),不引入复杂前端工程。 + +信息架构(IA)建议如下: +1) **登录页**(`/ui/login`) + - 用户粘贴 token(管理员发放),浏览器保存(localStorage/sessionStorage) + - 提供“退出登录/清空 token” +2) **任务列表页**(`/ui/tasks`) + - 默认列表:最近 N 条任务(按 created_at 倒序) + - 支持过滤:workload、state(QUEUED/RUNNING/SUCCEEDED/FAILED/CANCELED)、时间范围 + - 支持快捷操作:进入详情、取消任务 +3) **新建任务页**(`/ui/tasks/new`) + - 两种模式(二选一,均可实现): + - **YAML 直接提交**:上传/粘贴 TaskSpec YAML(最省开发) + - **表单生成 YAML**:选择 workload,填写核心字段(train/val/model/nnodes/gpus),生成 YAML 预览后提交 + - 提交后跳转到任务详情页 +4) **任务详情页**(`/ui/tasks/{task_id}`) + - 顶部:task_id、workload、state、created_at、updated_at、error_summary + - Attempt 卡片:latest attempt_no、ray_submission_id、ray_status、start/end + - 操作区:取消任务(若非 terminal)、刷新状态、复制路径/ID + - 链接到日志页与产物提示(SFTP 路径) +5) **任务日志页**(`/ui/tasks/{task_id}/logs`) + - 默认 tail=2000,可选 200/1000/5000 + - 提供“自动刷新(每 3~5 秒)”开关(简单轮询即可) +6) **数据页**(`/ui/data`) + - 显示 SFTP 连接信息(host/port/username) + - 显示用户目录约定: + - home:`/private/users/` + - datasets:`/private/users//datasets` + - models:`/private/users//models` + - jobs:`/private/users//jobs` + - trash/jobs:`/private/users//trash/jobs` + - 明确 retention:jobs 结束后 3 天移入回收站,回收站 7 天后删除;重要文件请移到 `models/` 或 `datasets/` +7) **(仅管理员可见)用户管理页**(`/ui/admin/users`,可选但很有价值) + - 创建用户、禁用用户、签发 token、重置 SFTP 密码(方案 A) + +### 5.2 页面组织与导航(建议) +侧边栏导航(普通用户): +- Tasks(列表) +- New Task(新建) +- Data(SFTP/目录说明) + +管理员侧边栏额外增加: +- Admin / Users + +### 5.3 大致示意图(wireframe) + +下面是一个粗略示意(非最终 UI,仅表达信息结构与布局): + +``` +┌──────────────────────────────────────────────────────────────────────┐ +│ Argus MVP v3.0 [user: alice] │ +├───────────────┬──────────────────────────────────────────────────────┤ +│ Side Nav │ /ui/tasks │ +│ │ │ +│ • Tasks │ [Filter] workload=all state=all [Search task_id] │ +│ • New Task │ │ +│ • Data │ Task List │ +│ • Admin(*) │ ┌────────────────────────────────────────────────┐ │ +│ │ │ task_id 
workload   state     ...                 │  │
+│               │  │ mvp2-alice-ppo-...  ppo        RUNNING   ...                 │  │
+│               │  │ mvp2-alice-sft-...  sft        SUCCEEDED ...                 │  │
+│               │  └────────────────────────────────────────────────┘  │
+│               │  [View] [Cancel]                                     │
+└───────────────┴──────────────────────────────────────────────────────┘
+```
+
+任务详情页(示意):
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ /ui/tasks/{task_id}                                                  │
+├──────────────────────────────────────────────────────────────────────┤
+│ task_id: mvp2-alice-ppo-...   state: RUNNING   workload: ppo         │
+│ created_at: ...               updated_at: ...                        │
+│ error_summary: (empty)                                               │
+│                                                                      │
+│ latest_attempt: a01  ray_submission_id: ...--a01  ray_status: RUNNING│
+│ [Open Logs] [Cancel Task] [Refresh]                                  │
+│                                                                      │
+│ Artifacts (SFTP paths):                                              │
+│   jobs/:  /private/users/alice/jobs/<task_id>/                       │
+│   trash/: /private/users/alice/trash/jobs/<task_id>/                 │
+│   tip: move important files to /private/users/alice/models/          │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+### 5.4 技术取舍(建议:不引入 Node 构建)
+为了降低部署复杂度,建议 v3.0 WebUI 以 “服务端渲染 + 少量 JS/HTMX” 或 “纯静态 HTML+fetch” 实现:
+- 由 API server 提供静态资源(FastAPI StaticFiles)
+- 页面调用同源 API,避免跨域与复杂前端构建链
+
+## 6. API 扩展设计(概览)
+
+v3.0 可以保持 `/api/v2/...` 不变,增量加:
+- SFTPGo 集成管理端点(管理员):
+  - 创建/禁用用户时联动 SFTPGo
+  - 重置 SFTP 密码 / 更新 SSH key
+- 用户数据端点(可选,最小化):
+  - `/api/v2/me`:返回 user_id、SFTP 信息(host/port/home)
+  - `/api/v2/files`:仅用于浏览/下载(上传仍走 SFTP)
+
+详细见 `specs/mvp/v3.0/v3.0_api.md`。
+
+## 7. 
配置与部署(v3.0 新增项) + +在 `configs/dev.yaml` 基础上扩展一组 `data` 配置(示意): +```yaml +data: + shared_root: "/private" # 通常与 ray.shared_root 一致 + user_root: "/private/users" # 用户空间根目录 + allow_common_prefix: "/private/common/" + allow_user_prefix_template: "/private/users/{user_id}/" + + sftpgo: + enabled: true + host: "127.0.0.1" + sftp_port: 2022 + admin_api_base: "http://127.0.0.1:8081/api/v2" + admin_user: "admin" + admin_password_env: "SFTPGO_ADMIN_PASSWORD" # 仅 head 容器内可读 + + retention: + jobs_trash_after_days: 3 + jobs_purge_after_days: 7 + trash_root_template: "/private/users/{user_id}/trash/jobs" + janitor_interval_s: 3600 # 每小时扫一次(可配置) +``` + +## 8. 风险点与对策 + +1) **路径逃逸/越权读取** + - 必须在 API 提交任务时校验路径前缀 + - SFTPGo 必须 chroot 到用户 home +2) **大文件上传稳定性** + - 优先用 SFTP(断点续传/可靠性更好) +3) **用户 token 与 SFTP 凭据的生命周期** + - token 走 v2.5 SQLite + - SFTP 凭据建议独立(密码/SSH key),并提供 reset 流程 +4) **GPFS/NFS 权限** + - 确保 `/private/users/` 目录权限可被 SFTPGo 写入且 worker 可读 + +## 9. 已确认结论(来自你的反馈) +1) 允许用户上传并在训练时使用自定义数据集:允许(`/private/users//datasets/...`)。 +2) 允许用户上传并在训练时使用本地模型路径:允许(`/private/users//models/...`)。 +3) v3.0 不允许执行用户自定义代码(不注入 `PYTHONPATH` 作为可执行 code path)。 +4) SFTPGo 认证方式:v3.0 先 password。 +5) WebUI:按“简单最小必要功能”做(token 粘贴登录优先)。 + +## 10. 
待确认问题(需要你给结论) +(已确认)jobs 清理执行主体:v3.0 采用 **API server 内置 janitor 后台线程**。 diff --git a/specs/mvp/v3.0/v3.0_dev_plan.md b/specs/mvp/v3.0/v3.0_dev_plan.md new file mode 100644 index 0000000..9a14b4b --- /dev/null +++ b/specs/mvp/v3.0/v3.0_dev_plan.md @@ -0,0 +1,232 @@ +# MVP v3.0 开发计划(TDD 驱动) + +本文是 v3.0 的**工程化开发计划**,强调“先写测试,再写实现”(TDD),并将每个里程碑拆成**可独立验收**的小闭环。 + +输入依据: +- 路线图:`specs/mvp/mvp_roadmap_v2.md` +- v3.0 设计:`specs/mvp/v3.0/v3.0_design.md` +- v3.0 API:`specs/mvp/v3.0/v3.0_api.md` +- v3.0 验收:`specs/mvp/v3.0/v3.0_acceptance.md` +- 现状基线:v2.5(Task queue + User mgmt + Stateless ray pool + 单镜像节点守护) + +v3.0 已确认约束: +- 允许用户数据集路径:`/private/users//datasets/...` +- 允许用户本地模型路径:`/private/users//models/...` +- **不允许执行用户自定义代码**(不注入 user code 到 PYTHONPATH;`code_path` 仍只允许 `/private/common/...`) +- SFTPGo 先用 **password** 方案(方案 A:API 联动创建/管理 SFTPGo 用户) +- jobs retention:**3 天移入回收站(trash/jobs),再 7 天永久删除**;不提供 keep/延长保留标记 +- janitor:**API server 内置后台线程**;删除/移动采用**文件系统直接操作**(不依赖 SFTPGo API) + +--- + +## 0. TDD 规范(所有功能都遵循) + +### 0.1 测试分层 + +1) **单元测试(fast)** +- 纯 Python 逻辑:路径策略、SFTPGo client、retention 计算、文件移动/删除策略(用临时目录)。 +- 不依赖真实 Ray、不依赖 docker、不依赖网络。 + +2) **组件测试(中等)** +- FastAPI 路由(含 WebUI 路由):`fastapi.testclient.TestClient` +- mock/stub SFTPGo client 与 ray client + +3) **端到端(慢)** +- 在 `argus@h1` 通过 docker compose + scripts: + - Ray 集群自动起来(head+2 worker) + - SFTPGo 服务可用 + - 上传数据 → 提交训练 → 下载产物 → jobs 回收站/清理 + +### 0.2 代码与测试约定 +- 测试目录:`src/mvp/py/tests/` +- 新功能必须先补齐测试用例,并让其在未实现时失败(红) +- 最小实现让测试变绿(绿) +- 再做重构(重构) +- 覆盖率:继续沿用当前阈值(>= 90%) + +--- + +## 1. 
里程碑拆分(v3.0 = 5 个可验证闭环) + +### M1:TaskSpec 路径策略升级(允许 user datasets/models;code_path 仍仅 common) + +**目标** +- API submit 时的路径校验从 v2.5 的 “仅 `/private/common/`” 升级为: + - `train_file` / `val_file`:允许 `/private/common/...` 与 `/private/users//...` + - 本地模型路径:允许 `/private/users//models/...`(不改变 YAML 结构,见实现建议) + - `code_path`:仍仅允许 `/private/common/...` +- 阻止越权路径(`/private/users/other/...`)与非 `/private/...` 路径。 + +**实现建议(不扩展 TaskSpec)** +- `model_id` 字段保持不变: + - 若 `model_id` 以 `/private/` 开头 → 视作本地模型路径 + - 否则视作 HuggingFace repo id(如 `Qwen/...`) + +**TDD 用例(先写测试)** +- 单测: + - `test_paths_allow_common_and_own_user_prefix()` + - `test_paths_reject_other_user_prefix()` + - `test_model_id_local_path_allowed_only_under_users_models()` + - `test_code_path_still_common_only()` +- API 测试: + - `test_submit_accepts_user_datasets_paths()` + - `test_submit_rejects_cross_user_paths_404_or_400()`(按约定返回 400/403) + +**验收点** +- `v3.0_acceptance.md` 的 D 类安全隔离用例可由 API 测试覆盖。 + +--- + +### M2:SFTPGo 集成(方案 A:用户联动创建 + password) + +**目标** +- 引入 `data management (SFTPGo)`: + - admin 创建用户时联动创建 SFTPGo 用户(home=/private/users/,chroot) + - password 模式:生成一次性密码(reset/create)并返回给 admin(明文只返回一次) +- 提供用户自助信息: + - `GET /api/v2/me` 返回 SFTP 连接信息、目录约定、retention 提示。 + +**实现建议** +- 新增 `SFTPGoAdminClient`(同步调用): + - 通过 `urllib` 或 `httpx`(建议 `urllib`,减少依赖;禁止 hard-code requests 使用) + - 支持:create user / disable user / reset password(最小集合) +- API server 启动时校验配置(enabled 时必须具备 admin 密码 env)。 +- 同步创建用户目录结构(文件系统): + - `/private/users//{datasets,models,code,jobs,trash/jobs}`(最小必需) + +**TDD 用例(先写测试)** +- 单测: + - `test_sftpgo_client_builds_correct_requests()`(不发真实网络;mock urlopen) + - `test_user_dirs_created_on_user_create()`(tmp dir 断言目录存在) +- API 测试: + - `test_create_user_calls_sftpgo_client()`(stub client,断言调用参数) + - `test_me_returns_sftp_info_and_paths()`(含 trash/jobs 与 TTL 字段) + +**验收点** +- `v3.0_acceptance.md` 的 A 类(用户/凭据)与 B 类(上传闭环前置)覆盖。 + +--- + +### M3:WebUI(最小可用,多页面 + 侧边栏) + +**目标** +- WebUI 由 API server 托管(同源,无额外 
CORS): + - `/ui/login`:token 粘贴登录(localStorage) + - `/ui/tasks`:任务列表 + 过滤(最小) + - `/ui/tasks/new`:YAML 提交(优先)+(可选)表单生成 YAML + - `/ui/tasks/{task_id}`:详情页 + - `/ui/tasks/{task_id}/logs`:日志 tail + 可选自动刷新 + - `/ui/data`:SFTP 信息 + 目录/retention 提示 + - (可选)`/ui/admin/users`:管理员用户管理(若时间允许,强烈建议) + +**实现建议** +- 先不引入 Node 构建: + - HTML 模板可用最简单的字符串拼接或 Jinja2(若引入 jinja2,则补齐依赖与测试) + - 页面通过 fetch 调用 `/api/v2/...`,并复用 token header + +**TDD 用例(先写测试)** +- 组件测试(TestClient): + - `test_ui_routes_render_200()` + - `test_ui_contains_sidebar_links()`(简单断言文本包含导航链接) + - `test_ui_tasks_detail_shows_ids()`(包含 task_id、state、ray_submission_id) + +**验收点** +- WebUI 能完成:登录→创建任务→查看任务→查看日志→看到 data 页提示。 + +--- + +### M4:Jobs Retention janitor(3 天移入 trash,7 天后 purge) + +**目标** +- API server 内置 janitor 后台线程: + - 周期性扫描 DB 中 terminal tasks + - 到期后执行: + - move:`/private/users//jobs/` → `/private/users//trash/jobs/` + - purge:递归删除 `/private/users//trash/jobs/` + - 全程严格 path 校验,禁止越界删除 + - 清理操作记录到 DB events(审计) + +**实现建议(数据与状态)** +- 需要稳定的时间锚点与幂等: + - 使用 attempts.end_time 作为 job 结束时间(latest attempt) + - 在 tasks 表新增字段(或新表)记录: + - `trashed_at`(首次成功 move 时间) + - `purged_at`(成功删除时间) + - `trash_path`(可选) + - 幂等:重复运行不会报错(目录不存在视为已处理) + +**TDD 用例(先写测试)** +- 单测(用 tmpdir 构造 jobs/trash 目录): + - `test_janitor_moves_job_to_trash_after_threshold()` + - `test_janitor_purges_trash_after_threshold()` + - `test_janitor_never_touches_models_or_datasets()` + - `test_janitor_path_escape_rejected()`(恶意 path 不可删) +- API/组件测试: + - `test_me_includes_retention_fields()`(jobs_trash_after_days/jobs_purge_after_days) + +**验收点** +- `v3.0_acceptance.md` 的 C2 用例可按“把阈值调小到分钟级”完成验证。 + +--- + +### M5:端到端(h1)— SFTP 上传→训练→产物下载→回收站/清理 + +**目标** +- 在 `argus@h1` 落一个一键脚本(或手册)跑通: + 1) `docker compose up -d` 拉起 Ray(head+2 worker)+ SFTPGo + 2) admin 创建用户 alice(联动创建 SFTPGo 用户 + password) + 3) alice 通过 SFTP 上传: + - 数据集到 `/private/users/alice/datasets/...` + - (可选)本地模型到 `/private/users/alice/models/...` + 4) alice 通过 API/WebUI 提交任务引用上述路径 + 5) 任务成功后: + - 从 
`jobs/` 下载 logs/checkpoints + - 把权重移动到 `models/`,验证不会被清理 + 6) 把 retention 配置调小,验证 jobs→trash→purge + +**交付建议** +- 新增脚本(命名示例): + - `scripts/run_all_v30_api.sh` + - `scripts/run_e2e_v30_cases.sh` +- 新增 `docker-compose.yaml` 中的 `sftpgo` service(或 `docker-compose.v30.yaml` 叠加文件) + +**验收点** +- `v3.0_acceptance.md` 全部 MUST 用例通过。 + +--- + +## 2. 风险与测试关注点 + +1) **权限与路径逃逸** +- path policy 必须覆盖:train/val/model_id(local)/output dirs(jobs/trash) +- 所有删除/移动必须做 prefix 校验 + +2) **并发与竞态** +- janitor 只处理 terminal tasks,避免清理正在写入的目录 +- move 使用同文件系统 `os.replace`(原子) + +3) **SFTPGo 可用性** +- SFTPGo 不在线不应影响训练与 API 核心功能(除了用户创建联动) +- janitor 不依赖 SFTPGo(文件系统直连) + +--- + +## 3. 交付清单(代码/配置/脚本/文档) + +### 3.1 代码 +- Path policy(v3.0) +- SFTPGoAdminClient + user create/disable/reset password 联动 +- `/api/v2/me` 扩展(SFTP/目录/retention) +- WebUI 路由与静态资源 +- janitor(trash+purge)后台线程 + DB 记录 + +### 3.2 配置 +- `configs/dev.yaml` 增加 `data.sftpgo`、`data.retention` 段(详见设计文档) + +### 3.3 scripts / compose +- compose 增加 `sftpgo`(或新增 overlay compose 文件) +- v3.0 e2e 脚本(上传/下载/清理验证) + +### 3.4 文档 +- 更新 `specs/mvp/v3.0/*` 与 `src/mvp/README.md`(运行方式、路径约定、SFTP 操作、retention 解释) + diff --git a/specs/mvp/v3.0/v3.0_progress.md b/specs/mvp/v3.0/v3.0_progress.md new file mode 100644 index 0000000..49ebfe6 --- /dev/null +++ b/specs/mvp/v3.0/v3.0_progress.md @@ -0,0 +1,154 @@ +# MVP v3.0 进展记录(milestone log) + +本文档用于记录 v3.0 按 `specs/mvp/v3.0/v3.0_dev_plan.md` 实施过程中的里程碑完成情况。 +约定:每完成一个里程碑,追加一条记录,包含**日期**、**完成内容**、**涉及文件**、**验证方式/结果**、**待办/风险**。 + +--- + +## M1:Path policy + tests(已完成) + +- 日期:2025-12-30 +- 范围:按 v3.0 路径策略升级 API submit 的路径校验(不扩展 TaskSpec YAML 结构)。 +- 完成内容: + - `code_path`:仍只允许 `/private/common/...`(v3.0 不执行 user code)。 + - `train_file`/`val_file`:允许 `/private/common/datasets/...` 或 `/private/users//datasets/...`。 + - `model_id`:若以 `/private/` 开头则视为本地路径,仅允许: + - `/private/common/models/...` 或 + - `/private/users//models/...` + 否则仍按 HuggingFace repo id(如 `Qwen/...`)处理。 + - 拒绝跨用户路径(例如 `bob` 提交 
`/private/users/alice/datasets/...`)。 + - 拒绝本地模型路径不在 `models/`(例如指向 `jobs/`)。 +- 涉及文件: + - `src/mvp/py/argus/service/app.py` + - `src/mvp/py/tests/test_users.py` +- 验证方式与结果: + - 本地单测:`.venv/bin/python -m pytest -q` + - 结果:全部通过(`54 passed`),覆盖率阈值保持 `>= 90%`。 +- 待办/风险: + - `model_id=/private/...` 的“本地模型路径语义”需要在用户文档/WebUI 中明确提示(避免误用)。 + - 后续 M2/M3 需要把该路径策略同步到 UI 表单/提示文本(避免用户填错路径)。 + +--- + +## M2:SFTPGo 集成(方案 A:用户联动创建 + password)(已完成) + +- 日期:2025-12-30 +- 范围:SFTPGo(Data Management)最小集成 + 用户自助信息 `/api/v2/me` + 用户目录结构落盘。 +- 完成内容: + - 新增 `data` 配置段: + - `data.user_root`:用户数据根目录(默认 `/private/users`) + - `data.sftpgo`:SFTPGo 可选联动(enabled/host/sftp_port/admin_api_base/admin_user/admin_password_env) + - `data.retention`:jobs 过期策略配置(3 天移入 trash,7 天 purge;janitor 在 M4 实现) + - 新增 `SFTPGoAdminClient`(`urllib` 实现,不使用 `requests`): + - `create_user` / `disable_user` / `reset_password`(最小集合) + - API server 增强: + - `POST /api/v2/users`:创建 DB user + 同步创建目录结构(`datasets/models/code/jobs/trash/jobs`) + - 当 `data.sftpgo.enabled=true` 时,创建用户会联动调用 SFTPGo admin API,并返回一次性密码(明文仅返回一次,服务端不保存) + - `POST /api/v2/users/{user_id}:disable`:禁用用户(SFTPGo 禁用 best-effort) + - `POST /api/v2/users/{user_id}/sftp:reset_password`:管理员重置一次性密码(SFTPGo enabled 才允许) + - `GET /api/v2/me`:返回当前用户的目录约定、retention 提示,以及(可选)SFTP 连接信息 + - 同步更新 `src/mvp/configs/dev.yaml`:补齐 v3.0 相关 `data.*` 配置(默认关闭 sftpgo)。 +- 涉及文件: + - `src/mvp/py/argus/service/config.py` + - `src/mvp/py/argus/service/sftpgo.py` + - `src/mvp/py/argus/service/app.py` + - `src/mvp/py/tests/test_sftpgo.py` + - `src/mvp/py/tests/test_users.py` + - `src/mvp/py/tests/test_app.py` + - `src/mvp/py/tests/test_service_config.py` + - `src/mvp/configs/dev.yaml` + - `specs/mvp/v3.0/v3.0_api.md` +- 验证方式与结果: + - 本地单测:`.venv/bin/python -m pytest -q` + - 结果:全部通过(`62 passed`),覆盖率 `90.11%`(阈值 `>= 90%`)。 +- 待办/风险: + - M2 仅做了“API 侧联动 + 单测”,未在真实 SFTPGo 容器上端到端验证(按计划在 M5 完成)。 + - 目录创建依赖文件系统权限:生产部署时需确保 API/head 容器对 `/private/users` 可写。 + +--- + +## M3:WebUI(最小可用,多页面 + 侧边栏)(已完成) 
+ +- 日期:2025-12-30 +- 范围:API server 托管最小 WebUI(同源,不引入 Node 构建),用于登录/提交/查看任务与日志、查看 data 信息。 +- 完成内容: + - 新增 UI 路由(HTML+少量 JS): + - `/ui`(重定向到 tasks) + - `/ui/login`:token 粘贴并写入浏览器 localStorage(key=`mvp_token`) + - `/ui/tasks`:任务队列列表(调用 `/api/v2/queue`) + - `/ui/tasks/new`:提交 TaskSpec YAML(POST `/api/v2/tasks`) + - `/ui/tasks/{task_id}`:任务详情(GET `/api/v2/tasks/{task_id}`,支持 cancel) + - `/ui/tasks/{task_id}/logs`:日志查看(GET `/api/v2/tasks/{task_id}/logs`,可选自动刷新) + - `/ui/data`:展示 `/api/v2/me` 返回的路径/SFTP/retention 信息 + - 统一侧边栏导航:Tasks / New Task / Data / Login。 + - UI 不做服务端 session:所有 API 调用均由浏览器带 `Authorization: Bearer `(localStorage 注入)。 +- 涉及文件: + - `src/mvp/py/argus/service/ui.py` + - `src/mvp/py/argus/service/app.py` + - `src/mvp/py/tests/test_ui.py` +- 验证方式与结果: + - 本地单测:`.venv/bin/python -m pytest -q` + - 结果:全部通过(`65 passed`),覆盖率 `90.53%`(阈值 `>= 90%`)。 +- 待办/风险: + - WebUI 当前为“骨架+API 驱动”,不做复杂交互与大文件下载;上传/下载仍以 SFTP 为主(按设计)。 + - Starlette TestClient 的 `allow_redirects` 有弃用告警(不影响功能,可在后续清理)。 + +--- + +## M4:Jobs Retention janitor(3 天移入 trash,7 天后 purge)(已完成) + +- 日期:2025-12-30 +- 范围:API server 内置后台线程,对“已结束 attempt”的 job 目录执行保留策略(文件系统直连,不依赖 SFTPGo)。 +- 完成内容: + - 新增 `JobsJanitor`: + - 以 `attempts.end_time` 为基准计算 TTL(从 job 结束开始算) + - `>= 3 天 && < 7 天`:把目录从 `.../jobs/` 移动到 `.../trash/jobs/` + - `>= 7 天`:确保目录进入 trash 后删除(`shutil.rmtree`) + - 对缺失目录、异常移动/删除为 best-effort(不影响服务主流程) + - DB 增强:新增查询 `list_ended_attempts_before()`,用于 janitor 扫描候选 attempt。 + - API server 启动时启动 janitor 线程(可通过 `data.retention.janitor_interval_s` 控制;<=0 视为关闭)。 +- 涉及文件: + - `src/mvp/py/argus/service/janitor.py` + - `src/mvp/py/argus/service/db.py` + - `src/mvp/py/argus/service/app.py` + - `src/mvp/py/tests/test_janitor.py` +- 验证方式与结果: + - 本地单测:`.venv/bin/python -m pytest -q` + - 结果:全部通过(`75 passed`),覆盖率 `90.72%`(阈值 `>= 90%`)。 +- 待办/风险: + - M4 只做“逻辑 + 单测”,实际 `/private/users/...` 的权限与在 `argus@h1` 的行为验证放到 M5(端到端)。 + +--- + +## M5:端到端(h1)— SFTPGo compose + v3.0 E2E 脚本(已完成:交付脚本/配置) + +- 日期:2025-12-30 +- 范围:补齐 
h1 端到端所需的 compose/service、配置与一键脚本(实际运行/验收由你在 `argus@h1` 执行)。 +- 完成内容: + - SFTPGo 集成到 `docker compose`: + - 新增 `argus-sftpgo` service(SFTP 2022;Admin API/UI 8080→host 8081,避免与 MVP API 8080 冲突) + - 同挂载 `../../shared:/private`,并持久化元数据到 `../../shared/common/sftpgo` + - SFTPGoAdminClient 实装(对齐 upstream OpenAPI): + - `GET /api/v2/token`(BasicAuth)获取 admin token + - `POST /api/v2/users` 创建用户(含 `permissions: {"/":["*"]}`) + - `PUT /api/v2/users/{username}` 禁用/重置密码 + - 新增 v3.0 dev 配置:`configs/dev_v30.yaml`(启用 `data.sftpgo` 并配置 `admin_api_base=http://argus-sftpgo:8080/api/v2`) + - 新增 v3.0 一键脚本: + - `scripts/run_all_v30_api.sh`:起 Ray+SFTPGo、启动 API、创建用户并提交 PPO/GRPO/SFT(引用 user dataset 路径) + - `scripts/run_e2e_v30_cases.sh`:最小 E2E runner(HP-1) + - API 启动脚本增强:`scripts/60_start_api.sh` 支持透传 `SFTPGO_ADMIN_PASSWORD` 到 head 容器内的 API 进程。 +- 涉及文件: + - `src/mvp/docker-compose.yaml` + - `src/mvp/configs/dev_v30.yaml` + - `src/mvp/scripts/run_all_v30_api.sh` + - `src/mvp/scripts/run_e2e_v30_cases.sh` + - `src/mvp/scripts/60_start_api.sh` + - `src/mvp/py/argus/service/sftpgo.py` + - `src/mvp/py/tests/test_sftpgo.py` + - `src/mvp/README.md` + - `specs/mvp/v3.0/v3.0_api.md` +- 验证方式与结果: + - 本地单测:`.venv/bin/python -m pytest -q` + - 结果:全部通过(`75 passed`),覆盖率 `90.35%`(阈值 `>= 90%`)。 +- 待办/风险: + - 需要你在 `argus@h1` 实跑 `scripts/run_all_v30_api.sh` 完成真正的 SFTP 上传/下载与 retention 验收(按 `v3.0_acceptance.md`)。 diff --git a/specs/mvp/v3.0/v3.0_summary.md b/specs/mvp/v3.0/v3.0_summary.md new file mode 100644 index 0000000..c952025 --- /dev/null +++ b/specs/mvp/v3.0/v3.0_summary.md @@ -0,0 +1,166 @@ +# MVP v3.0 迭代总结(Ray + SFTPGo + API + WebUI) + +本文总结 v3.0 迭代最终落地的功能、架构、运行方式、验收点与已知限制,便于后续评审、交接与继续迭代。 + +相关更详细文档: +- `specs/mvp/v3.0/v3.0_design.md` +- `specs/mvp/v3.0/v3.0_api.md` +- `specs/mvp/v3.0/v3.0_dev_plan.md` +- `specs/mvp/v3.0/v3.0_acceptance.md` +- `specs/mvp/v3.0/v3.0_progress.md` + +--- + +## 1. 
目标与范围 + +v3.0 作为“第一版可发布”的最小闭环,主要新增: +- **WebUI**:最小可用的人机界面(登录、任务提交与查看、数据入口、管理员入口)。 +- **用户管理**:基于内部 token 的用户体系(admin 与普通用户),支持创建用户与签发 token。 +- **数据管理入口(SFTPGo)**:用户通过 SFTP/WebClient 上传下载自己的数据;同时暴露只读的共享数据/缓存目录(common)用于复用。 +- **保持训练闭环**:仍通过 Ray Job 提交到集群执行(PPO/GRPO/SFT 三类 workload 都验证)。 + +明确不做(本迭代保持最小): +- 不支持用户自定义训练代码(TaskSpec 的 `code_path` 固定走 common 下的 verl snapshot 策略)。 +- 不做复杂资源排队优化/多集群/多租隔离策略(目前隔离粒度主要在用户 jobs 目录层)。 + +--- + +## 2. 系统架构(最终形态) + +核心组件: +- **Ray 集群(容器)** + - `argus-ray-head`:head 节点(无 GPU/不跑训练),提供 Ray Dashboard 与 Job Server。 + - `argus-ray-worker-0/1`:worker 节点(有 GPU),承载训练任务。 + - worker 以 “stateless + watchdog 自动连接 head” 的方式加入集群。 +- **API Server(运行在 head 容器内)** + - 读取 YAML 配置(dev/prod),维护任务队列(sqlite),并周期性调度将任务提交到 Ray。 + - 同时承载 WebUI(`/ui`)。 +- **SFTPGo(容器)** + - 提供 SFTP(端口 `2022`)与 Web Client/Admin(端口 `8081` 映射到容器 8080)。 + - 用户 home 为 `/private/users/`,默认可读写。 + - 额外提供 `/common/*` 共享只读入口(见第 4 节)。 +- **共享存储(NFS/GPFS 等挂载到容器内 `/private`)** + - `/private/common`:共享缓存(hf、datasets、models、db、logs 等)。 + - `/private/users/`:用户隔离目录(jobs/datasets/models/code/trash 等)。 + +--- + +## 3. 任务与调度(Task / Ray Job) + +### 3.1 Task(平台概念) +- 用户向 API 提交 TaskSpec(YAML),平台分配 `task_id`(可读、包含用户名)。 +- `task_id` 对应内部状态机与重试逻辑;底层每次提交 Ray Job 会产生 attempt 与 `ray_submission_id`。 + +### 3.2 Ray Job(Ray 概念) +- 真正执行训练的 driver 通过 Ray Job 运行在集群 worker 上(避免 head 承载训练)。 +- head 节点通过 `--num-cpus=0` / 自定义资源等策略避免调度到 head。 + +### 3.3 VERL 资源预检查的处理 +- VERL 在创建资源池时会做 fail-fast 资源预检查(如“可用 GPU 不足”直接报错退出)。 +- v3.0 延续 v2.x 的策略:服务端识别失败原因并按策略重试/回退(具体见 scheduler 实现与 v2.5/3.0 文档)。 + +--- + +## 4. 
数据管理(SFTPGo)与 common 只读目录 + +### 4.1 用户目录(读写) +- 用户通过 SFTP/WebClient 访问自己的 home:`/private/users/` +- 目录结构(至少):`datasets/ models/ code/ jobs/ trash/ common/` + +### 4.2 common 只读(方案 A:Virtual Folder) +本迭代采用 SFTPGo 的 Virtual Folder + 路径权限覆盖,实现用户可读共享目录但不可写。 + +最终对外暴露为: +- `/common/datasets`(只读) + - **mapped_path 指向真实目录 `/private/datasets`**(避免 `/private/common/datasets` 中大量 symlink 导致的 WebClient “权限不足/越界”问题) +- `/common/hf`(只读) + - mapped_path 指向 `/private/hf` + +备注: +- `/private/common/datasets` 内部存在 symlink(如 `gsm8k -> /private/datasets/gsm8k`),如果虚拟目录映射到 symlink 根目录,SFTPGo 会把 symlink 跳转视为“逃逸 root”,导致点击进入时报权限不足;因此选择直接映射到真实目录根。 + +--- + +## 5. WebUI(最小可用) + +入口: +- `/ui/login`:粘贴 token(存 browser `localStorage`) +- `/ui/tasks`:任务列表(Running/Pending/Completed),Completed 支持分页 +- `/ui/tasks/new`:提交任务(PPO/GRPO/SFT 三套样例可一键填充) +- `/ui/data`:展示当前用户名、支持重置 SFTPGo 密码并复制;提供跳转到 SFTPGo WebClient;提示 FileZilla 等客户端用法 +- `/ui/admin`:管理员入口(创建用户、签发 token、用户列表) +- 导航栏提供 Ray Dashboard 快捷跳转(当前 IP 的 `:8265`) + +关于 admin 页面权限: +- admin 页面本身可访问,但其数据请求必须携带 admin token;否则会在页面内显示 401/403/错误信息(满足“需要先提供 admin token 才能看到内容”)。 + +--- + +## 6. API(v3.0 新增/强化点) + +核心接口(节选): +- 认证: + - Bearer token:`MVP_INTERNAL_TOKEN`(admin)或用户 token(由 admin 签发) +- 用户管理(admin): + - `POST /api/v2/users` 创建用户(并初始化用户目录) + - `GET /api/v2/users` 获取用户列表(包含最新 token、创建/更新时间等) + - `POST /api/v2/users/{user_id}/tokens` 签发用户 token +- 任务: + - `POST /api/v2/tasks` 提交 TaskSpec(YAML) + - `GET /api/v2/tasks` 任务列表(支持 states/limit/offset,用于 Completed 分页) + - `GET /api/v2/tasks/{task_id}`、`POST /api/v2/tasks/{task_id}:cancel`、`GET /api/v2/tasks/{task_id}/logs` + - `GET /api/v2/queue`(运行中/待调度概览) +- 数据/SFTP: + - `GET /api/v2/me` 返回用户路径信息、SFTP 连接信息,并 best-effort 对齐 SFTPGo 用户配置 + - `POST /api/v2/me/sftp:reset_password` 用户自助重置 SFTPGo 密码(一次性返回明文) + +安全取舍说明(当前为内网/开发优先): +- 为了 Admin WebUI “可查看并复制 token”,数据库持久化存储了 `token_plain`(明文 token)。 + - 这在生产场景通常不建议;未来可改为只展示“重置/重新签发”而不回显明文,或只回显一次。 + +--- + +## 7. 
持久化与清理 + +- 任务队列:sqlite(WAL 模式) +- SFTPGo:自带 sqlite db(容器挂载持久化目录) +- Jobs 目录清理策略(服务端 janitor): + - job 结束后 3 天移动到回收目录(trash) + - 回收目录再保留 7 天后删除 + +--- + +## 8. 运行方式与脚本 + +开发/验收脚本: +- `src/mvp/scripts/run_all_v30_api.sh`:端到端拉起(Ray + SFTPGo + API),并通过 API 提交 PPO/GRPO/SFT,等待完成并验收 +- 其他脚本用于启动/停止 API、准备数据与模型、探测服务就绪等(详见 scripts 目录与 README) + +典型端到端(示例参数): +- `MVP_INTERNAL_TOKEN=my-dev-token` +- `SFTPGO_ADMIN_PASSWORD=my-dev-sftpgo-admin` +- 支持 `RESET_DB/RESET_SFTPGO` 用于测试环境重置 + +--- + +## 9. 验证结果(已跑通) + +在 `argus@h1` 环境中已完成端到端验证: +- Ray 集群可用(head + 2 worker) +- API server + WebUI 可用 +- SFTPGo(admin + 普通用户)可用 +- 通过 API 连续提交 PPO/GRPO/SFT 三种任务均能完成(SUCCEEDED) +- 用户可以登录 SFTPGo WebClient/SFTP,访问自己的目录,并访问 `/common/datasets`、`/common/hf` 的只读内容 + +同时本地单测通过: +- pytest 全绿 +- 覆盖率阈值 >= 90% + +--- + +## 10. 已知限制 & 后续可改进 + +- WebUI 当前为最小版,交互与权限提示仍偏“工程化”而非产品化(后续可增强错误提示、搜索筛选、任务详情聚合等)。 +- token 明文持久化仅适合内网/开发场景;生产建议改为一次性展示或支持撤销/轮换策略。 +- SFTPGo 虚拟目录目前保留了历史遗留映射(例如 `/common/models` 可能残留),后续可在升级脚本中做一次性清理与迁移。 + diff --git a/src/mvp/README.md b/src/mvp/README.md index 9758140..4fd720e 100644 --- a/src/mvp/README.md +++ b/src/mvp/README.md @@ -16,3 +16,11 @@ - CLI 提交流程:`scripts/run_all_cli.sh` - API 提交流程:`scripts/run_all_api.sh` - v2.5(Stateless worker + user 隔离 jobs)E2E:`scripts/run_all_v25_api.sh` +- v3.0(WebUI + SFTPGo + user datasets/models + jobs retention)E2E:`scripts/run_all_v30_api.sh` + +v3.0 访问入口(dev/h1): +- WebUI:`http://127.0.0.1:8080/ui` +- Ray Dashboard:`http://127.0.0.1:8265` +- SFTPGo: + - SFTP:`127.0.0.1:2022` + - Admin API/UI:`http://127.0.0.1:8081`(容器内 8080,host 映射到 8081 避免与 API server 冲突) diff --git a/src/mvp/configs/dev.yaml b/src/mvp/configs/dev.yaml index d117f65..7d0f98e 100644 --- a/src/mvp/configs/dev.yaml +++ b/src/mvp/configs/dev.yaml @@ -32,3 +32,24 @@ service: retry_interval_s: 60 max_running_tasks: 1 +# v3.0: user data management (filesystem + SFTPGo) +data: + # All user writable data is placed under this root: + # 
/private/users//{datasets,models,code,jobs,trash/jobs} + user_root: "/private/users" + + # SFTPGo is optional in dev; when enabled, admin endpoints will call SFTPGo admin API. + # Admin password is provided by env var `data.sftpgo.admin_password_env`. + sftpgo: + enabled: false + host: "" # shown to users via GET /api/v2/me + sftp_port: 2022 + admin_api_base: "" # e.g. http://argus-sftpgo:8080 + admin_user: "admin" + admin_password_env: "SFTPGO_ADMIN_PASSWORD" + + # Jobs retention policy (v3.0 janitor): move to trash after 3d, purge after 7d. + retention: + jobs_trash_after_days: 3 + jobs_purge_after_days: 7 + janitor_interval_s: 3600 diff --git a/src/mvp/configs/dev_v30.yaml b/src/mvp/configs/dev_v30.yaml new file mode 100644 index 0000000..3c2649d --- /dev/null +++ b/src/mvp/configs/dev_v30.yaml @@ -0,0 +1,51 @@ +ray: + # Ray Job server address (head 容器内视角) + address: "http://127.0.0.1:8265" + + # 共享根路径(容器内统一 /private,对齐生产) + shared_root: "/private" + + # 强制 driver 落 worker(head 不跑训练) + entrypoint_num_cpus: 1 + entrypoint_resources: + worker_node: 1 + + # 所有 job 通用 runtime env + runtime_env: + env_vars: + HF_ENDPOINT: "https://hf-mirror.com" + PYTHONUNBUFFERED: "1" + + # v3.0 先不支持 user code 执行 + user_code_path: "/private/user/code" + +service: + api: + host: "0.0.0.0" + port: 8080 + auth: + token_env: "MVP_INTERNAL_TOKEN" + sqlite: + db_path: "/private/common/db/mvp.sqlite3" + scheduler: + tick_s: 5 + retry_interval_s: 60 + max_running_tasks: 1 + +data: + user_root: "/private/users" + sftpgo: + enabled: true + # Returned to users by GET /api/v2/me. For h1 E2E, usually connect to the host IP. + host: "127.0.0.1" + sftp_port: 2022 + # Admin API base should include /api/v2 (SFTPGo OpenAPI server base). + # From head container, access SFTPGo by service name on the compose network. 
+ admin_api_base: "http://argus-sftpgo:8080/api/v2" + admin_user: "admin" + admin_password_env: "SFTPGO_ADMIN_PASSWORD" + retention: + jobs_trash_after_days: 3 + jobs_purge_after_days: 7 + janitor_interval_s: 3600 + diff --git a/src/mvp/docker-compose.yaml b/src/mvp/docker-compose.yaml index a74c1a3..6ed9194 100644 --- a/src/mvp/docker-compose.yaml +++ b/src/mvp/docker-compose.yaml @@ -38,6 +38,36 @@ services: HF_ENDPOINT: "https://hf-mirror.com" PYTHONUNBUFFERED: "1" + # v3.0: Data management service (SFTPGo). + # - SFTP: 2022 + # - Admin API/UI: 8080 (mapped to host 8081 to avoid collision with MVP API server on 8080) + # + # NOTE: This is for dev / h1 E2E. In prod you may use an internal mirror image/tag and different ports. + sftpgo: + image: drakkan/sftpgo:latest + container_name: argus-sftpgo + user: "0:0" + ports: + - "2022:2022" + - "8081:8080" + volumes: + - ../../shared:/private + - ../../shared/common/sftpgo:/var/lib/sftpgo + networks: + - argus-ray-net + environment: + # Create a default admin on first start (used by API server to manage users). + # Override on host as needed: + # export SFTPGO_ADMIN_PASSWORD=... + SFTPGO_DATA_PROVIDER__CREATE_DEFAULT_ADMIN: "true" + # Persist the sqlite DB under the mounted metadata dir; otherwise it defaults to a relative path. + SFTPGO_DATA_PROVIDER__NAME: "/var/lib/sftpgo/sftpgo.db" + SFTPGO_DEFAULT_ADMIN_USERNAME: "admin" + SFTPGO_DEFAULT_ADMIN_PASSWORD: "${SFTPGO_ADMIN_PASSWORD:-my-dev-sftpgo-admin}" + # Explicitly pin default ports via env-var schema (double-underscore for nesting). 
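As a side note on the compose service configured here: SFTPGo maps nested config options to environment variables with an `SFTPGO_` prefix, double underscores for nesting, and numeric segments indexing list entries (e.g. bindings). A rough sketch of that naming rule (illustrative helper, not SFTPGo code):

```python
def sftpgo_env_name(*path) -> str:
    """Map a nested SFTPGo option path to its env-var override name.

    SFTPGo uses the SFTPGO_ prefix, double underscores for nesting, and
    numeric segments to index list entries such as bindings.
    """
    return "SFTPGO_" + "__".join(str(seg).upper() for seg in path)

# httpd.bindings[0].port -> SFTPGO_HTTPD__BINDINGS__0__PORT
```

This is why pinning the admin UI port uses `SFTPGO_HTTPD__BINDINGS__0__PORT` rather than a flat key.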
+ SFTPGO_HTTPD__BINDINGS__0__PORT: "8080" + SFTPGO_SFTPD__BINDINGS__0__PORT: "2022" + ray_worker_0: image: argus/argus-ray-node:v2.5 container_name: argus-ray-worker-0 diff --git a/src/mvp/py/argus/service/app.py b/src/mvp/py/argus/service/app.py index 0d7c2db..b09ee5c 100644 --- a/src/mvp/py/argus/service/app.py +++ b/src/mvp/py/argus/service/app.py @@ -1,6 +1,7 @@ from __future__ import annotations import os +import secrets import threading from typing import Any @@ -12,7 +13,10 @@ from argus.ray.models import JobSpec, RayConfig from .config import V2Config from .db import Db +from .janitor import JobsJanitor from .scheduler import Scheduler +from .sftpgo import SFTPGoAdminClient, SFTPGoError +from .ui import register_ui_routes def _utc_now_iso() -> str: @@ -38,11 +42,48 @@ def create_app(config_path: str) -> FastAPI: db.init() scheduler = Scheduler(db=db, ray_cfg=ray_cfg, v2_cfg=v2_cfg) + janitor = JobsJanitor( + db=db, + user_root=v2_cfg.data.user_root, + trash_after_days=v2_cfg.data.retention.jobs_trash_after_days, + purge_after_days=v2_cfg.data.retention.jobs_purge_after_days, + interval_s=v2_cfg.data.retention.janitor_interval_s, + ) stop_flag = threading.Event() tool = scheduler.tool app = FastAPI(title="mvp-v2", version="2.0") + def _user_home(user_id: str) -> str: + base = v2_cfg.data.user_root.rstrip("/") + return f"{base}/{user_id}" + + def _ensure_user_dirs(user_id: str) -> None: + home = _user_home(user_id) + # Note: create a physical /common directory under each user's home so that + # SFTPGo WebClient shows a "common" entry at the root. The actual shared + # content is exposed via SFTPGo virtual folders under /common/{datasets,models}. 
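The per-user directory skeleton that `_ensure_user_dirs` creates below can be sketched standalone like this (paths and user id are illustrative; the real helper derives the home from `data.user_root`):

```python
import os

# Subdirectories created under each user's home; "trash/jobs" backs the
# retention janitor and "common" anchors the SFTPGo virtual folders.
USER_SUBDIRS = ("datasets", "models", "code", "jobs", "trash/jobs", "common")

def ensure_user_dirs(user_root: str, user_id: str) -> str:
    """Create the per-user skeleton; exist_ok makes repeated calls no-ops."""
    home = os.path.join(user_root, user_id)
    for rel in USER_SUBDIRS:
        os.makedirs(os.path.join(home, rel), exist_ok=True)
    return home
```

Idempotency matters because the API calls this both on user creation and best-effort from `GET /api/v2/me`.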
+ for rel in ("datasets", "models", "code", "jobs", "trash/jobs", "common"): + os.makedirs(f"{home}/{rel}", exist_ok=True) + + def _sftpgo_enabled() -> bool: + return bool(v2_cfg.data.sftpgo.enabled) + + def _sftpgo_client() -> SFTPGoAdminClient: + cfg = v2_cfg.data.sftpgo + if not cfg.admin_api_base: + raise HTTPException(status_code=500, detail="sftpgo enabled but data.sftpgo.admin_api_base is empty") + pw = os.environ.get(cfg.admin_password_env, "") + if not pw: + raise HTTPException(status_code=500, detail=f"missing env: {cfg.admin_password_env}") + shared_root = ray_cfg.shared_root.rstrip("/") + return SFTPGoAdminClient( + admin_api_base=str(cfg.admin_api_base), + admin_user=str(cfg.admin_user), + admin_password=pw, + common_root=f"{shared_root}/common", + ) + def _auth(req: Request) -> dict[str, Any]: token_env = v2_cfg.auth.token_env admin_token = os.environ.get(token_env, "") @@ -74,6 +115,9 @@ def create_app(config_path: str) -> FastAPI: def _startup() -> None: t = threading.Thread(target=scheduler.run_forever, args=(stop_flag,), daemon=True) t.start() + if int(janitor.interval_s) > 0: + tj = threading.Thread(target=janitor.run_forever, args=(stop_flag,), daemon=True) + tj.start() @app.on_event("shutdown") def _shutdown() -> None: @@ -93,7 +137,42 @@ def create_app(config_path: str) -> FastAPI: row = db.create_user(user_id=user_id, display_name=str(display_name) if display_name is not None else None) except Exception as e: raise HTTPException(status_code=409, detail=f"user create failed: {e!r}") - return {"user_id": row.get("user_id", user_id), "state": row.get("state", "ACTIVE")} + + # v3.0: create user dir structure (datasets/models/code/jobs/trash). + try: + _ensure_user_dirs(user_id) + except Exception as e: + raise HTTPException(status_code=500, detail=f"failed to create user dirs: {e!r}") + + out: dict[str, Any] = {"user_id": row.get("user_id", user_id), "state": row.get("state", "ACTIVE")} + + # v3.0: scheme A (password) SFTPGo integration. 
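The 409-handling that follows makes user creation retry-safe against SFTPGo. Condensed into a reusable pattern (fake client below, not the real `SFTPGoAdminClient`; the `"http error: 409"` substring matches how `SFTPGoError` messages are formatted in `sftpgo.py`):

```python
class SFTPGoError(RuntimeError):
    """Raised by the admin client on HTTP/connection failures."""

def upsert_sftp_user(client, username: str, password: str, home_dir: str) -> str:
    """Create a user; on HTTP 409 (already exists), fall back to resetting
    the password and re-enabling, so that retries converge to the same state."""
    try:
        client.create_user(username=username, password=password, home_dir=home_dir)
        return "created"
    except SFTPGoError as exc:
        if "http error: 409" not in str(exc):
            raise
        client.reset_password(username=username, new_password=password, home_dir=home_dir)
        client.enable_user(username=username, home_dir=home_dir)
        return "reset"
```

Either branch leaves the SFTPGo user enabled with a known password, which is what the one-time password return to the admin relies on.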
+ if _sftpgo_enabled(): + pw = secrets.token_urlsafe(12) + try: + _sftpgo_client().create_user(username=user_id, password=pw, home_dir=_user_home(user_id)) + except SFTPGoError as e: + # Make create user idempotent for retries: + # If the user already exists in SFTPGo, reset password and enable it. + if "http error: 409" in str(e): + try: + _sftpgo_client().reset_password(username=user_id, new_password=pw, home_dir=_user_home(user_id)) + _sftpgo_client().enable_user(username=user_id, home_dir=_user_home(user_id)) + except Exception as e2: + raise HTTPException(status_code=502, detail=f"sftpgo upsert user failed: {e2!r}") + else: + raise HTTPException(status_code=502, detail=f"sftpgo create user failed: {e!r}") + except HTTPException: + raise + except Exception as e: + raise HTTPException(status_code=502, detail=f"sftpgo create user failed: {e!r}") + out["sftp"] = { + "username": user_id, + "password": pw, # one-time return to admin; do not persist plaintext + "host": v2_cfg.data.sftpgo.host, + "port": int(v2_cfg.data.sftpgo.sftp_port), + } + return out @app.post("/api/v2/users/{user_id}/tokens") async def issue_token(user_id: str, req: Request) -> dict[str, Any]: @@ -104,6 +183,12 @@ def create_app(config_path: str) -> FastAPI: token = db.issue_token(user_id=user_id) return {"user_id": user_id, "token": token} + @app.get("/api/v2/users") + async def list_users(req: Request, limit: int = 200) -> dict[str, Any]: + _require_admin(req) + lim = max(1, min(int(limit), 1000)) + return {"users": db.list_users(limit=lim), "limit": lim} + @app.post("/api/v2/users/{user_id}:disable") async def disable_user(user_id: str, req: Request) -> dict[str, Any]: _require_admin(req) @@ -111,8 +196,122 @@ def create_app(config_path: str) -> FastAPI: if not u: raise HTTPException(status_code=404, detail="user not found") db.disable_user(user_id=user_id) + if _sftpgo_enabled(): + try: + _sftpgo_client().disable_user(username=user_id, home_dir=_user_home(user_id)) + except Exception: + # 
Best-effort: DB state is source of truth for API auth; SFTPGo sync can lag. + pass return {"user_id": user_id, "state": "DISABLED"} + @app.post("/api/v2/users/{user_id}/sftp:reset_password") + async def reset_sftp_password(user_id: str, req: Request) -> dict[str, Any]: + _require_admin(req) + u = db.get_user(user_id) + if not u: + raise HTTPException(status_code=404, detail="user not found") + if not _sftpgo_enabled(): + raise HTTPException(status_code=400, detail="sftpgo is not enabled") + pw = secrets.token_urlsafe(12) + try: + _sftpgo_client().reset_password(username=user_id, new_password=pw, home_dir=_user_home(user_id)) + except HTTPException: + raise + except Exception as e: + raise HTTPException(status_code=502, detail=f"sftpgo reset password failed: {e!r}") + return {"user_id": user_id, "password": pw} + + @app.post("/api/v2/me/sftp:reset_password") + async def reset_my_sftp_password(req: Request) -> dict[str, Any]: + """ + v3.0 WebUI: allow a user to rotate their own SFTPGo password and copy it. + Note: SFTPGo does not allow reading the current password, so this endpoint returns a new one-time password. 
+ """ + subject = _auth(req) + user_id = str(subject["user_id"]) + u = db.get_user(user_id) + if not u: + raise HTTPException(status_code=404, detail="user not found") + if not _sftpgo_enabled(): + raise HTTPException(status_code=400, detail="sftpgo is not enabled") + pw = secrets.token_urlsafe(12) + try: + _sftpgo_client().reset_password(username=user_id, new_password=pw, home_dir=_user_home(user_id)) + _sftpgo_client().enable_user(username=user_id, home_dir=_user_home(user_id)) + except Exception as e: + raise HTTPException(status_code=502, detail=f"sftpgo reset password failed: {e!r}") + return {"user_id": user_id, "password": pw} + + @app.get("/api/v2/me") + async def me(req: Request) -> dict[str, Any]: + subject = _auth(req) + user_id = str(subject["user_id"]) + try: + _ensure_user_dirs(user_id) + except Exception: + # Best-effort: user may still be able to use API even if FS init fails. + pass + # Best-effort: reconcile SFTPGo user to include /common read-only mounts. + # This makes the virtual folders visible without requiring a password reset. 
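From the WebUI side, rotating the SFTPGo password is a single authenticated POST to the endpoint above. A minimal sketch of the request a client would build (the token value is hypothetical; real tokens are issued by an admin via `POST /api/v2/users/{user_id}/tokens`):

```python
from urllib.request import Request

def reset_password_request(api_base: str, user_token: str) -> Request:
    """Build the POST a client sends to rotate its own SFTPGo password.

    The response body carries a fresh one-time plaintext password; the old
    password can never be read back, only replaced.
    """
    url = f"{api_base.rstrip('/')}/api/v2/me/sftp:reset_password"
    return Request(
        url,
        data=b"",
        headers={"Authorization": f"Bearer {user_token}"},
        method="POST",
    )
```

The WebUI `/ui/data` page performs exactly this call and offers the returned password for copying.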
+ if _sftpgo_enabled() and not subject.get("is_admin"): + try: + _sftpgo_client().enable_user(username=user_id, home_dir=_user_home(user_id)) + except Exception: + pass + home = _user_home(user_id) + out: dict[str, Any] = { + "user_id": user_id, + "is_admin": bool(subject.get("is_admin")), + "paths": { + "home": home, + "datasets": f"{home}/datasets", + "models": f"{home}/models", + "code": f"{home}/code", + "jobs": f"{home}/jobs", + "trash_jobs": f"{home}/trash/jobs", + }, + "retention": { + "jobs_trash_after_days": int(v2_cfg.data.retention.jobs_trash_after_days), + "jobs_purge_after_days": int(v2_cfg.data.retention.jobs_purge_after_days), + }, + } + if _sftpgo_enabled(): + out["sftp"] = { + "host": v2_cfg.data.sftpgo.host, + "port": int(v2_cfg.data.sftpgo.sftp_port), + "username": user_id, + } + return out + + @app.get("/api/v2/tasks") + async def list_tasks(req: Request, limit: int = 200, offset: int = 0, states: str | None = None) -> dict[str, Any]: + subject = _auth(req) + lim = max(1, min(int(limit), 1000)) + off = max(0, int(offset)) + state_list: list[str] | None = None + if states: + raw = [s.strip() for s in str(states).split(",") if s.strip()] + # Keep it strict to avoid typos silently returning empty results. 
+ allowed = { + "QUEUED", + "PENDING_RESOURCES", + "SUBMITTING", + "SUBMITTED", + "RUNNING", + "SUCCEEDED", + "FAILED", + "CANCELED", + } + for s in raw: + if s not in allowed: + raise HTTPException(status_code=400, detail=f"invalid state: {s}") + state_list = raw or None + if subject.get("is_admin"): + tasks = db.list_tasks(user_id=None, states=state_list, limit=lim, offset=off) + else: + tasks = db.list_tasks(user_id=str(subject["user_id"]), states=state_list, limit=lim, offset=off) + return {"tasks": tasks, "limit": lim, "offset": off, "has_more": bool(len(tasks) == lim)} + @app.post("/api/v2/tasks") async def submit_task(req: Request) -> dict[str, Any]: subject = _auth(req) @@ -126,13 +325,40 @@ def create_app(config_path: str) -> FastAPI: except Exception as e: raise HTTPException(status_code=400, detail=f"invalid jobspec: {e!r}") - # v2.5 constraint: training inputs must come from /private/common (dev/prod统一)。 - common_prefix = ray_cfg.shared_root.rstrip("/") + "/common/" - for k, v in (("code_path", spec.code_path), ("train_file", spec.train_file), ("val_file", spec.val_file)): + # v3.0 path policy: + # - code_path: only allow /private/common/... + # - train/val: allow /private/common/datasets/... OR /private/users//datasets/... + # - model_id: if it looks like a local path (/private/...), allow only models dirs: + # /private/common/models/... OR /private/users//models/... 
+ root = ray_cfg.shared_root.rstrip("/") + common_prefix = f"{root}/common/" + user_prefix = f"{root}/users/{str(subject['user_id']).strip()}/" + + common_datasets_prefix = f"{common_prefix}datasets/" + user_datasets_prefix = f"{user_prefix}datasets/" + common_models_prefix = f"{common_prefix}models/" + user_models_prefix = f"{user_prefix}models/" + + if not str(spec.code_path).startswith(common_prefix): + raise HTTPException(status_code=400, detail=f"code_path must start with {common_prefix}") + + for k, v in (("train_file", spec.train_file), ("val_file", spec.val_file)): if v is None: continue - if not str(v).startswith(common_prefix): - raise HTTPException(status_code=400, detail=f"{k} must start with {common_prefix}") + sv = str(v) + if not (sv.startswith(common_datasets_prefix) or sv.startswith(user_datasets_prefix)): + raise HTTPException( + status_code=400, + detail=f"{k} must start with {common_datasets_prefix} or {user_datasets_prefix}", + ) + + model_id = str(spec.model_id) + if model_id.startswith(f"{root}/"): + if not (model_id.startswith(common_models_prefix) or model_id.startswith(user_models_prefix)): + raise HTTPException( + status_code=400, + detail=f"model_id local path must start with {common_models_prefix} or {user_models_prefix}", + ) task_id = new_task_id(spec.workload, user_id=str(subject["user_id"])) db.create_task_v25( @@ -257,4 +483,7 @@ def create_app(config_path: str) -> FastAPI: return db.list_queue() return db.list_queue(user_id=str(subject["user_id"])) + # v3.0: minimal WebUI (no server-side session; token stored in browser localStorage). 
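On the consuming side, the `has_more` flag returned by `GET /api/v2/tasks` is simply `len(tasks) == limit`, so the Completed view can page through results with a plain offset loop. An illustrative client helper (`list_page` stands in for the actual HTTP call):

```python
def fetch_all(list_page, page_size: int = 50) -> list:
    """Drain a paginated listing by advancing `offset` while has_more is set.

    Since has_more is len(tasks) == limit, an exactly-full final page costs
    one extra (empty) request, which is harmless.
    """
    tasks, offset = [], 0
    while True:
        page = list_page(limit=page_size, offset=offset)
        tasks.extend(page["tasks"])
        if not page["has_more"]:
            return tasks
        offset += page_size
```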
+ register_ui_routes(app) + return app diff --git a/src/mvp/py/argus/service/config.py b/src/mvp/py/argus/service/config.py index 7d3f5dd..3ece213 100644 --- a/src/mvp/py/argus/service/config.py +++ b/src/mvp/py/argus/service/config.py @@ -27,12 +27,37 @@ class V2SchedulerConfig: max_running_tasks: int = 1 +@dataclass(frozen=True) +class V2RetentionConfig: + jobs_trash_after_days: int = 3 + jobs_purge_after_days: int = 7 + janitor_interval_s: int = 3600 + + +@dataclass(frozen=True) +class V2SFTPGoConfig: + enabled: bool = False + host: str = "" + sftp_port: int = 2022 + admin_api_base: str = "" + admin_user: str = "admin" + admin_password_env: str = "SFTPGO_ADMIN_PASSWORD" + + +@dataclass(frozen=True) +class V2DataConfig: + user_root: str + sftpgo: V2SFTPGoConfig + retention: V2RetentionConfig + + @dataclass(frozen=True) class V2Config: api: V2ApiConfig auth: V2AuthConfig sqlite: V2SqliteConfig scheduler: V2SchedulerConfig + data: V2DataConfig @staticmethod def from_root_dict(root: dict[str, Any]) -> "V2Config": @@ -58,9 +83,19 @@ class V2Config: else: shared_root = str(root.get("shared_root") or "/private") + data = root.get("data") or {} + if not isinstance(data, dict): + raise ValueError("config.data must be a mapping") + sftpgo = data.get("sftpgo") or {} + retention = data.get("retention") or {} + if not isinstance(sftpgo, dict) or not isinstance(retention, dict): + raise ValueError("config.data.{sftpgo,retention} must be mappings") + default_db_path = f"{shared_root}/common/db/mvp.sqlite3" db_path = str(sqlite.get("db_path") or default_db_path) + user_root = str(data.get("user_root") or f"{shared_root}/users") + return V2Config( api=V2ApiConfig( host=str(api.get("host") or "0.0.0.0"), @@ -73,4 +108,20 @@ class V2Config: retry_interval_s=int(scheduler.get("retry_interval_s") or 60), max_running_tasks=int(scheduler.get("max_running_tasks") or 1), ), + data=V2DataConfig( + user_root=user_root, + sftpgo=V2SFTPGoConfig( + enabled=bool(sftpgo.get("enabled") or 
False), + host=str(sftpgo.get("host") or ""), + sftp_port=int(sftpgo.get("sftp_port") or 2022), + admin_api_base=str(sftpgo.get("admin_api_base") or ""), + admin_user=str(sftpgo.get("admin_user") or "admin"), + admin_password_env=str(sftpgo.get("admin_password_env") or "SFTPGO_ADMIN_PASSWORD"), + ), + retention=V2RetentionConfig( + jobs_trash_after_days=int(retention.get("jobs_trash_after_days") or 3), + jobs_purge_after_days=int(retention.get("jobs_purge_after_days") or 7), + janitor_interval_s=int(retention.get("janitor_interval_s") or 3600), + ), + ), ) diff --git a/src/mvp/py/argus/service/db.py b/src/mvp/py/argus/service/db.py index 0aa36d9..8ca314b 100644 --- a/src/mvp/py/argus/service/db.py +++ b/src/mvp/py/argus/service/db.py @@ -40,7 +40,8 @@ class Db: user_id TEXT PRIMARY KEY, display_name TEXT, state TEXT NOT NULL, - created_at TEXT NOT NULL + created_at TEXT NOT NULL, + updated_at TEXT NOT NULL ) """ ) @@ -49,6 +50,7 @@ class Db: CREATE TABLE IF NOT EXISTS api_tokens ( token_hash TEXT PRIMARY KEY, user_id TEXT NOT NULL, + token_plain TEXT NOT NULL, created_at TEXT NOT NULL, last_used_at TEXT, FOREIGN KEY (user_id) REFERENCES users(user_id) ON DELETE CASCADE @@ -78,6 +80,15 @@ class Db: conn.execute("ALTER TABLE tasks ADD COLUMN user_id TEXT") except sqlite3.OperationalError: pass + # Best-effort: add missing columns (forward compatibility). + try: + conn.execute("ALTER TABLE users ADD COLUMN updated_at TEXT") + except sqlite3.OperationalError: + pass + try: + conn.execute("ALTER TABLE api_tokens ADD COLUMN token_plain TEXT") + except sqlite3.OperationalError: + pass conn.execute( """ CREATE TABLE IF NOT EXISTS attempts ( @@ -168,10 +179,10 @@ class Db: with self.tx() as conn: conn.execute( """ - INSERT INTO users (user_id, display_name, state, created_at) - VALUES (?, ?, 'ACTIVE', ?) + INSERT INTO users (user_id, display_name, state, created_at, updated_at) + VALUES (?, ?, 'ACTIVE', ?, ?) 
""", - (user_id, display_name, now), + (user_id, display_name, now, now), ) conn.execute( "INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'USER_CREATED', ?)", @@ -183,7 +194,7 @@ class Db: def disable_user(self, *, user_id: str) -> None: now = _utc_now_iso() with self.tx() as conn: - conn.execute("UPDATE users SET state = 'DISABLED' WHERE user_id = ?", (user_id,)) + conn.execute("UPDATE users SET state = 'DISABLED', updated_at = ? WHERE user_id = ?", (now, user_id)) conn.execute( "INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'USER_DISABLED', ?)", (now, user_id), @@ -195,15 +206,18 @@ class Db: return dict(row) if row else None def issue_token(self, *, user_id: str) -> str: - # Returns plaintext token once; stores hash only. + # Returns plaintext token once. + # Note: For v3.0 WebUI admin convenience we persist the plaintext token so admins + # can re-copy it later. This is acceptable only for internal/dev deployments. now = _utc_now_iso() token = f"mvp_u_{user_id}_{secrets.token_urlsafe(18)}" token_hash = self._hash_token(token) with self.tx() as conn: conn.execute( - "INSERT INTO api_tokens (token_hash, user_id, created_at) VALUES (?, ?, ?)", - (token_hash, user_id, now), + "INSERT INTO api_tokens (token_hash, user_id, token_plain, created_at) VALUES (?, ?, ?, ?)", + (token_hash, user_id, token, now), ) + conn.execute("UPDATE users SET updated_at = ? WHERE user_id = ?", (now, user_id)) conn.execute( "INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (NULL, ?, 'TOKEN_ISSUED', ?)", (now, user_id), @@ -226,8 +240,48 @@ class Db: return None now = _utc_now_iso() conn.execute("UPDATE api_tokens SET last_used_at = ? WHERE token_hash = ?", (now, token_hash)) + conn.execute("UPDATE users SET updated_at = ? 
WHERE user_id = ?", (now, str(row["user_id"]))) return {"user_id": row["user_id"], "state": row["state"]} + def list_users(self, limit: int = 200) -> list[dict[str, Any]]: + with self._connect() as conn: + rows = conn.execute( + """ + SELECT + u.user_id, + u.display_name, + u.state, + u.created_at, + u.updated_at, + ( + SELECT t.token_plain + FROM api_tokens t + WHERE t.user_id = u.user_id + ORDER BY t.created_at DESC + LIMIT 1 + ) AS token, + ( + SELECT t.created_at + FROM api_tokens t + WHERE t.user_id = u.user_id + ORDER BY t.created_at DESC + LIMIT 1 + ) AS token_created_at, + ( + SELECT t.last_used_at + FROM api_tokens t + WHERE t.user_id = u.user_id + ORDER BY t.created_at DESC + LIMIT 1 + ) AS token_last_used_at + FROM users u + ORDER BY u.created_at DESC + LIMIT ? + """, + (int(limit),), + ).fetchall() + return [dict(r) for r in rows] + def get_task(self, task_id: str) -> dict[str, Any] | None: with self._connect() as conn: row = conn.execute("SELECT * FROM tasks WHERE task_id = ?", (task_id,)).fetchone() @@ -269,6 +323,40 @@ class Db: running = conn.execute(running_sql, tuple(params)).fetchall() return {"pending": [dict(r) for r in pending], "running": [dict(r) for r in running]} + def list_tasks( + self, + *, + user_id: str | None = None, + states: list[str] | None = None, + limit: int = 200, + offset: int = 0, + ) -> list[dict[str, Any]]: + """ + Returns recent tasks (including terminal tasks). 
+ """ + with self._connect() as conn: + params: list[Any] = [] + where_clauses: list[str] = [] + if user_id is not None: + where_clauses.append("user_id = ?") + params.append(user_id) + if states: + placeholders = ",".join(["?"] * len(states)) + where_clauses.append(f"state IN ({placeholders})") + params.extend(states) + where_sql = f" WHERE {' AND '.join(where_clauses)}" if where_clauses else "" + sql = ( + "SELECT task_id, user_id, workload, state, nnodes, n_gpus_per_node, latest_attempt_no, created_at, updated_at, error_summary " + "FROM tasks" + f"{where_sql} " + "ORDER BY created_at DESC " + "LIMIT ? OFFSET ?" + ) + params.append(int(limit)) + params.append(max(0, int(offset))) + rows = conn.execute(sql, tuple(params)).fetchall() + return [dict(r) for r in rows] + def count_running(self) -> int: with self._connect() as conn: row = conn.execute( @@ -383,3 +471,25 @@ class Db: "INSERT INTO events (task_id, ts, event_type, payload_json) VALUES (?, ?, 'ATTEMPT_UPDATE', ?)", (task_id, now, None), ) + + def list_ended_attempts_before(self, *, end_time_le: str, limit: int = 2000) -> list[dict[str, Any]]: + """ + Used by the jobs janitor: + - Only considers tasks with a non-null user_id (v2.5+). + - Returns attempts that have end_time <= end_time_le. + """ + with self._connect() as conn: + rows = conn.execute( + """ + SELECT t.task_id, t.user_id, a.ray_submission_id, a.end_time + FROM attempts a + JOIN tasks t ON t.task_id = a.task_id + WHERE t.user_id IS NOT NULL + AND a.end_time IS NOT NULL + AND a.end_time <= ? + ORDER BY a.end_time ASC + LIMIT ? 
+ """, + (str(end_time_le), int(limit)), + ).fetchall() + return [dict(r) for r in rows] diff --git a/src/mvp/py/argus/service/janitor.py b/src/mvp/py/argus/service/janitor.py new file mode 100644 index 0000000..61a5e59 --- /dev/null +++ b/src/mvp/py/argus/service/janitor.py @@ -0,0 +1,105 @@ +from __future__ import annotations + +import os +import shutil +import time +from dataclasses import dataclass +from datetime import datetime, timedelta, timezone + +from .db import Db + + +def _parse_iso_z(ts: str) -> datetime: + # Stored as "YYYY-MM-DDTHH:MM:SSZ" (naive UTC). Parse into aware UTC. + return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc) + + +def _iso_z(dt: datetime) -> str: + return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z") + + +@dataclass +class JobsJanitor: + db: Db + user_root: str + trash_after_days: int = 3 + purge_after_days: int = 7 + interval_s: int = 3600 + + def __post_init__(self) -> None: + if int(self.trash_after_days) < 0 or int(self.purge_after_days) < 0: + raise ValueError("retention days must be non-negative") + if int(self.purge_after_days) and int(self.purge_after_days) < int(self.trash_after_days): + raise ValueError("purge_after_days must be >= trash_after_days") + + def _job_dir(self, *, user_id: str, ray_submission_id: str) -> str: + base = self.user_root.rstrip("/") + return f"{base}/{user_id}/jobs/{ray_submission_id}" + + def _trash_dir(self, *, user_id: str, ray_submission_id: str) -> str: + base = self.user_root.rstrip("/") + return f"{base}/{user_id}/trash/jobs/{ray_submission_id}" + + def tick_once(self, *, now: datetime | None = None, limit: int = 2000) -> None: + if int(self.trash_after_days) <= 0 and int(self.purge_after_days) <= 0: + return + + now_dt = now or datetime.now(timezone.utc) + move_cutoff = now_dt - timedelta(days=int(self.trash_after_days)) + move_cutoff_iso = _iso_z(move_cutoff) + + rows = 
self.db.list_ended_attempts_before(end_time_le=move_cutoff_iso, limit=int(limit)) + for r in rows: + user_id = str(r.get("user_id") or "").strip() + sid = str(r.get("ray_submission_id") or "").strip() + end_time = str(r.get("end_time") or "").strip() + if not user_id or not sid or not end_time: + continue + + try: + ended_at = _parse_iso_z(end_time) + except Exception: + continue + + age_days = (now_dt - ended_at).total_seconds() / 86400.0 + src = self._job_dir(user_id=user_id, ray_submission_id=sid) + dst = self._trash_dir(user_id=user_id, ray_submission_id=sid) + dst_parent = os.path.dirname(dst) + + # Between trash and purge: ensure in trash. + if age_days >= float(self.trash_after_days) and age_days < float(self.purge_after_days or 10**9): + if os.path.exists(src) and not os.path.exists(dst): + os.makedirs(dst_parent, exist_ok=True) + try: + shutil.move(src, dst) + except Exception: + pass + continue + + # Purge: move to trash (if still in jobs) then delete from trash. + if int(self.purge_after_days) > 0 and age_days >= float(self.purge_after_days): + if os.path.exists(src) and not os.path.exists(dst): + os.makedirs(dst_parent, exist_ok=True) + try: + shutil.move(src, dst) + except Exception: + pass + if os.path.exists(dst): + try: + shutil.rmtree(dst) + except Exception: + pass + + def run_forever(self, stop_flag: object) -> None: + try: + is_set = stop_flag.is_set # type: ignore[attr-defined] + except Exception: + raise ValueError("stop_flag must be a threading.Event-like object with is_set()") + + while not is_set(): + try: + self.tick_once() + except Exception: + pass + time.sleep(max(1, int(self.interval_s))) + diff --git a/src/mvp/py/argus/service/sftpgo.py b/src/mvp/py/argus/service/sftpgo.py new file mode 100644 index 0000000..09b2681 --- /dev/null +++ b/src/mvp/py/argus/service/sftpgo.py @@ -0,0 +1,221 @@ +from __future__ import annotations + +import base64 +import json +from dataclasses import dataclass +from urllib.error import HTTPError, URLError 
+from urllib.request import Request, urlopen + + +class SFTPGoError(RuntimeError): + pass + + +@dataclass(frozen=True) +class SFTPGoAdminClient: + """ + Minimal SFTPGo admin API client (v3.0). + + SFTPGo OpenAPI documents the admin token flow: + - GET /api/v2/token with HTTP BasicAuth -> returns {"access_token": "..."}. + - Use Authorization: Bearer for admin endpoints like POST /api/v2/users. + + See upstream OpenAPI for details: + https://raw.githubusercontent.com/drakkan/sftpgo/main/openapi/openapi.yaml + """ + + admin_api_base: str + admin_user: str + admin_password: str + common_root: str = "/private/common" + + def _url(self, path: str) -> str: + base = self.admin_api_base.rstrip("/") + p = path if path.startswith("/") else f"/{path}" + return f"{base}{p}" + + def _basic_auth_header(self) -> str: + raw = f"{self.admin_user}:{self.admin_password}".encode("utf-8") + return "Basic " + base64.b64encode(raw).decode("ascii") + + def _get_json(self, url: str, headers: dict[str, str]) -> dict: + req = Request(url, headers=headers, method="GET") + try: + with urlopen(req, timeout=10) as resp: + return json.loads(resp.read().decode("utf-8")) + except HTTPError as e: + raise SFTPGoError(f"sftpgo http error: {e.code} {e.reason}") from e + except URLError as e: + raise SFTPGoError(f"sftpgo connection error: {e!r}") from e + + def _post_json(self, url: str, payload: dict, headers: dict[str, str]) -> None: + data = json.dumps(payload).encode("utf-8") + req = Request(url, data=data, headers=headers, method="POST") + try: + with urlopen(req, timeout=10) as resp: + _ = resp.read() + except HTTPError as e: + raise SFTPGoError(f"sftpgo http error: {e.code} {e.reason}") from e + except URLError as e: + raise SFTPGoError(f"sftpgo connection error: {e!r}") from e + + def _put_json(self, url: str, payload: dict, headers: dict[str, str]) -> None: + data = json.dumps(payload).encode("utf-8") + req = Request(url, data=data, headers=headers, method="PUT") + try: + with urlopen(req, 
timeout=10) as resp: + _ = resp.read() + except HTTPError as e: + raise SFTPGoError(f"sftpgo http error: {e.code} {e.reason}") from e + except URLError as e: + raise SFTPGoError(f"sftpgo connection error: {e!r}") from e + + def _admin_token(self) -> str: + url = self._url("/token") + obj = self._get_json(url, headers={"Authorization": self._basic_auth_header()}) + tok = str(obj.get("access_token") or "").strip() + if not tok: + raise SFTPGoError("sftpgo token response missing access_token") + return tok + + def _auth_headers(self, tok: str) -> dict[str, str]: + return {"Authorization": f"Bearer {tok}", "Content-Type": "application/json"} + + def _ensure_folder(self, *, tok: str, name: str, mapped_path: str) -> None: + url = self._url("/folders") + try: + self._post_json(url, {"name": name, "mapped_path": mapped_path}, headers=self._auth_headers(tok)) + except SFTPGoError as e: + # Idempotent + self-healing: + # If it already exists, update mapped_path to the desired value. + if "409" in str(e): + self._put_json( + self._url(f"/folders/{name}"), + {"name": name, "mapped_path": mapped_path}, + headers=self._auth_headers(tok), + ) + return + raise + + def _ensure_common_folders(self, *, tok: str) -> None: + # Important: do NOT map datasets to /private/common/datasets because that path is + # a symlink farm (e.g. gsm8k -> /private/datasets/gsm8k) and SFTPGo WebClient + # will treat symlink traversal as escaping the virtual folder root, resulting + # in "permission denied". Map directly to the real datasets root. + common = self.common_root.rstrip("/") + if common.endswith("/common"): + shared_root = common[: -len("/common")] or "/private" + else: + shared_root = "/private" + + self._ensure_folder(tok=tok, name="common_datasets", mapped_path=f"{shared_root}/datasets") + # Expose HF cache read-only so users can inspect downloaded models/datasets. 
+ self._ensure_folder(tok=tok, name="common_hf", mapped_path=f"{shared_root}/hf") + + def _get_user(self, *, tok: str, username: str) -> dict: + url = self._url(f"/users/{username}") + return self._get_json(url, headers={"Authorization": f"Bearer {tok}"}) + + def _put_user(self, *, tok: str, username: str, payload: dict) -> None: + url = self._url(f"/users/{username}") + self._put_json(url, payload, headers=self._auth_headers(tok)) + + def _apply_common_readonly(self, user_payload: dict) -> dict: + # Path-based permissions: make /common/* read-only while keeping home writeable. + perms = dict(user_payload.get("permissions") or {"/": ["*"]}) + # Ensure /common is visible as a directory and can be traversed. + perms["/common"] = ["list"] + perms["/common/datasets"] = ["list", "download"] + perms["/common/hf"] = ["list", "download"] + user_payload["permissions"] = perms + + desired_vf = [ + {"name": "common_datasets", "virtual_path": "/common/datasets"}, + {"name": "common_hf", "virtual_path": "/common/hf"}, + ] + existing = user_payload.get("virtual_folders") or [] + if not isinstance(existing, list): + existing = [] + seen = {(vf.get("name"), vf.get("virtual_path")) for vf in existing if isinstance(vf, dict)} + merged = list(existing) + for vf in desired_vf: + key = (vf["name"], vf["virtual_path"]) + if key not in seen: + merged.append(vf) + user_payload["virtual_folders"] = merged + return user_payload + + def create_user(self, *, username: str, password: str, home_dir: str) -> None: + tok = self._admin_token() + self._ensure_common_folders(tok=tok) + url = self._url("/users") + payload: dict = { + "status": 1, + "username": username, + "password": password, + "home_dir": home_dir, + "permissions": { + "/": ["*"], + "/common": ["list"], + "/common/datasets": ["list", "download"], + "/common/hf": ["list", "download"], + }, + "virtual_folders": [ + {"name": "common_datasets", "virtual_path": "/common/datasets"}, + {"name": "common_hf", "virtual_path": "/common/hf"}, 
+ ], + } + self._post_json( + url, + payload, + headers=self._auth_headers(tok), + ) + + def disable_user(self, *, username: str, home_dir: str) -> None: + tok = self._admin_token() + self._ensure_common_folders(tok=tok) + cur = self._get_user(tok=tok, username=username) + payload: dict = { + "username": username, + "status": 0, + "home_dir": home_dir, + "uid": cur.get("uid", 0), + "gid": cur.get("gid", 0), + "permissions": cur.get("permissions") or {"/": ["*"]}, + "virtual_folders": cur.get("virtual_folders") or [], + } + self._apply_common_readonly(payload) + self._put_user(tok=tok, username=username, payload=payload) + + def enable_user(self, *, username: str, home_dir: str) -> None: + tok = self._admin_token() + self._ensure_common_folders(tok=tok) + cur = self._get_user(tok=tok, username=username) + payload: dict = { + "username": username, + "status": 1, + "home_dir": home_dir, + "uid": cur.get("uid", 0), + "gid": cur.get("gid", 0), + "permissions": cur.get("permissions") or {"/": ["*"]}, + "virtual_folders": cur.get("virtual_folders") or [], + } + self._apply_common_readonly(payload) + self._put_user(tok=tok, username=username, payload=payload) + + def reset_password(self, *, username: str, new_password: str, home_dir: str) -> None: + tok = self._admin_token() + self._ensure_common_folders(tok=tok) + cur = self._get_user(tok=tok, username=username) + payload: dict = { + "username": username, + "status": 1, + "home_dir": home_dir, + "uid": cur.get("uid", 0), + "gid": cur.get("gid", 0), + "permissions": cur.get("permissions") or {"/": ["*"]}, + "virtual_folders": cur.get("virtual_folders") or [], + "password": new_password, + } + self._apply_common_readonly(payload) + self._put_user(tok=tok, username=username, payload=payload) diff --git a/src/mvp/py/argus/service/ui.py b/src/mvp/py/argus/service/ui.py new file mode 100644 index 0000000..27de826 --- /dev/null +++ b/src/mvp/py/argus/service/ui.py @@ -0,0 +1,629 @@ +from __future__ import annotations + +import 
html +import json + +from fastapi import FastAPI +from fastapi.responses import HTMLResponse, RedirectResponse + + +_BASE_CSS = """ +:root { --bg:#0b1020; --panel:#111a33; --muted:#95a3c6; --fg:#e8eeff; --accent:#7aa2ff; --danger:#ff6b6b; --ok:#3ddc97; } +* { box-sizing: border-box; } +body { margin:0; font-family: ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Helvetica, Arial; background:var(--bg); color:var(--fg); } +a { color:var(--accent); text-decoration:none; } +.layout { display:flex; min-height:100vh; } +.nav { width: 240px; padding:16px; background: linear-gradient(180deg, #0e1630, #0b1020); border-right: 1px solid rgba(255,255,255,0.06); } +.brand { font-weight: 700; letter-spacing: .2px; margin-bottom: 12px; } +.nav a { display:block; padding:10px 10px; border-radius:10px; color: var(--fg); opacity: .9; } +.nav a.active { background: rgba(122,162,255,0.14); border: 1px solid rgba(122,162,255,0.22); } +.nav a:hover { background: rgba(255,255,255,0.06); } +.main { flex:1; padding: 20px 24px; } +.card { background: rgba(255,255,255,0.04); border: 1px solid rgba(255,255,255,0.08); border-radius: 14px; padding: 16px; } +.row { display:flex; gap: 12px; align-items:center; flex-wrap: wrap; } +.muted { color: var(--muted); } +.btn { border: 1px solid rgba(255,255,255,0.16); background: rgba(255,255,255,0.06); color: var(--fg); padding: 10px 12px; border-radius: 10px; cursor: pointer; } +.btn:hover { background: rgba(255,255,255,0.10); } +.btn.danger { border-color: rgba(255,107,107,0.35); background: rgba(255,107,107,0.10); } +.pill { display:inline-block; padding: 2px 10px; border-radius: 999px; border: 1px solid rgba(255,255,255,0.16); font-size: 12px; } +.pill.ok { border-color: rgba(61,220,151,0.35); background: rgba(61,220,151,0.12); } +.pill.bad { border-color: rgba(255,107,107,0.35); background: rgba(255,107,107,0.12); } +textarea, input { width: 100%; color: var(--fg); background: rgba(255,255,255,0.06); border: 1px solid 
rgba(255,255,255,0.12); border-radius: 12px; padding: 10px 12px; outline: none; } +button:disabled { opacity: .45; cursor: not-allowed; } +pre { white-space: pre-wrap; word-break: break-word; } +table { width:100%; border-collapse: collapse; } +th, td { padding: 10px 8px; border-bottom: 1px solid rgba(255,255,255,0.08); text-align:left; } +""".strip() + + +_BASE_JS = """ +function mvpTokenGet() { + return (localStorage.getItem("mvp_token") || "").trim(); +} +function mvpTokenSet(v) { + localStorage.setItem("mvp_token", (v || "").trim()); +} +function mvpSftpPasswordGet() { + return (localStorage.getItem("mvp_sftp_password") || "").trim(); +} +function mvpSftpPasswordSet(v) { + localStorage.setItem("mvp_sftp_password", (v || "").trim()); +} +async function apiFetch(path, opts) { + opts = opts || {}; + opts.headers = opts.headers || {}; + const tok = mvpTokenGet(); + if (tok) opts.headers["Authorization"] = "Bearer " + tok; + return fetch(path, opts); +} +async function apiJson(path, opts) { + const resp = await apiFetch(path, opts); + const text = await resp.text(); + if (!resp.ok) { + const err = new Error("HTTP " + resp.status); + err.status = resp.status; + err.body = text; + throw err; + } + return JSON.parse(text); +} +function fmtJson(obj) { + try { return JSON.stringify(obj, null, 2); } catch (e) { return String(obj); } +} +function curOriginWithPort(port) { + const proto = window.location.protocol; + const host = window.location.hostname; + return proto + "//" + host + ":" + port; +} +async function copyText(v) { + if (!v) return false; + try { + await navigator.clipboard.writeText(v); + return true; + } catch (e) { + // Fallback for non-secure contexts (http) or older browsers. 
+ try { + const ta = document.createElement("textarea"); + ta.value = v; + ta.style.position = "fixed"; + ta.style.opacity = "0"; + document.body.appendChild(ta); + ta.focus(); + ta.select(); + const ok = document.execCommand("copy"); + document.body.removeChild(ta); + return ok; + } catch (e2) { + return false; + } + } +} +document.addEventListener("DOMContentLoaded", () => { + const el = document.getElementById("nav-ray-dashboard"); + if (el) el.href = curOriginWithPort(8265); +}); +""".strip() + + +def _nav(active: str) -> str: + links = [ + ("login", "/ui/login", "Login"), + ("tasks", "/ui/tasks", "Tasks"), + ("new", "/ui/tasks/new", "New Task"), + ("data", "/ui/data", "Data"), + ("admin", "/ui/admin", "Admin"), + ("ray", "#", "Ray Dashboard"), + ] + items = [] + for key, href, label in links: + cls = "active" if key == active else "" + extra = "" + if key == "ray": + extra = ' id="nav-ray-dashboard" target="_blank" rel="noopener"' + items.append(f'<a href="{href}" class="{cls}"{extra}>{html.escape(label)}</a>') + return "\n".join(items)
+ +
+ {body} +
+
+ + + +""" + + +def register_ui_routes(app: FastAPI) -> None: + @app.get("/ui") + async def ui_root() -> RedirectResponse: + return RedirectResponse(url="/ui/tasks") + + @app.get("/ui/login") + async def ui_login() -> HTMLResponse: + body = """ +

Login

+
+
Paste your API token (without the Bearer prefix).
+
+ +
+
+ + + Go to Tasks +
+
+
+
+""".strip() + script = """ +const tokEl = document.getElementById("tok"); +const msg = document.getElementById("msg"); +tokEl.value = mvpTokenGet(); +document.getElementById("save").onclick = () => { mvpTokenSet(tokEl.value); msg.textContent = "Saved."; }; +document.getElementById("clear").onclick = () => { mvpTokenSet(""); tokEl.value = ""; msg.textContent = "Cleared."; }; +""".strip() + return HTMLResponse(content=_page("Login", "login", body, script)) + + @app.get("/ui/tasks") + async def ui_tasks() -> HTMLResponse: + body = """ +

Tasks

+
+
+ + New Task +
+
+
Loading...
+
+""".strip() + script = """ +const out = document.getElementById("out"); +async function refresh() { + out.textContent = "Loading..."; + try { + const q = await apiJson("/api/v2/queue"); + const completedLimit = 25; + const completedOffset = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0; + const done = await apiJson("/api/v2/tasks?limit=" + completedLimit + "&offset=" + completedOffset + "&states=SUCCEEDED,FAILED,CANCELED"); + + function pill(state) { + const s = String(state || ""); + if (s === "SUCCEEDED") return `${s}`; + if (s === "FAILED") return `${s}`; + if (s === "CANCELED") return `${s}`; + if (s === "RUNNING") return `${s}`; + if (s === "QUEUED" || s === "PENDING_RESOURCES" || s === "SUBMITTING" || s === "SUBMITTED") return `${s}`; + return `${s}`; + } + function row(t) { + const id = t.task_id; + return ` + ${id} + ${t.workload} + ${pill(t.state)} + ${t.nnodes} x ${t.n_gpus_per_node} GPU + ${t.updated_at || ""} + `; + } + + const running = (q.running || []).map(row).join(""); + const pending = (q.pending || []).map(row).join(""); + const doneRows = (done.tasks || []).map(row).join(""); + const pageNo = Math.floor(completedOffset / completedLimit) + 1; + const prevDisabled = completedOffset <= 0; + const nextDisabled = !done.has_more; + out.innerHTML = ` +
Tip: configure token in Login.
+
+

Running

+ ${running || ""}
TaskWorkloadStateResourcesUpdated
(none)
+
+

Pending

+ ${pending || ""}
TaskWorkloadStateResourcesUpdated
(none)
+
+

Completed

+
+
Page ${pageNo}
+
+ + +
+
+ ${doneRows || ""}
TaskWorkloadStateResourcesUpdated
(none)
+ `; + + const prevBtn = document.getElementById("done-prev"); + const nextBtn = document.getElementById("done-next"); + if (prevBtn) prevBtn.onclick = () => { + const cur = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0; + const next = Math.max(0, cur - completedLimit); + localStorage.setItem("mvp_completed_offset", String(next)); + refresh(); + }; + if (nextBtn) nextBtn.onclick = () => { + const cur = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0; + const next = cur + completedLimit; + localStorage.setItem("mvp_completed_offset", String(next)); + refresh(); + }; + } catch (e) { + out.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e)); + } +} +document.getElementById("refresh").onclick = refresh; +refresh(); +""".strip() + return HTMLResponse(content=_page("Tasks", "tasks", body, script)) + + @app.get("/ui/tasks/new") + async def ui_new_task() -> HTMLResponse: + ppo = """# PPO TaskSpec (YAML) +workload: ppo +nnodes: 2 +n_gpus_per_node: 4 +code_path: /private/common/code/verl/verl_repo +train_file: /private/common/datasets/gsm8k/train.parquet +val_file: /private/common/datasets/gsm8k/test.parquet +model_id: Qwen/Qwen2.5-0.5B-Instruct +""".strip() + grpo = """# GRPO TaskSpec (YAML) +workload: grpo +nnodes: 2 +n_gpus_per_node: 4 +code_path: /private/common/code/verl/verl_repo +train_file: /private/common/datasets/gsm8k/train.parquet +val_file: /private/common/datasets/gsm8k/test.parquet +model_id: Qwen/Qwen2.5-0.5B-Instruct +""".strip() + sft = """# SFT TaskSpec (YAML) +workload: sft +nnodes: 1 +n_gpus_per_node: 1 +code_path: /private/common/code/verl/verl_repo +train_file: /private/common/datasets/gsm8k_sft/train.parquet +val_file: /private/common/datasets/gsm8k_sft/test.parquet +model_id: Qwen/Qwen2.5-0.5B-Instruct +""".strip() + body = f""" +

New Task

+
+
Paste TaskSpec YAML and submit it to the API server. Note: code_path is required (v3.0 does not run user-uploaded code; use the common snapshot).
+
+
+ + + +
+
+ +
+
+ + Back +
+
+

+
+""".strip() + tpl_ppo = json.dumps(ppo) + tpl_grpo = json.dumps(grpo) + tpl_sft = json.dumps(sft) + script = ( + """ +const msg = document.getElementById("msg"); +const yamlEl = document.getElementById("yaml"); +const TPL_PPO = __TPL_PPO__; +const TPL_GRPO = __TPL_GRPO__; +const TPL_SFT = __TPL_SFT__; +document.getElementById("tpl-ppo").onclick = () => { yamlEl.value = TPL_PPO; msg.textContent = ""; }; +document.getElementById("tpl-grpo").onclick = () => { yamlEl.value = TPL_GRPO; msg.textContent = ""; }; +document.getElementById("tpl-sft").onclick = () => { yamlEl.value = TPL_SFT; msg.textContent = ""; }; +document.getElementById("submit").onclick = async () => { + msg.textContent = "Submitting..."; + const body = yamlEl.value; + const resp = await apiFetch("/api/v2/tasks", { method: "POST", headers: {"Content-Type":"text/plain"}, body }); + const text = await resp.text(); + if (!resp.ok) { msg.textContent = "Error: " + resp.status + "\\n" + text; return; } + const obj = JSON.parse(text); + msg.textContent = "OK: " + fmtJson(obj); + if (obj.task_id) window.location.href = "/ui/tasks/" + obj.task_id; +}; +""".strip() + .replace("__TPL_PPO__", tpl_ppo) + .replace("__TPL_GRPO__", tpl_grpo) + .replace("__TPL_SFT__", tpl_sft) + ) + return HTMLResponse(content=_page("New Task", "new", body, script)) + + @app.get("/ui/tasks/{task_id}") + async def ui_task_detail(task_id: str) -> HTMLResponse: + safe_id = html.escape(task_id) + body = f""" +

Task: {safe_id}

+
+
+ Logs + + + Back +
+
+
Loading...
+
+""".strip() + script = f""" +document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265); +const out = document.getElementById("out"); +async function refresh() {{ + out.textContent = "Loading..."; + const resp = await apiFetch("/api/v2/tasks/{task_id}"); + const text = await resp.text(); + if (!resp.ok) {{ out.textContent = "Error: " + resp.status + "\\n" + text; return; }} + out.textContent = fmtJson(JSON.parse(text)); +}} +document.getElementById("refresh").onclick = refresh; +document.getElementById("cancel").onclick = async () => {{ + if (!confirm("Cancel this task?")) return; + const resp = await apiFetch("/api/v2/tasks/{task_id}:cancel", {{ method: "POST" }}); + const text = await resp.text(); + out.textContent = (resp.ok ? "Canceled.\\n" : "Error: " + resp.status + "\\n") + text; + setTimeout(refresh, 800); +}}; +refresh(); +""".strip() + return HTMLResponse(content=_page(f"Task {task_id}", "tasks", body, script)) + + @app.get("/ui/tasks/{task_id}/logs") + async def ui_task_logs(task_id: str) -> HTMLResponse: + safe_id = html.escape(task_id) + body = f""" +

Logs: {safe_id}

+
+
+ + + Back +
+
+
Loading...
+
+""".strip() + script = f""" +document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265); +const out = document.getElementById("out"); +let timer = null; +async function refresh() {{ + const resp = await apiFetch("/api/v2/tasks/{task_id}/logs?tail=4000"); + const text = await resp.text(); + out.textContent = resp.ok ? text : ("Error: " + resp.status + "\\n" + text); +}} +document.getElementById("refresh").onclick = refresh; +document.getElementById("auto").onchange = (e) => {{ + if (e.target.checked) {{ + timer = setInterval(refresh, 2000); + }} else {{ + if (timer) clearInterval(timer); + timer = null; + }} +}}; +refresh(); +""".strip() + return HTMLResponse(content=_page(f"Logs {task_id}", "tasks", body, script)) + + @app.get("/ui/data") + async def ui_data() -> HTMLResponse: + body = """ +

Data

+
+
User files live under your home directory. Keep long-term artifacts in models/ or datasets/.
+
+ +
+
+
Username
+
+
+ + +
+
+
+
SFTPGo password
+
+
+ + + +
+
+
+ +
+ + +
+
+ You can also use an SFTP client (e.g. FileZilla) with the same username/password. + Host: , Port: . +
+ +
+
Loading...
+
+""".strip() + script = """ +const out = document.getElementById("out"); +document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265); +const u = document.getElementById("u"); +const p = document.getElementById("p"); +const sftpWeb = document.getElementById("sftp-web"); +const sftpHost = document.getElementById("sftp-host"); +const sftpPort = document.getElementById("sftp-port"); +document.getElementById("copy-u").onclick = async () => { await copyText(u.value || ""); }; +document.getElementById("copy-p").onclick = async () => { await copyText(p.value || ""); }; + +async function refresh() { + const resp = await apiFetch("/api/v2/me"); + const text = await resp.text(); + if (!resp.ok) { out.textContent = "Error: " + resp.status + "\\n" + text; return; } + const obj = JSON.parse(text); + u.value = (obj.user_id || ""); + const cached = mvpSftpPasswordGet(); + if (cached) p.value = cached; + const host = curOriginWithPort(8081); + sftpWeb.href = host + "/web/client"; + sftpHost.textContent = (obj.sftp && obj.sftp.host) ? obj.sftp.host : window.location.hostname; + sftpPort.textContent = (obj.sftp && obj.sftp.port) ? String(obj.sftp.port) : "2022"; + out.textContent = fmtJson(obj); +} +document.getElementById("reset-p").onclick = async () => { + p.value = ""; + const resp = await apiFetch("/api/v2/me/sftp:reset_password", { method: "POST" }); + const text = await resp.text(); + if (!resp.ok) { out.textContent = "Error: " + resp.status + "\\n" + text; return; } + const obj = JSON.parse(text); + p.value = obj.password || ""; + mvpSftpPasswordSet(p.value); + out.textContent = "SFTPGo password rotated.\\n\\n" + fmtJson(obj); +}; +refresh(); +""".strip() + return HTMLResponse(content=_page("Data", "data", body, script)) + + @app.get("/ui/admin") + async def ui_admin() -> HTMLResponse: + body = """ +

Admin

+
+
+ This page requires the admin token (set it in Login). +
+
+ +

Create user

+
+ + + +
+
+

+
+  
+
+ +
+
+
Loading...
+
+""".strip() + script = """ +const out = document.getElementById("out"); +const createMsg = document.getElementById("create-msg"); +const userIdEl = document.getElementById("new-user-id"); +const displayNameEl = document.getElementById("new-display-name"); + +function esc(s) { + s = String(s || ""); + return s.replaceAll("&","&").replaceAll("<","<").replaceAll(">",">"); +} + +async function refresh() { + out.textContent = "Loading..."; + try { + const obj = await apiJson("/api/v2/users?limit=200"); + const users = (obj.users || []); + function row(u) { + const uid = u.user_id; + const tok = u.token || ""; + const tokShort = tok ? (tok.length > 18 ? (tok.slice(0, 18) + "…") : tok) : ""; + const created = u.created_at || ""; + const updated = u.updated_at || ""; + const tCreated = u.token_created_at || ""; + const tUsed = u.token_last_used_at || ""; + return ` + ${esc(uid)} + ${esc(created)} + ${esc(updated)} + +
+ ${esc(tokShort)} + + +
+
token_created_at: ${esc(tCreated)}; last_used_at: ${esc(tUsed)}
+ + `; + } + out.innerHTML = ` + + + ${users.map(row).join("") || ""} +
UserCreatedUpdatedToken
(none)
+ `; + for (const btn of out.querySelectorAll("button[data-copy]")) { + btn.onclick = async () => { await copyText(btn.getAttribute("data-copy") || ""); }; + } + for (const btn of out.querySelectorAll("button[data-issue]")) { + btn.onclick = async () => { + const uid = btn.getAttribute("data-issue"); + if (!uid) return; + try { + const r = await apiJson("/api/v2/users/" + encodeURIComponent(uid) + "/tokens", { method: "POST" }); + createMsg.textContent = "Issued token:\\n" + fmtJson(r); + await refresh(); + } catch (e) { + createMsg.textContent = "Error issuing token: " + (e.status || "") + "\\n" + (e.body || String(e)); + } + }; + } + } catch (e) { + out.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e)); + } +} + +document.getElementById("refresh").onclick = refresh; +document.getElementById("create-user").onclick = async () => { + createMsg.textContent = "Creating..."; + const user_id = (userIdEl.value || "").trim(); + const display_name = (displayNameEl.value || "").trim(); + if (!user_id) { createMsg.textContent = "user_id is required"; return; } + const payload = { user_id: user_id }; + if (display_name) payload.display_name = display_name; + try { + const r = await apiJson("/api/v2/users", { method: "POST", headers: {"Content-Type":"application/json"}, body: JSON.stringify(payload) }); + createMsg.textContent = "Created:\\n" + fmtJson(r); + userIdEl.value = ""; + displayNameEl.value = ""; + await refresh(); + } catch (e) { + createMsg.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e)); + } +}; + +refresh(); +""".strip() + return HTMLResponse(content=_page("Admin", "admin", body, script)) diff --git a/src/mvp/py/tests/test_app.py b/src/mvp/py/tests/test_app.py index 99e2243..b96bf77 100644 --- a/src/mvp/py/tests/test_app.py +++ b/src/mvp/py/tests/test_app.py @@ -16,6 +16,11 @@ def _write_config(tmp_path: Path) -> Path: "entrypoint_resources": {"worker_node": 1}, "runtime_env": {"env_vars": {}}, }, + "data": { + 
# Avoid touching real /private in tests. Keep ray.shared_root as /private + # so existing path validation tests remain unchanged. + "user_root": str(tmp_path / "users"), + }, "service": { "api": {"host": "127.0.0.1", "port": 0}, "auth": {"token_env": "MVP_INTERNAL_TOKEN"}, @@ -95,6 +100,17 @@ def test_task_submit_get_cancel_logs_queue(tmp_path: Path, monkeypatch): assert r3.status_code == 200 assert "pending" in r3.json() + r3b = c.get("/api/v2/tasks?limit=10", headers=headers) + assert r3b.status_code == 200 + assert any(t.get("task_id") == "tid1" for t in r3b.json().get("tasks", [])) + + r3c = c.get("/api/v2/tasks?limit=10&offset=0&states=QUEUED", headers=headers) + assert r3c.status_code == 200 + assert all(t.get("state") == "QUEUED" for t in r3c.json().get("tasks", [])) + + r3d = c.get("/api/v2/tasks?states=NOPE", headers=headers) + assert r3d.status_code == 400 + r4 = c.post("/api/v2/tasks/tid1:cancel", headers=headers) assert r4.status_code == 200 assert r4.json()["state"] == "CANCELED" @@ -118,6 +134,14 @@ def test_task_submit_get_cancel_logs_queue(tmp_path: Path, monkeypatch): db.create_attempt(task_id="tid2", attempt_no=1, ray_submission_id="sid2") db.set_task_state(task_id="tid2", state="RUNNING", latest_attempt_no=1) + r6 = c.get("/api/v2/tasks?limit=1&offset=0&states=RUNNING", headers=headers) + assert r6.status_code == 200 + assert any(t.get("task_id") == "tid2" for t in r6.json().get("tasks", [])) + + r7 = c.get("/api/v2/tasks?limit=1&offset=1&states=RUNNING", headers=headers) + assert r7.status_code == 200 + assert "has_more" in r7.json() + r5 = c.get("/api/v2/tasks/tid2/logs?tail=1", headers=headers) assert r5.status_code == 200 assert r5.text.strip() == "c" @@ -163,3 +187,102 @@ def test_submit_rejects_invalid_jobspec(tmp_path: Path, monkeypatch): with TestClient(app) as c: r = c.post("/api/v2/tasks", headers={"authorization": "Bearer token1"}, data="workload: nope\n") assert r.status_code == 400 + + +def 
test_me_sftp_reset_password_disabled_returns_400(tmp_path: Path, monkeypatch):
+    from argus.service import app as app_mod
+
+    cfg_path = _write_config(tmp_path)
+    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "token1")
+
+    class _Scheduler:
+        def __init__(self, **kwargs):
+            self.tool = object()
+
+        def run_forever(self, stop_flag):
+            return None
+
+    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
+    app = app_mod.create_app(str(cfg_path))
+
+    # seed user + token
+    from argus.service.config import V2Config
+    from argus.service.db import Db
+
+    root = yaml.safe_load(cfg_path.read_text(encoding="utf-8"))
+    v2_cfg = V2Config.from_root_dict(root)
+    db = Db(v2_cfg.sqlite.db_path)
+    db.init()
+    db.create_user(user_id="u1", display_name=None)
+    token = db.issue_token(user_id="u1")
+
+    with TestClient(app) as c:
+        r = c.post("/api/v2/me/sftp:reset_password", headers={"authorization": f"Bearer {token}"})
+        assert r.status_code == 400
+
+
+def test_me_sftp_reset_password_enabled_returns_password(tmp_path: Path, monkeypatch):
+    from argus.service import app as app_mod
+
+    cfg = yaml.safe_load(_write_config(tmp_path).read_text(encoding="utf-8"))
+    cfg["data"]["sftpgo"] = {
+        "enabled": True,
+        "host": "127.0.0.1",
+        "sftp_port": 2022,
+        "admin_api_base": "http://127.0.0.1:8081",
+        "admin_user": "admin",
+        "admin_password_env": "SFTPGO_ADMIN_PASSWORD",
+    }
+    cfg_path = tmp_path / "cfg_sftp.yaml"
+    cfg_path.write_text(yaml.safe_dump(cfg), encoding="utf-8")
+
+    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "token1")
+    monkeypatch.setenv("SFTPGO_ADMIN_PASSWORD", "pw1")
+
+    class _FakeSFTPGo:
+        def __init__(self, **kwargs):
+            self.reset = []
+            self.enabled = []
+
+        def reset_password(self, username: str, new_password: str, home_dir: str):
+            assert username
+            assert new_password
+            assert home_dir
+            self.reset.append((username, home_dir))
+
+        def enable_user(self, username: str, home_dir: str):
+            self.enabled.append((username, home_dir))
+
+    fake_client = _FakeSFTPGo()
+
+    class _FakeSFTPGoFactory:
+        def __call__(self, **kwargs):
+            return fake_client
+
+    class _Scheduler:
+        def __init__(self, **kwargs):
+            self.tool = object()
+
+        def run_forever(self, stop_flag):
+            return None
+
+    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
+    monkeypatch.setattr(app_mod, "SFTPGoAdminClient", _FakeSFTPGoFactory())
+    app = app_mod.create_app(str(cfg_path))
+
+    # seed user in DB
+    from argus.service.db import Db
+    from argus.service.config import V2Config
+
+    v2_cfg = V2Config.from_root_dict(cfg)
+    db = Db(v2_cfg.sqlite.db_path)
+    db.init()
+    db.create_user(user_id="u1", display_name=None)
+    token = db.issue_token(user_id="u1")
+
+    with TestClient(app) as c:
+        r = c.post("/api/v2/me/sftp:reset_password", headers={"authorization": f"Bearer {token}"})
+        assert r.status_code == 200
+        j = r.json()
+        assert j["user_id"] == "u1"
+        assert isinstance(j["password"], str) and len(j["password"]) >= 8
diff --git a/src/mvp/py/tests/test_janitor.py b/src/mvp/py/tests/test_janitor.py
new file mode 100644
index 0000000..8528555
--- /dev/null
+++ b/src/mvp/py/tests/test_janitor.py
@@ -0,0 +1,225 @@
+from __future__ import annotations
+
+from datetime import datetime, timedelta, timezone
+from pathlib import Path
+
+from argus.service.db import Db
+from argus.service.janitor import JobsJanitor
+
+
+def _iso_z(dt: datetime) -> str:
+    return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")
+
+
+def _mk_job_dir(user_root: Path, user_id: str, sid: str) -> Path:
+    p = user_root / user_id / "jobs" / sid
+    p.mkdir(parents=True, exist_ok=True)
+    (p / "marker.txt").write_text("x", encoding="utf-8")
+    return p
+
+
+def _mk_trash_dir(user_root: Path, user_id: str, sid: str) -> Path:
+    p = user_root / user_id / "trash" / "jobs" / sid
+    p.mkdir(parents=True, exist_ok=True)
+    (p / "marker.txt").write_text("x", encoding="utf-8")
+    return p
+
+
+def test_janitor_moves_jobs_to_trash_after_3_days(tmp_path: Path) -> None:
+    db_path = tmp_path / "mvp.sqlite3"
+    user_root = tmp_path / "users"
+    db = Db(str(db_path))
+    db.init()
+
+    task_id = "t1"
+    user_id = "alice"
+    sid = "sid-a01"
+    db.create_task_v25(task_id=task_id, user_id=user_id, workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
+    db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
+
+    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
+    ended = now - timedelta(days=4)
+    db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="SUCCEEDED")
+
+    src = _mk_job_dir(user_root, user_id, sid)
+    jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
+    jan.tick_once(now=now)
+
+    assert not src.exists()
+    dst = user_root / user_id / "trash" / "jobs" / sid
+    assert dst.exists()
+    assert (dst / "marker.txt").read_text(encoding="utf-8") == "x"
+
+
+def test_janitor_purges_from_trash_after_7_days(tmp_path: Path) -> None:
+    db_path = tmp_path / "mvp.sqlite3"
+    user_root = tmp_path / "users"
+    db = Db(str(db_path))
+    db.init()
+
+    task_id = "t2"
+    user_id = "alice"
+    sid = "sid-a01"
+    db.create_task_v25(task_id=task_id, user_id=user_id, workload="ppo", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
+    db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
+
+    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
+    ended = now - timedelta(days=8)
+    db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="FAILED")
+
+    _mk_trash_dir(user_root, user_id, sid)
+    jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
+    jan.tick_once(now=now)
+
+    dst = user_root / user_id / "trash" / "jobs" / sid
+    assert not dst.exists()
+
+
+def test_janitor_does_not_touch_recent_jobs(tmp_path: Path) -> None:
+    db_path = tmp_path / "mvp.sqlite3"
+    user_root = tmp_path / "users"
+    db = Db(str(db_path))
+    db.init()
+
+    task_id = "t3"
+    user_id = "alice"
+    sid = "sid-a01"
+    db.create_task_v25(task_id=task_id, user_id=user_id, workload="grpo", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
+    db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
+
+    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
+    ended = now - timedelta(days=1)
+    db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="FAILED")
+
+    src = _mk_job_dir(user_root, user_id, sid)
+    jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
+    jan.tick_once(now=now)
+
+    assert src.exists()
+    assert not (user_root / user_id / "trash" / "jobs" / sid).exists()
+
+
+def test_janitor_skips_tasks_without_user_id(tmp_path: Path) -> None:
+    db_path = tmp_path / "mvp.sqlite3"
+    user_root = tmp_path / "users"
+    db = Db(str(db_path))
+    db.init()
+
+    task_id = "legacy"
+    sid = "sid-legacy"
+    db.create_task(task_id=task_id, workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
+    db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
+
+    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
+    ended = now - timedelta(days=10)
+    db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="SUCCEEDED")
+
+    # Even if someone created a matching directory under user_root, janitor shouldn't touch it because user_id is NULL.
+    src = _mk_job_dir(user_root, "alice", sid)
+    jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
+    jan.tick_once(now=now)
+    assert src.exists()
+
+
+def test_janitor_validates_retention_days(tmp_path: Path) -> None:
+    db = Db(str(tmp_path / "mvp.sqlite3"))
+    db.init()
+    try:
+        JobsJanitor(db=db, user_root="/tmp", trash_after_days=-1, purge_after_days=7, interval_s=1)
+        raise AssertionError("expected ValueError")
+    except ValueError:
+        pass
+    try:
+        JobsJanitor(db=db, user_root="/tmp", trash_after_days=3, purge_after_days=1, interval_s=1)
+        raise AssertionError("expected ValueError")
+    except ValueError:
+        pass
+
+
+def test_janitor_noop_when_disabled(tmp_path: Path) -> None:
+    db = Db(str(tmp_path / "mvp.sqlite3"))
+    db.init()
+    user_root = tmp_path / "users"
+    jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=0, purge_after_days=0, interval_s=1)
+    jan.tick_once(now=datetime(2025, 1, 1, tzinfo=timezone.utc))
+
+
+def test_janitor_handles_invalid_end_time_and_missing_fields(tmp_path: Path) -> None:
+    db_path = tmp_path / "mvp.sqlite3"
+    user_root = tmp_path / "users"
+    db = Db(str(db_path))
+    db.init()
+
+    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
+    cutoff_ended = now - timedelta(days=10)
+
+    # Missing end_time (empty string) => should be skipped.
+    db.create_task_v25(task_id="t4", user_id="alice", workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
+    db.create_attempt(task_id="t4", attempt_no=1, ray_submission_id="sid-empty")
+    db.update_attempt(task_id="t4", attempt_no=1, end_time="", ray_status="SUCCEEDED")
+
+    # Invalid ISO but still lexicographically <= cutoff => should be skipped.
+    db.create_task_v25(task_id="t5", user_id="alice", workload="sft", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
+    db.create_attempt(task_id="t5", attempt_no=1, ray_submission_id="sid-bad")
+    db.update_attempt(task_id="t5", attempt_no=1, end_time="2025-01-01T00:00:00ZZ", ray_status="FAILED")
+
+    _mk_job_dir(user_root, "alice", "sid-empty")
+    _mk_job_dir(user_root, "alice", "sid-bad")
+    jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
+    jan.tick_once(now=cutoff_ended + timedelta(days=10))
+
+    assert (user_root / "alice" / "jobs" / "sid-empty").exists()
+    assert (user_root / "alice" / "jobs" / "sid-bad").exists()
+
+
+def test_janitor_purge_moves_from_jobs_then_deletes(tmp_path: Path, monkeypatch) -> None:
+    db_path = tmp_path / "mvp.sqlite3"
+    user_root = tmp_path / "users"
+    db = Db(str(db_path))
+    db.init()
+
+    task_id = "t6"
+    user_id = "alice"
+    sid = "sid-a01"
+    db.create_task_v25(task_id=task_id, user_id=user_id, workload="ppo", jobspec_yaml="{}", nnodes=1, n_gpus_per_node=1)
+    db.create_attempt(task_id=task_id, attempt_no=1, ray_submission_id=sid)
+
+    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
+    ended = now - timedelta(days=9)
+    db.update_attempt(task_id=task_id, attempt_no=1, end_time=_iso_z(ended), ray_status="SUCCEEDED")
+
+    src = _mk_job_dir(user_root, user_id, sid)
+    jan = JobsJanitor(db=db, user_root=str(user_root), trash_after_days=3, purge_after_days=7, interval_s=1)
+
+    jan.tick_once(now=now)
+    assert not src.exists()
+    assert not (user_root / user_id / "trash" / "jobs" / sid).exists()
+
+
+def test_janitor_run_forever_requires_event_like(tmp_path: Path) -> None:
+    db = Db(str(tmp_path / "mvp.sqlite3"))
+    db.init()
+    jan = JobsJanitor(db=db, user_root=str(tmp_path / "users"), trash_after_days=3, purge_after_days=7, interval_s=1)
+    try:
+        jan.run_forever(object())
+        raise AssertionError("expected ValueError")
+    except ValueError:
+        pass
+
+
+def test_janitor_run_forever_survives_tick_exceptions(tmp_path: Path, monkeypatch) -> None:
+    db = Db(str(tmp_path / "mvp.sqlite3"))
+    db.init()
+    jan = JobsJanitor(db=db, user_root=str(tmp_path / "users"), trash_after_days=3, purge_after_days=7, interval_s=1)
+
+    class Flag:
+        def __init__(self) -> None:
+            self.n = 0
+
+        def is_set(self) -> bool:
+            self.n += 1
+            return self.n >= 2
+
+    monkeypatch.setattr(jan, "tick_once", lambda **_: (_ for _ in ()).throw(RuntimeError("boom")))
+    monkeypatch.setattr("argus.service.janitor.time.sleep", lambda *_: None)
+    jan.run_forever(Flag())
diff --git a/src/mvp/py/tests/test_service_config.py b/src/mvp/py/tests/test_service_config.py
index c6ceb0e..9fe7226 100644
--- a/src/mvp/py/tests/test_service_config.py
+++ b/src/mvp/py/tests/test_service_config.py
@@ -38,3 +38,18 @@ def test_v2_config_requires_mappings():
         V2Config.from_root_dict({"service": ["nope"]})
     with pytest.raises(ValueError, match="config\\.service\\.\\{api,auth,sqlite,scheduler\\} must be mappings"):
         V2Config.from_root_dict({"service": {"api": [1], "auth": {}, "sqlite": {}, "scheduler": {}}})
+
+
+def test_v2_config_requires_data_mappings():
+    from argus.service.config import V2Config
+
+    base = {
+        "ray": {"shared_root": "/private"},
+        "service": {"api": {}, "auth": {}, "sqlite": {}, "scheduler": {}},
+    }
+
+    with pytest.raises(ValueError, match="config\\.data must be a mapping"):
+        V2Config.from_root_dict({**base, "data": ["nope"]})
+
+    with pytest.raises(ValueError, match="config\\.data\\.\\{sftpgo,retention\\} must be mappings"):
+        V2Config.from_root_dict({**base, "data": {"sftpgo": ["x"], "retention": {}}})
diff --git a/src/mvp/py/tests/test_sftpgo.py b/src/mvp/py/tests/test_sftpgo.py
new file mode 100644
index 0000000..bc0440f
--- /dev/null
+++ b/src/mvp/py/tests/test_sftpgo.py
@@ -0,0 +1,322 @@
+from __future__ import annotations
+
+from pathlib import Path
+
+import yaml
+from fastapi.testclient import TestClient
+
+
+def _write_config(tmp_path: Path, *, enabled: bool) -> Path:
+    cfg = {
+        "ray": {
+            "address": "http://127.0.0.1:8265",
+            "shared_root": "/private",
+            "entrypoint_resources": {"worker_node": 1},
+            "runtime_env": {"env_vars": {}},
+        },
+        "data": {
+            "user_root": str(tmp_path / "users"),
+            "sftpgo": {
+                "enabled": bool(enabled),
+                "admin_api_base": "http://127.0.0.1:8081/api/v2",
+                "admin_user": "admin",
+                "admin_password_env": "SFTPGO_ADMIN_PASSWORD",
+                "host": "h1.internal",
+                "sftp_port": 2022,
+            },
+        },
+        "service": {
+            "api": {"host": "127.0.0.1", "port": 0},
+            "auth": {"token_env": "MVP_INTERNAL_TOKEN"},
+            "sqlite": {"db_path": str(tmp_path / "mvp.sqlite3")},
+            "scheduler": {"tick_s": 1, "retry_interval_s": 1, "max_running_tasks": 1},
+        },
+    }
+    p = tmp_path / "cfg.yaml"
+    p.write_text(yaml.safe_dump(cfg), encoding="utf-8")
+    return p
+
+
+def test_create_user_calls_sftpgo_when_enabled(tmp_path: Path, monkeypatch):
+    from argus.service import app as app_mod
+
+    cfg_path = _write_config(tmp_path, enabled=True)
+    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
+    monkeypatch.setenv("SFTPGO_ADMIN_PASSWORD", "pw1")
+
+    calls = {"create": [], "disable": [], "reset": []}
+
+    class _Client:
+        def __init__(self, **kwargs):
+            pass
+
+        def create_user(self, *, username: str, password: str, home_dir: str) -> None:
+            calls["create"].append((username, password, home_dir))
+
+        def enable_user(self, *, username: str, home_dir: str) -> None:
+            # Not used in this test, but required by app for idempotent upsert.
+            return None
+
+        def disable_user(self, *, username: str, home_dir: str) -> None:
+            calls["disable"].append(username)
+
+        def reset_password(self, *, username: str, new_password: str, home_dir: str) -> None:
+            calls["reset"].append((username, new_password))
+
+    monkeypatch.setattr(app_mod, "SFTPGoAdminClient", _Client)
+
+    class _Scheduler:
+        def __init__(self, **kwargs):
+            self.tool = object()
+
+        def run_forever(self, stop_flag):
+            return None
+
+    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
+    app = app_mod.create_app(str(cfg_path))
+
+    admin_headers = {"authorization": "Bearer adm1"}
+    with TestClient(app) as c:
+        r = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"})
+        assert r.status_code == 200
+        assert calls["create"]
+        username, password, home_dir = calls["create"][-1]
+        assert username == "alice"
+        assert password
+        assert home_dir.endswith("/users/alice")
+
+        r2 = c.post("/api/v2/users/alice:disable", headers=admin_headers)
+        assert r2.status_code == 200
+        assert calls["disable"] == ["alice"]
+
+        r3 = c.post("/api/v2/users/alice/sftp:reset_password", headers=admin_headers)
+        assert r3.status_code == 200
+        body = r3.json()
+        assert body["user_id"] == "alice"
+        assert body["password"]
+        assert calls["reset"] and calls["reset"][-1][0] == "alice"
+
+
+def test_create_user_upserts_when_sftpgo_user_already_exists(tmp_path: Path, monkeypatch):
+    from argus.service import app as app_mod
+    from argus.service.sftpgo import SFTPGoError
+
+    cfg_path = _write_config(tmp_path, enabled=True)
+    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
+    monkeypatch.setenv("SFTPGO_ADMIN_PASSWORD", "pw1")
+
+    calls = {"create": 0, "reset": [], "enable": []}
+
+    class _Client:
+        def __init__(self, **kwargs):
+            pass
+
+        def create_user(self, *, username: str, password: str, home_dir: str) -> None:
+            calls["create"] += 1
+            raise SFTPGoError("sftpgo http error: 409 Conflict")
+
+        def reset_password(self, *, username: str, new_password: str, home_dir: str) -> None:
+            calls["reset"].append((username, new_password))
+
+        def enable_user(self, *, username: str, home_dir: str) -> None:
+            calls["enable"].append(username)
+
+        def disable_user(self, *, username: str, home_dir: str) -> None:
+            return None
+
+    monkeypatch.setattr(app_mod, "SFTPGoAdminClient", _Client)
+
+    class _Scheduler:
+        def __init__(self, **kwargs):
+            self.tool = object()
+
+        def run_forever(self, stop_flag):
+            return None
+
+    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
+    app = app_mod.create_app(str(cfg_path))
+
+    admin_headers = {"authorization": "Bearer adm1"}
+    with TestClient(app) as c:
+        r = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "bob"})
+        assert r.status_code == 200
+        body = r.json()
+        assert body["user_id"] == "bob"
+        assert body.get("sftp", {}).get("password")
+        assert calls["create"] == 1
+        assert calls["reset"] and calls["reset"][-1][0] == "bob"
+        assert calls["enable"] == ["bob"]
+
+
+def test_sftpgo_enabled_requires_admin_password_env(tmp_path: Path, monkeypatch):
+    from argus.service import app as app_mod
+
+    cfg_path = _write_config(tmp_path, enabled=True)
+    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
+    monkeypatch.delenv("SFTPGO_ADMIN_PASSWORD", raising=False)
+
+    class _Scheduler:
+        def __init__(self, **kwargs):
+            self.tool = object()
+
+        def run_forever(self, stop_flag):
+            return None
+
+    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
+    app = app_mod.create_app(str(cfg_path))
+
+    admin_headers = {"authorization": "Bearer adm1"}
+    with TestClient(app) as c:
+        r = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"})
+        assert r.status_code == 500
+        assert "SFTPGO_ADMIN_PASSWORD" in r.text
+
+
+def test_sftpgo_admin_client_builds_requests(monkeypatch):
+    import json
+    from urllib.request import Request
+
+    from argus.service.sftpgo import SFTPGoAdminClient
+    import argus.service.sftpgo as mod
+
+    seen: list[Request] = []
+
+    class _Resp:
+        def __init__(self, body: bytes = b"ok"):
+            self._body = body
+
+        def __enter__(self):
+            return self
+
+        def __exit__(self, exc_type, exc, tb):
+            return False
+
+        def read(self):
+            return self._body
+
+    def fake_urlopen(req, timeout=0):
+        seen.append(req)
+        if req.full_url.endswith("/token"):
+            return _Resp(body=b'{"access_token":"t1","expires_at":0}')
+        # allow folder creates to be idempotent in tests
+        if req.full_url.endswith("/folders") and req.get_method() == "POST":
+            return _Resp(body=b'{"message":"created"}')
+        if req.full_url.endswith("/users/alice") and req.get_method() == "GET":
+            return _Resp(
+                body=b'{"username":"alice","status":1,"home_dir":"/private/users/alice","uid":0,"gid":0,"permissions":{"/":["*"]},"virtual_folders":[]}'
+            )
+        return _Resp()
+
+    monkeypatch.setattr(mod, "urlopen", fake_urlopen)
+
+    c = SFTPGoAdminClient(admin_api_base="http://sftpgo.local/api/v2", admin_user="admin", admin_password="pw", common_root="/private/common")
+    c.create_user(username="alice", password="p1", home_dir="/private/users/alice")
+    c.disable_user(username="alice", home_dir="/private/users/alice")
+    c.enable_user(username="alice", home_dir="/private/users/alice")
+    c.reset_password(username="alice", new_password="p2", home_dir="/private/users/alice")
+
+    # Each operation fetches a token, then issues a request.
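As an annotation on the request tally this test asserts: the per-operation counts can be read off the fake transport above (token fetch, two folder upserts, then the user call itself). A small arithmetic sketch — the constant names here are illustrative, not part of the client API:

```python
# Requests issued per operation by the client under test, as exercised by
# the fake urlopen above (counts restate what the assertions check).
TOKEN = 1    # POST /token
FOLDERS = 2  # two idempotent POST /folders upserts

create_user_calls = TOKEN + FOLDERS + 1      # + POST /users
update_op_calls = TOKEN + FOLDERS + 1 + 1    # + GET /users/<name> then PUT /users/<name>

# create_user, then disable_user + enable_user + reset_password:
total = create_user_calls + 3 * update_op_calls
print(total)  # 19, matching len(seen) in the test
```

The same arithmetic appears inline in the `assert len(seen) == 1 + 2 + 1 + 3 * (1 + 2 + 1 + 1)` that follows.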
+    # For create_user: token + ensure folders (2 POSTs) + create user
+    # For disable/enable/reset: token + ensure folders (2 POSTs) + GET user + PUT user
+    assert len(seen) == 1 + 2 + 1 + 3 * (1 + 2 + 1 + 1)
+    assert seen[0].full_url.endswith("/token")
+    assert seen[0].headers.get("Authorization", "").startswith("Basic ")
+    # create_user
+    assert seen[1].full_url.endswith("/folders")
+    assert seen[2].full_url.endswith("/folders")
+    assert seen[3].full_url.endswith("/users")
+    assert seen[3].headers.get("Authorization", "") == "Bearer t1"
+    created = json.loads(seen[3].data.decode("utf-8"))
+    assert created["username"] == "alice"
+    assert "/common" in created.get("permissions", {})
+    assert "/common/datasets" in created.get("permissions", {})
+    assert "/common/hf" in created.get("permissions", {})
+
+    # disable_user
+    assert seen[4].full_url.endswith("/token")
+    assert seen[5].full_url.endswith("/folders")
+    assert seen[6].full_url.endswith("/folders")
+    assert seen[7].full_url.endswith("/users/alice")
+    assert seen[7].get_method() == "GET"
+    assert seen[8].full_url.endswith("/users/alice")
+    assert seen[8].get_method() == "PUT"
+
+    # enable_user
+    assert seen[9].full_url.endswith("/token")
+    assert seen[10].full_url.endswith("/folders")
+    assert seen[11].full_url.endswith("/folders")
+    assert seen[12].full_url.endswith("/users/alice")
+    assert seen[12].get_method() == "GET"
+    assert seen[13].full_url.endswith("/users/alice")
+    assert seen[13].get_method() == "PUT"
+
+    # reset_password
+    assert seen[14].full_url.endswith("/token")
+    assert seen[15].full_url.endswith("/folders")
+    assert seen[16].full_url.endswith("/folders")
+    assert seen[17].full_url.endswith("/users/alice")
+    assert seen[17].get_method() == "GET"
+    assert seen[18].full_url.endswith("/users/alice")
+    assert seen[18].get_method() == "PUT"
+
+
+def test_sftpgo_admin_client_http_error_raises(monkeypatch):
+    from urllib.error import HTTPError
+
+    from argus.service.sftpgo import SFTPGoAdminClient, SFTPGoError
+    import argus.service.sftpgo as mod
+
+    def fake_urlopen(req, timeout=0):
+        if req.full_url.endswith("/token"):
+            class _Resp:
+                def __enter__(self):
+                    return self
+
+                def __exit__(self, exc_type, exc, tb):
+                    return False
+
+                def read(self):
+                    return b'{"access_token":"t1","expires_at":0}'
+
+            return _Resp()
+        raise HTTPError(req.full_url, 500, "boom", hdrs=None, fp=None)
+
+    monkeypatch.setattr(mod, "urlopen", fake_urlopen)
+
+    c = SFTPGoAdminClient(admin_api_base="http://sftpgo.local/api/v2", admin_user="admin", admin_password="pw", common_root="/private/common")
+    try:
+        c.create_user(username="alice", password="p1", home_dir="/private/users/alice")
+        assert False, "expected SFTPGoError"
+    except SFTPGoError as e:
+        assert "http error" in str(e)
+
+
+def test_sftpgo_admin_client_url_error_raises(monkeypatch):
+    from urllib.error import URLError
+
+    from argus.service.sftpgo import SFTPGoAdminClient, SFTPGoError
+    import argus.service.sftpgo as mod
+
+    def fake_urlopen(req, timeout=0):
+        if req.full_url.endswith("/token"):
+            class _Resp:
+                def __enter__(self):
+                    return self
+
+                def __exit__(self, exc_type, exc, tb):
+                    return False
+
+                def read(self):
+                    return b'{"access_token":"t1","expires_at":0}'
+
+            return _Resp()
+        raise URLError("no route")
+
+    monkeypatch.setattr(mod, "urlopen", fake_urlopen)
+
+    c = SFTPGoAdminClient(admin_api_base="http://sftpgo.local/api/v2", admin_user="admin", admin_password="pw")
+    try:
+        c.disable_user(username="alice", home_dir="/private/users/alice")
+        assert False, "expected SFTPGoError"
+    except SFTPGoError as e:
+        assert "connection error" in str(e)
diff --git a/src/mvp/py/tests/test_ui.py b/src/mvp/py/tests/test_ui.py
new file mode 100644
index 0000000..ab4ebd5
--- /dev/null
+++ b/src/mvp/py/tests/test_ui.py
@@ -0,0 +1,78 @@
+from __future__ import annotations
+
+from pathlib import Path
+
+from fastapi.testclient import TestClient
+
+from argus.service.app import create_app
+
+
+def _write_config(tmp_path: Path) -> Path:
+    p = tmp_path / "cfg.yaml"
+    p.write_text(
+        """
+ray:
+  address: "http://127.0.0.1:8265"
+  shared_root: "/private"
+  entrypoint_num_cpus: 1
+  entrypoint_resources: { worker_node: 1 }
+  runtime_env: { env_vars: { PYTHONUNBUFFERED: "1" } }
+service:
+  api: { host: "127.0.0.1", port: 8080 }
+  auth: { token_env: "MVP_INTERNAL_TOKEN" }
+  sqlite: { db_path: "%(db)s" }
+data:
+  user_root: "%(users)s"
+  sftpgo: { enabled: false }
+  retention: { jobs_trash_after_days: 3, jobs_purge_after_days: 7, janitor_interval_s: 3600 }
+"""
+        % {"db": str(tmp_path / "mvp.sqlite3"), "users": str(tmp_path / "users")}
+    )
+    return p
+
+
+def test_ui_routes_render_200(tmp_path, monkeypatch):
+    cfg = _write_config(tmp_path)
+    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "admin-token")
+    app = create_app(str(cfg))
+    c = TestClient(app)
+
+    for path in (
+        "/ui",
+        "/ui/login",
+        "/ui/tasks",
+        "/ui/tasks/new",
+        "/ui/data",
+        "/ui/admin",
+        "/ui/tasks/any-task-id",
+        "/ui/tasks/any-task-id/logs",
+    ):
+        r = c.get(path, allow_redirects=True)
+        assert r.status_code == 200
+        assert "
 Path:
     "entrypoint_resources": {"worker_node": 1},
     "runtime_env": {"env_vars": {}},
 },
+        "data": {
+            # Avoid touching real /private in tests. Keep ray.shared_root as /private
+            # so existing path validation tests remain unchanged.
+            "user_root": str(tmp_path / "users"),
+        },
     "service": {
         "api": {"host": "127.0.0.1", "port": 0},
         "auth": {"token_env": "MVP_INTERNAL_TOKEN"},
@@ -59,6 +64,9 @@ def test_admin_create_user_issue_token_and_disabled_rejected(tmp_path: Path, mon
     admin_headers = {"authorization": "Bearer adm1"}
     with TestClient(app) as c:
+        # list users requires admin
+        assert c.get("/api/v2/users", headers={"authorization": "Bearer nope"}).status_code in (401, 403)
+
         r1 = c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice", "display_name": "Alice"})
         assert r1.status_code == 200
         assert r1.json()["user_id"] == "alice"
@@ -68,6 +76,11 @@ def test_admin_create_user_issue_token_and_disabled_rejected(tmp_path: Path, mon
         token = r2.json()["token"]
         assert token
+        r2b = c.get("/api/v2/users", headers=admin_headers)
+        assert r2b.status_code == 200
+        users = r2b.json()["users"]
+        assert any(u.get("user_id") == "alice" for u in users)
+
         # non-admin token can access regular endpoints
         r3 = c.get("/api/v2/queue", headers={"authorization": f"Bearer {token}"})
         assert r3.status_code == 200
@@ -177,3 +190,165 @@ def test_submit_rejects_non_common_inputs(tmp_path: Path, monkeypatch):
     )
     assert r.status_code == 400
     assert "code_path must start with /private/common/" in r.text
+
+
+def test_submit_accepts_user_dataset_paths_and_local_model_paths(tmp_path: Path, monkeypatch):
+    from argus.service import app as app_mod
+
+    cfg_path = _write_config(tmp_path)
+    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
+
+    class _Scheduler:
+        def __init__(self, **kwargs):
+            self.tool = object()
+
+        def run_forever(self, stop_flag):
+            return None
+
+    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
+    app = app_mod.create_app(str(cfg_path))
+
+    admin_headers = {"authorization": "Bearer adm1"}
+    with TestClient(app) as c:
+        assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
+        alice_tok = c.post("/api/v2/users/alice/tokens", headers=admin_headers).json()["token"]
+        alice_headers = {"authorization": f"Bearer {alice_tok}"}
+
+        # User dataset paths are allowed.
+        r1 = c.post(
+            "/api/v2/tasks",
+            headers=alice_headers,
+            data=(
+                "workload: ppo\n"
+                "code_path: /private/common/code/verl\n"
+                "model_id: Qwen/Qwen2.5-0.5B-Instruct\n"
+                "train_file: /private/users/alice/datasets/t\n"
+            ),
+        )
+        assert r1.status_code == 200
+
+        # Local model paths under user models/ are allowed (no TaskSpec schema change).
+        r2 = c.post(
+            "/api/v2/tasks",
+            headers=alice_headers,
+            data=(
+                "workload: ppo\n"
+                "code_path: /private/common/code/verl\n"
+                "model_id: /private/users/alice/models/m1\n"
+                "train_file: /private/common/datasets/t\n"
+            ),
+        )
+        assert r2.status_code == 200
+
+
+def test_submit_rejects_cross_user_paths_and_bad_local_model_dirs(tmp_path: Path, monkeypatch):
+    from argus.service import app as app_mod
+
+    cfg_path = _write_config(tmp_path)
+    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
+
+    class _Scheduler:
+        def __init__(self, **kwargs):
+            self.tool = object()
+
+        def run_forever(self, stop_flag):
+            return None
+
+    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
+    app = app_mod.create_app(str(cfg_path))
+
+    admin_headers = {"authorization": "Bearer adm1"}
+    with TestClient(app) as c:
+        assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
+        assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "bob"}).status_code == 200
+        alice_tok = c.post("/api/v2/users/alice/tokens", headers=admin_headers).json()["token"]
+        bob_tok = c.post("/api/v2/users/bob/tokens", headers=admin_headers).json()["token"]
+        bob_headers = {"authorization": f"Bearer {bob_tok}"}
+
+        # Cross-user dataset path should be rejected.
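The cross-user rejection exercised next, together with the accept cases above, pins down a simple path rule: shared inputs live under `/private/common/`, and per-user inputs must sit under the caller's own `datasets/` directory. A standalone sketch of that rule, outside the diff — `validate_train_file` is a hypothetical mirror of the server-side check, whose actual implementation this patch does not show:

```python
def validate_train_file(user_id: str, path: str) -> None:
    """Raise ValueError unless `path` is readable by `user_id` under the v3.0 data model."""
    # Illustrative prefixes inferred from the test fixtures; not the real server code.
    allowed = ("/private/common/", f"/private/users/{user_id}/datasets/")
    if not path.startswith(allowed):  # str.startswith accepts a tuple of prefixes
        raise ValueError(f"train_file must start with one of {allowed}")


validate_train_file("alice", "/private/users/alice/datasets/t")  # accepted
validate_train_file("bob", "/private/common/datasets/t")         # accepted
```

Under this rule, bob submitting `train_file: /private/users/alice/datasets/t` fails validation, which is exactly the 400 the next request asserts.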
+        r1 = c.post(
+            "/api/v2/tasks",
+            headers=bob_headers,
+            data=(
+                "workload: ppo\n"
+                "code_path: /private/common/code/verl\n"
+                "model_id: Qwen/Qwen2.5-0.5B-Instruct\n"
+                "train_file: /private/users/alice/datasets/t\n"
+            ),
+        )
+        assert r1.status_code == 400
+        assert "/private/users/bob/datasets/" in r1.text
+
+        # Local model path must be under models/.
+        r2 = c.post(
+            "/api/v2/tasks",
+            headers=bob_headers,
+            data=(
+                "workload: ppo\n"
+                "code_path: /private/common/code/verl\n"
+                "model_id: /private/users/bob/jobs/j1/checkpoints\n"
+                "train_file: /private/common/datasets/t\n"
+            ),
+        )
+        assert r2.status_code == 400
+        assert "model_id local path must start with" in r2.text
+
+
+def test_me_returns_paths_and_retention(tmp_path: Path, monkeypatch):
+    from argus.service import app as app_mod
+
+    cfg_path = _write_config(tmp_path)
+    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
+
+    class _Scheduler:
+        def __init__(self, **kwargs):
+            self.tool = object()
+
+        def run_forever(self, stop_flag):
+            return None
+
+    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
+    app = app_mod.create_app(str(cfg_path))
+
+    admin_headers = {"authorization": "Bearer adm1"}
+    with TestClient(app) as c:
+        assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
+        alice_tok = c.post("/api/v2/users/alice/tokens", headers=admin_headers).json()["token"]
+
+        r = c.get("/api/v2/me", headers={"authorization": f"Bearer {alice_tok}"})
+        assert r.status_code == 200
+        obj = r.json()
+        assert obj["user_id"] == "alice"
+        assert obj["paths"]["home"].endswith("/users/alice")
+        assert obj["paths"]["jobs"].endswith("/users/alice/jobs")
+        assert obj["paths"]["trash_jobs"].endswith("/users/alice/trash/jobs")
+        assert obj["retention"]["jobs_trash_after_days"] == 3
+        assert obj["retention"]["jobs_purge_after_days"] == 7
+
+
+def test_create_user_creates_user_dirs(tmp_path: Path, monkeypatch):
+    from argus.service import app as app_mod
+
+    cfg_path = _write_config(tmp_path)
+    monkeypatch.setenv("MVP_INTERNAL_TOKEN", "adm1")
+
+    class _Scheduler:
+        def __init__(self, **kwargs):
+            self.tool = object()
+
+        def run_forever(self, stop_flag):
+            return None
+
+    monkeypatch.setattr(app_mod, "Scheduler", _Scheduler)
+    app = app_mod.create_app(str(cfg_path))
+
+    admin_headers = {"authorization": "Bearer adm1"}
+    with TestClient(app) as c:
+        assert c.post("/api/v2/users", headers=admin_headers, json={"user_id": "alice"}).status_code == 200
+
+    base = tmp_path / "users" / "alice"
+    assert (base / "datasets").is_dir()
+    assert (base / "models").is_dir()
+    assert (base / "code").is_dir()
+    assert (base / "jobs").is_dir()
+    assert (base / "trash" / "jobs").is_dir()
diff --git a/src/mvp/scripts/12_install_api_deps.sh b/src/mvp/scripts/12_install_api_deps.sh
index 565db17..90a2074 100755
--- a/src/mvp/scripts/12_install_api_deps.sh
+++ b/src/mvp/scripts/12_install_api_deps.sh
@@ -9,4 +9,4 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 source "${SCRIPT_DIR}/lib.sh"
 
 dexec "${HEAD_CONTAINER}" bash -lc "python3 -m pip install -U pip >/dev/null 2>&1 || true"
-dexec "${HEAD_CONTAINER}" bash -lc "python3 -m pip install -r /workspace/mvp/py/requirements.txt"
+dexec "${HEAD_CONTAINER}" bash -lc "python3 -c 'import fastapi,uvicorn,yaml' >/dev/null 2>&1 && echo 'api_deps_ok: skip' || (python3 -m pip install -r /workspace/mvp/py/requirements.txt || echo 'WARN: api deps install failed; continuing with preinstalled deps')"
diff --git a/src/mvp/scripts/60_start_api.sh b/src/mvp/scripts/60_start_api.sh
index 29cef85..ed2cbd1 100755
--- a/src/mvp/scripts/60_start_api.sh
+++ b/src/mvp/scripts/60_start_api.sh
@@ -22,7 +22,12 @@ if [[ -z "${MVP_INTERNAL_TOKEN:-}" ]]; then
   exit 1
 fi
 
-docker exec -d -e MVP_INTERNAL_TOKEN="${MVP_INTERNAL_TOKEN}" "${HEAD_CONTAINER}" bash -lc "nohup python3 /workspace/mvp/py/server.py --config '${CONFIG_IN_CONTAINER}' >>'${LOG_PATH}' 2>&1 & echo \$! >'${PID_PATH}'"
+env_args=(-e "MVP_INTERNAL_TOKEN=${MVP_INTERNAL_TOKEN}")
+if [[ -n "${SFTPGO_ADMIN_PASSWORD:-}" ]]; then
+  env_args+=(-e "SFTPGO_ADMIN_PASSWORD=${SFTPGO_ADMIN_PASSWORD}")
+fi
+
+docker exec -d "${env_args[@]}" "${HEAD_CONTAINER}" bash -lc "nohup python3 /workspace/mvp/py/server.py --config '${CONFIG_IN_CONTAINER}' >>'${LOG_PATH}' 2>&1 & echo \$! >'${PID_PATH}'"
 
 echo "[host] started; pid stored in ${PID_PATH} (container path)"
 echo "[host] logs: ${LOG_PATH} (container path)"
diff --git a/src/mvp/scripts/run_all_v30_api.sh b/src/mvp/scripts/run_all_v30_api.sh
new file mode 100755
index 0000000..78b502b
--- /dev/null
+++ b/src/mvp/scripts/run_all_v30_api.sh
@@ -0,0 +1,242 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=lib.sh
+source "${SCRIPT_DIR}/lib.sh"
+
+# E2E v3.0:
+# - Start Ray (head + stateless workers) + SFTPGo (compose)
+# - Start API server with v3.0 config
+# - Create user (API triggers SFTPGo user create) and return one-time SFTP password
+# - (Optional) verify SFTP login
+# - Submit PPO/GRPO/SFT referencing user dataset paths and wait
+
+API_ADDR="${API_ADDR:-http://127.0.0.1:8080}"
+ADMIN_TOKEN="${MVP_INTERNAL_TOKEN:-}"
+USER_ID="${USER_ID:-alice}"
+RESET_DB="${RESET_DB:-1}"
+RESET_SFTPGO="${RESET_SFTPGO:-0}"
+EXPECTED_RAY_NODES="${EXPECTED_RAY_NODES:-3}" # head + 2 workers
+CLUSTER_NAME="${CLUSTER_NAME:-argus-ray}"
+
+CONFIG_IN_CONTAINER="${CONFIG_IN_CONTAINER:-/workspace/mvp/configs/dev_v30.yaml}"
+SFTPGO_ADMIN_PASSWORD="${SFTPGO_ADMIN_PASSWORD:-my-dev-sftpgo-admin}"
+export SFTPGO_ADMIN_PASSWORD
+
+if [[ -z "${ADMIN_TOKEN}" ]]; then
+  echo "ERROR: MVP_INTERNAL_TOKEN must be set in host env (admin token)" >&2
+  exit 1
+fi
+
+api_curl_admin() {
+  curl -sS -H "Authorization: Bearer ${ADMIN_TOKEN}" "$@"
+}
+
+api_wait_ready() {
+  local tries="${1:-60}"
+  for i in $(seq 1 "${tries}"); do
+    if curl -sS -m 2 "${API_ADDR}/docs" >/dev/null 2>&1; then
+      echo "[host] api_ready: ${API_ADDR}"
+      return 0
+    fi
+    echo "[host] waiting api... (${i}/${tries})"
+    sleep 2
+  done
+  echo "ERROR: api not ready: ${API_ADDR}" >&2
+  return 1
+}
+
+sftpgo_wait_ready() {
+  local tries="${1:-60}"
+  local url="${2:-http://127.0.0.1:8081/api/v2/token}"
+  for i in $(seq 1 "${tries}"); do
+    # SFTPGo admin endpoints require auth; readiness means "HTTP reachable and can issue token".
+    if curl -sS -m 2 -u "admin:${SFTPGO_ADMIN_PASSWORD}" "${url}" >/dev/null 2>&1; then
+      echo "[host] sftpgo_ready: ${url} (token ok)"
+      return 0
+    fi
+    echo "[host] waiting sftpgo... (${i}/${tries})"
+    sleep 2
+  done
+  echo "ERROR: sftpgo not ready: ${url}" >&2
+  return 1
+}
+
+ray_wait_ready() {
+  local tries="${1:-60}"
+  for i in $(seq 1 "${tries}"); do
+    if curl -sS -m 2 "${RAY_DASHBOARD_ADDR}/api/version" >/dev/null 2>&1; then
+      echo "[host] ray_dashboard_ready: ${RAY_DASHBOARD_ADDR}"
+      return 0
+    fi
+    echo "[host] waiting ray dashboard... (${i}/${tries})"
+    sleep 2
+  done
+  echo "ERROR: ray dashboard not ready: ${RAY_DASHBOARD_ADDR}" >&2
+  return 1
+}
+
+ray_wait_nodes() {
+  local want="${1:-3}"
+  local tries="${2:-60}"
+  for i in $(seq 1 "${tries}"); do
+    local out n
+    out="$(docker exec -i "${HEAD_CONTAINER}" python3 -c "import ray; ray.init(address='auto', ignore_reinit_error=True, log_to_driver=False, logging_level='ERROR'); print(sum(1 for n in ray.nodes() if n.get('Alive')))" 2>/dev/null || true)"
+    n="$(printf '%s\n' "${out}" | tail -n 1 | tr -cd '0-9' || true)"
+    if [[ "${n}" =~ ^[0-9]+$ ]]; then
+      echo "[host] ray_nodes_alive=${n} (want>=${want})"
+      if [[ "${n}" -ge "${want}" ]]; then
+        return 0
+      fi
+    else
+      echo "[host] waiting ray nodes... (${i}/${tries})"
+    fi
+    sleep 2
+  done
+  echo "ERROR: ray nodes not ready (want>=${want})" >&2
+  docker exec -i "${HEAD_CONTAINER}" bash -lc "ray status || true" >&2 || true
+  return 1
+}
+
+submit_taskspec_inline() {
+  local token="$1"
+  local yaml_body="$2"
+  local resp
+  resp="$(curl -sS -H "Authorization: Bearer ${token}" -H "Content-Type: application/yaml" --data-binary "${yaml_body}" "${API_ADDR}/api/v2/tasks")"
+  echo "[host] submit_resp: ${resp}" >&2
+  printf '%s' "${resp}" | python3 -c 'import sys,json; print(json.load(sys.stdin)["task_id"])'
+}
+
+wait_task() {
+  local token="$1"
+  local task_id="$2"
+  while true; do
+    local body state
+    body="$(curl -sS -H "Authorization: Bearer ${token}" "${API_ADDR}/api/v2/tasks/${task_id}")"
+    state="$(printf '%s' "${body}" | python3 -c 'import sys,json; print(json.load(sys.stdin)["state"])')"
+    echo "[host] task ${task_id}: ${state}"
+
+    if [[ "${state}" == "SUCCEEDED" ]]; then
+      return 0
+    fi
+    if [[ "${state}" == "FAILED" || "${state}" == "CANCELED" ]]; then
+      echo "[host] terminal=${state}; tail logs (best-effort):" >&2
+      curl -sS -H "Authorization: Bearer ${token}" "${API_ADDR}/api/v2/tasks/${task_id}/logs?tail=200" >&2 || true
+      return 1
+    fi
+    sleep 10
+  done
+}
+
+echo "[host] ===== run_all_v30_api.sh begin ====="
+
+"${SCRIPT_DIR}/00_prereq_check.sh"
+"${SCRIPT_DIR}/03_cleanup_v1_legacy.sh"
+"${SCRIPT_DIR}/04_cleanup_v2_legacy.sh"
+
+echo "[host] bring down existing containers (best-effort)"
+"${SCRIPT_DIR}/02_down.sh" || true
+
+if [[ "${RESET_SFTPGO}" == "1" ]]; then
+  echo "[host] reset sftpgo metadata dir (best-effort, via helper container)"
+  SFTPGO_META_DIR="${ROOT_DIR}/../../shared/common/sftpgo"
+  mkdir -p "${SFTPGO_META_DIR}"
+  docker run --rm --entrypoint sh -u 0:0 -v "${SFTPGO_META_DIR}:/mnt" drakkan/sftpgo:latest -lc "rm -rf /mnt/* || true"
+fi
+
+echo "[host] (re)create containers (Ray + SFTPGo)"
+"${SCRIPT_DIR}/01_up.sh"
+
+echo "[host] wait ray head ready"
+ray_wait_ready 60
+
+echo "[host] wait sftpgo ready"
+sftpgo_wait_ready 60 "http://127.0.0.1:8081/api/v2/token"
+
+echo "[host] render v3.0 config with SFTPGo container IP (work around docker DNS issues)"
+SFTPGO_IP="$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' argus-sftpgo)"
+RENDERED_CFG_HOST_PATH="/tmp/dev_v30_rendered.yaml"
+sed -E "s#^(\\s*admin_api_base:) .*#\\1 \"http://${SFTPGO_IP}:8080/api/v2\"#g" "${ROOT_DIR}/configs/dev_v30.yaml" >"${RENDERED_CFG_HOST_PATH}"
+docker cp "${RENDERED_CFG_HOST_PATH}" "${HEAD_CONTAINER}:/tmp/dev_v30_rendered.yaml"
+CONFIG_IN_CONTAINER="/tmp/dev_v30_rendered.yaml"
+
+echo "[host] verify head discovery record (supervised in-container)"
+HEAD_IP_FILE="${SHARED_ROOT}/ray/discovery/${CLUSTER_NAME}/head.json"
+dexec "${HEAD_CONTAINER}" bash -lc "test -f '${HEAD_IP_FILE}' && python3 -c 'import json,sys; print(json.load(open(sys.argv[1]))[\"job_server_url\"])' '${HEAD_IP_FILE}' || true"
+
+echo "[host] wait workers join"
+ray_wait_nodes "${EXPECTED_RAY_NODES}" 120
+
+echo "[host] prepare data/model (idempotent; reuse cache)"
+"${SCRIPT_DIR}/30_prepare_data_and_model.sh"
+
+echo "[host] install api deps in head container"
+"${SCRIPT_DIR}/12_install_api_deps.sh"
+
+echo "[host] stop api (best-effort)"
+"${SCRIPT_DIR}/61_stop_api.sh" || true
+
+if [[ "${RESET_DB}" == "1" ]]; then
+  echo "[host] reset api sqlite db in container (best-effort)"
+  docker exec -i "${HEAD_CONTAINER}" bash -lc "rm -f /private/common/db/mvp.sqlite3 /private/common/db/mvp.sqlite3-wal /private/common/db/mvp.sqlite3-shm || true"
+else
+  echo "[host] keep existing api sqlite db (RESET_DB=${RESET_DB})"
+fi
+
+echo "[host] start api (admin token + sftpgo admin password via env)"
+MVP_INTERNAL_TOKEN="${ADMIN_TOKEN}" CONFIG_IN_CONTAINER="${CONFIG_IN_CONTAINER}" SFTPGO_ADMIN_PASSWORD="${SFTPGO_ADMIN_PASSWORD}" "${SCRIPT_DIR}/60_start_api.sh"
+api_wait_ready 60
+
+echo "[host] create user (expect SFTP one-time password in response)"
+create_resp="$(api_curl_admin -H
"Content-Type: application/json" -d "{\"user_id\":\"${USER_ID}\"}" "${API_ADDR}/api/v2/users")"
+echo "[host] create_user_resp: ${create_resp}"
+USER_SFTP_PASSWORD="$(printf '%s' "${create_resp}" | python3 -c 'import sys,json; o=json.load(sys.stdin); print((o.get("sftp") or {}).get("password") or "")')"
+if [[ -z "${USER_SFTP_PASSWORD}" ]]; then
+  echo "ERROR: expected sftp.password in create user response (is data.sftpgo.enabled=true?)" >&2
+  exit 1
+fi
+
+echo "[host] issue user token"
+USER_TOKEN="$(api_curl_admin -X POST "${API_ADDR}/api/v2/users/${USER_ID}/tokens" | python3 -c 'import sys,json; print(json.load(sys.stdin)["token"])')"
+echo "[host] user_token_issued: user=${USER_ID}"
+
+echo "[host] (optional) verify SFTP login (best-effort)"
+if command -v sshpass >/dev/null 2>&1 && command -v sftp >/dev/null 2>&1; then
+  tmp_batch="$(mktemp)"
+  cat >"${tmp_batch}" <<'EOF'
+ls
+EOF
+  sshpass -p "${USER_SFTP_PASSWORD}" sftp -P 2022 -o StrictHostKeyChecking=no -b "${tmp_batch}" "${USER_ID}@127.0.0.1" >/dev/null 2>&1 || true
+  rm -f "${tmp_batch}" || true
+else
+  echo "[host] skip: sshpass/sftp not found; you can test manually with: sftp -P 2022 ${USER_ID}@<host>"
+fi
+
+echo "[host] ensure user dataset paths exist (copy from common if needed; best-effort)"
+echo "[host] copy dataset into user datasets path inside head container (avoid host permission issues)"
+dexec "${HEAD_CONTAINER}" bash -lc "set -euo pipefail; \
+  mkdir -p '/private/users/${USER_ID}/datasets/gsm8k' '/private/users/${USER_ID}/datasets/gsm8k_sft'; \
+  (cp -f /private/common/datasets/gsm8k/train.parquet '/private/users/${USER_ID}/datasets/gsm8k/train.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k/train.parquet '/private/users/${USER_ID}/datasets/gsm8k/train.parquet' 2>/dev/null || true); \
+  (cp -f /private/common/datasets/gsm8k/test.parquet '/private/users/${USER_ID}/datasets/gsm8k/test.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k/test.parquet '/private/users/${USER_ID}/datasets/gsm8k/test.parquet' 2>/dev/null || true); \
+  (cp -f /private/common/datasets/gsm8k_sft/train.parquet 
'/private/users/${USER_ID}/datasets/gsm8k_sft/train.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k_sft/train.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/train.parquet' 2>/dev/null || true); \ + (cp -f /private/common/datasets/gsm8k_sft/test.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/test.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k_sft/test.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/test.parquet' 2>/dev/null || true)" + +echo "[host] submit PPO/GRPO/SFT via API using user dataset paths" +PPO_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: ppo\nnnodes: 2\nn_gpus_per_node: 4\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/test.parquet\nmodel_id: Qwen/Qwen2.5-0.5B-Instruct\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')" +GRPO_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: grpo\nnnodes: 2\nn_gpus_per_node: 4\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/test.parquet\nmodel_id: Qwen/Qwen2.5-0.5B-Instruct\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')" +SFT_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: sft\nnnodes: 1\nn_gpus_per_node: 1\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k_sft/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k_sft/test.parquet\nmodel_id: Qwen/Qwen2.5-0.5B-Instruct\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')" + +echo "[host] submitted task ids:" +echo " ppo=${PPO_TASK_ID}" +echo " grpo=${GRPO_TASK_ID}" +echo " sft=${SFT_TASK_ID}" + +echo "[host] wait for tasks (in submission order)" +wait_task "${USER_TOKEN}" "${PPO_TASK_ID}" +wait_task "${USER_TOKEN}" "${GRPO_TASK_ID}" 
+wait_task "${USER_TOKEN}" "${SFT_TASK_ID}" + +echo "[host] ===== run_all_v30_api.sh done =====" diff --git a/src/mvp/scripts/run_e2e_v30_cases.sh b/src/mvp/scripts/run_e2e_v30_cases.sh new file mode 100755 index 0000000..dc56b19 --- /dev/null +++ b/src/mvp/scripts/run_e2e_v30_cases.sh @@ -0,0 +1,19 @@ +#!/usr/bin/env bash +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# Minimal E2E runner for v3.0: +# - Happy path: Ray + SFTPGo + API + 3 workloads +# - Leaves retention validation as a manual follow-up (adjust thresholds). + +ADMIN_TOKEN="${MVP_INTERNAL_TOKEN:-my-dev-token}" +USER_ID="${USER_ID:-alice}" + +echo "[case] HP-1: run_all_v30_api.sh (Ray + SFTPGo + API + 3 workloads)" +MVP_INTERNAL_TOKEN="${ADMIN_TOKEN}" USER_ID="${USER_ID}" RESET_DB=1 RESET_SFTPGO=0 "${SCRIPT_DIR}/run_all_v30_api.sh" + +echo "[case] NOTE: retention validation (C2) is manual:" +echo " - set data.retention.jobs_trash_after_days / jobs_purge_after_days to small values in configs/dev_v30.yaml" +echo " - restart API server and wait for janitor" +
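The retention case (C2) that `run_e2e_v30_cases.sh` leaves manual can be sketched with the same `sed -E` render-a-copy pattern that `run_all_v30_api.sh` uses for the SFTPGo address. This is a minimal sketch, assuming `configs/dev_v30.yaml` carries the retention settings as plain `key: value` lines under `data.retention:` — the exact key layout is an assumption, not confirmed by the repo:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway copy standing in for configs/dev_v30.yaml (layout assumed, not confirmed).
cfg="$(mktemp)"
cat >"${cfg}" <<'EOF'
data:
  retention:
    jobs_trash_after_days: 14
    jobs_purge_after_days: 30
EOF

# Shrink the retention windows so the janitor acts within a short test run,
# writing a rendered copy instead of editing the tracked config in place.
sed -E \
  -e 's#^([[:space:]]*jobs_trash_after_days:).*#\1 1#' \
  -e 's#^([[:space:]]*jobs_purge_after_days:).*#\1 2#' \
  "${cfg}" >"${cfg}.rendered"

grep -E 'jobs_(trash|purge)_after_days' "${cfg}.rendered"
# Next (manual, per the notes above): restart the API server against the
# rendered config and wait for the janitor to trash/purge old job dirs.
```

Pointing `CONFIG_IN_CONTAINER` at the rendered copy when rerunning `60_start_api.sh` would then let the janitor's behavior be observed without touching the tracked `dev_v30.yaml`.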