V3.9 重构ui.py,每个网页拆分独立文件

This commit is contained in:
yuyr 2026-01-12 17:24:34 +08:00
parent 938f84f1ba
commit ac9c80ed8c
27 changed files with 1896 additions and 1456 deletions

1
.gitignore vendored
View File

@ -6,3 +6,4 @@ __pycache__/
.pytest_cache/
.coverage
htmlcov/
AGENTS.md

165
README.md Normal file
View File

@ -0,0 +1,165 @@
# Argus Cluster
本仓库主要代码位于 `src/mvp/py/`,单元测试位于 `src/mvp/py/tests/``pytest.ini` 已配置 `testpaths`)。
## 使用 uv 创建/激活虚拟环境并安装依赖
前置条件:已安装 `uv`Astral 的 Python 包/虚拟环境工具)。
1) 在仓库根目录创建虚拟环境:
```bash
uv venv .venv
```
2) 激活虚拟环境:
```bash
source .venv/bin/activate
```
3) 在虚拟环境中安装依赖(运行与测试依赖):
```bash
uv pip install -r src/mvp/py/requirements.txt -r src/mvp/py/requirements-dev.txt
```
## 运行单元测试
在仓库根目录执行:
```bash
pytest
```
如需显式使用 Python 模块方式:
```bash
python -m pytest
```
## 远端开发h1同步代码、构建镜像、初始化共享目录、拉起 Ray 集群
`src/mvp/docker-compose.yaml``src/mvp/` 为工作目录,并挂载 `../../shared``../../verl`,因此推荐远端目录结构如下:
- `/home2/argus/infra/mvp/src/mvp/`:本仓库的 `src/mvp/` 内容
- `/home2/argus/infra/mvp/shared/`:共享目录(模拟/对齐 NFS
- `/home2/argus/infra/mvp/verl/`:训练代码仓库(脚本会检查该目录存在)
1) 同步代码到远端(只同步 `src/mvp/`,确保相对路径挂载正确):
```bash
ssh argus@h1 "mkdir -p /home2/argus/infra/mvp/src/mvp"
rsync -av --delete src/mvp/ argus@h1:/home2/argus/infra/mvp/src/mvp/
```
2) 在远端准备 `verl` 仓库(若已存在可跳过):
```bash
ssh argus@h1 "mkdir -p /home2/argus/infra/mvp && test -d /home2/argus/infra/mvp/verl || echo '缺少 /home2/argus/infra/mvp/verl请先 git clone'"
# 下载verl
ssh argus@h1 "cd /home2/argus/infra/mvp && git clone https://github.com/volcengine/verl.git"
```
3) 登录远端并初始化共享目录环境(会创建 `../../shared/...` 等目录):
```bash
ssh argus@h1
cd /home2/argus/infra/mvp/src/mvp
./scripts/00_prereq_check.sh
```
4) 构建 Ray 节点镜像并拉起集群(首次或镜像不存在时建议强制构建):
```bash
cd /home2/argus/infra/mvp/src/mvp
BUILD=1 ./scripts/01_up.sh
```
5) 验证集群状态:
```bash
./scripts/50_status.sh
curl -sS http://127.0.0.1:8265/api/version | head
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}' | grep -E 'argus-ray-|sftpgo|wandb' || true
```
6) 关闭集群:
```bash
./scripts/02_down.sh
```
## 远端h1端口映射、创建 W&B user用户
使用以下命令将h1的端口映射到本机
```
ssh -p 12022 -L 8265:127.0.0.1:8265 -L 8080:127.0.0.1:8080 -L 8081:127.0.0.1:8081 -L 2022:127.0.0.1:2022 -L 8090:127.0.0.1:8090 -L 8000:127.0.0.1:8000 -o ProxyJump=ssh@jump.nasp.fit:36022 nasp@192.168.20.121
```
其中:
- 8265: ray dashboard
- 8080: webui & API 服务端口
- 8081: sftpgo web client
- 2022: sftpgo sftp协议端口
- 8090: weight & bias 网站端口
- 8000: model serving openai服务端口
## 远端h1预先准备数据/模型,启动与关闭 API Server
下面命令均在远端执行:
```bash
ssh argus@h1
cd /home2/argus/infra/mvp/src/mvp
```
1) 预先下载数据集与模型(写入共享目录,幂等可重复执行):
```bash
# 可选:指定要缓存的模型(默认 Qwen/Qwen2.5-0.5B-Instruct
MODEL_ID="Qwen/Qwen2.5-0.5B-Instruct" ./scripts/30_prepare_data_and_model.sh
```
2) 安装 API 依赖(在 head 容器内 best-effort 安装 `fastapi/uvicorn/yaml` 等):
```bash
./scripts/12_install_api_deps.sh
```
3) 启动 API Server监听 `:8080`,需要设置鉴权 token
```bash
export MVP_INTERNAL_TOKEN="your-dev-token"
./scripts/60_start_api.sh
curl -sS http://127.0.0.1:8080/docs | head
```
查看 API 进程状态(基于 pid 文件):
```bash
./scripts/62_status_api.sh
```
4) 关闭 API Server
```bash
./scripts/61_stop_api.sh
```
## 远端h1W&B LocalwandbLicense 激活常见问题(端口转发)
如果你通过 SSH 端口转发访问 W&B UI但在 “Add license” 页面点击 “Update” 时提示:
`Unable to reach the backend api`,通常是因为 W&B 前端使用了“绝对 URL”去请求后端 API
而该 URL 的 host/port 与你当前端口转发的地址不一致(例如 UI 打开在 `http://localhost:8090`
但前端去请求 `http://localhost:8080`)。
建议二选一:
- 方式 A推荐最简单把本地 `8080` 转发到远端 `8090`,然后用 `http://localhost:8080` 打开 UI
- `ssh -L 8080:127.0.0.1:8090 argus@h1`
- 方式 B在远端启动容器前设置 `WANDB_HOST`(与浏览器地址一致),并重建 `wandb` 容器:
- `export WANDB_HOST=http://localhost:8090`
- `docker compose -f /home2/argus/infra/mvp/src/mvp/docker-compose.yaml up -d --force-recreate wandb`
也可以通过设置 `WANDB_LICENSE=...` 在容器启动时注入 license绕过 UI 更新步骤。

View File

@ -0,0 +1,108 @@
# v3.9 UI 重构方案(保持功能不变)
## 背景与问题
当前 `src/mvp/py/argus/service/ui.py` 单文件约 1400+ 行,包含:
- 全局 CSS/JS长字符串
- 布局渲染nav/page 拼接)
- 11 个页面的 HTML + 大段内嵌 JS包含 TaskSpec 模板与表单逻辑)
导致:变更难定位、合并冲突多、缺少模块边界、复用困难、测试覆盖薄弱。
## 目标(功能不变)
- **路由与页面行为完全不变**URL、返回内容、按钮/表单行为、localStorage key`mvp_token`/`mvp_sftp_password`、API 调用路径保持不变。
- **不引入前端构建链/新依赖**(仍然用纯字符串/轻量模板函数)。
- 将 UI 拆分为可维护的多个文件(放到 `src/mvp/py/argus/ui/`)。
- 增加最小的单测(确保路由可访问、关键 DOM 标识存在)。
## 非目标
- 不重做 UI 样式/交互;不引入 React/Vue不改后端 API。
- 不新增鉴权逻辑(仍然是浏览器 localStorage + Bearer token
## 拆分后的目录结构(建议)
新增包:`src/mvp/py/argus/ui/`
```
argus/ui/
__init__.py # register_ui_routes(app) 统一入口
assets/
base_css.py # BASE_CSS 常量
base_js.py # BASE_JS 常量apiFetch/apiJson 等通用函数)
layout/
nav.py # nav(active) + 链接配置
page.py # page(title, active, body, script, extra_head=...)
pages/
login.py # /ui/login
tasks.py # /ui/tasks
task_new.py # /ui/tasks/new模板常量 + 表单 JS
task_detail.py # /ui/tasks/{task_id}
task_logs.py # /ui/tasks/{task_id}/logs
serving.py # /ui/serving, /ui/serving/new, /ui/serving/{model_key}
data.py # /ui/data
admin.py # /ui/admin
routes.py # 将各 pages.register(app) 聚合注册
```
兼容层(可选但推荐):保留 `src/mvp/py/argus/service/ui.py` 仅做转发:
```py
from argus.ui import register_ui_routes
```
这样可以避免一次性改动 `service/app.py` 的 import 路径,减少风险。
## 页面拆分原则
每个 page 模块提供两个函数:
- `render(...) -> HTMLResponse`:只负责拼接 body/script不直接碰 FastAPI app
- `register(app: FastAPI) -> None`:只负责挂载路由(`@app.get(...)`)。
通用能力下沉:
- `_BASE_CSS`/`_BASE_JS` 移到 `assets/`
- `_nav()``_page()` 移到 `layout/`
- 大块常量TaskSpec 模板、UI 文案)放在页面模块同文件顶部,避免散落在函数内部。
## 资源交付方式(两种可选)
### 方案 A最稳继续内联 CSS/JS但拆到不同 Python 文件
- `page()` 内继续 `<style>{BASE_CSS}</style>``<script>{BASE_JS}</script>`
- 只改变代码组织,不改变浏览器加载方式,风险最低。
### 方案 B推荐中期新增静态端点分发资源
新增:
- `GET /ui/assets/base.css`
- `GET /ui/assets/base.js`
页面改为 `<link rel="stylesheet" href="/ui/assets/base.css">` + `<script src="/ui/assets/base.js"></script>`
优点:减少 HTML 体积、浏览器缓存更好;缺点:需要确认反向代理/中间件不拦截这些路由。
建议 v3.9 先落地方案 A稳定后再做方案 B。
## 迁移步骤(建议分 3 次 PR
1) **抽公共层**:引入 `argus/ui/assets/*``argus/ui/layout/*`,保持 UI 输出完全一致;`service/ui.py` 仍在但内部改为调用新 layout或先不动
2) **按页面迁移**:逐个把 routes 迁移到 `argus/ui/pages/*`每迁一个页面就加一个最小测试用例200 + 关键文本存在)。
3) **清理与稳定**`service/ui.py` 变为兼容转发;可选引入 `/ui/assets/*` 静态端点(方案 B
## 测试策略(最小但有效)
新增 `src/mvp/py/tests/test_ui_pages.py`
- 创建 FastAPI app复用现有测试的 app 初始化方式)
- 请求下列页面,断言 `status_code == 200`
- `/ui/login`, `/ui/tasks`, `/ui/tasks/new`, `/ui/serving`, `/ui/data`, `/ui/admin`
- 断言响应包含稳定锚点文本(例如 `Argus MVP`, `New Task`, `Tasks`),避免脆弱的全量快照。
## 验收标准Definition of Done
- 11 个 `/ui/*` 路由行为与输出不变(人工 smoke + 自动化最小测试)。
- `src/mvp/py/argus/service/ui.py` 不再包含大段 HTML/JS仅兼容转发或极薄封装
- 新增/修改 UI 页面不需要触碰 1000+ 行单文件;每页的改动范围限定在对应模块。

View File

@ -17,6 +17,10 @@ ray:
PYTHONUNBUFFERED: "1"
# v3.7: forbid HuggingFace Hub network access from Ray jobs (use cached snapshots).
HF_HUB_OFFLINE: "1"
# Unify cache dirs so `from_pretrained("org/name")` resolves from the same on-disk cache in offline mode.
HF_HOME: "/private/hf"
HUGGINGFACE_HUB_CACHE: "/private/hf/hub"
TRANSFORMERS_CACHE: "/private/hf/hub"
# 用户自定义代码目录(可被 PYTHONPATH 注入)
user_code_path: "/private/user/code"

View File

@ -17,6 +17,10 @@ ray:
PYTHONUNBUFFERED: "1"
# v3.7: forbid HuggingFace Hub network access from Ray jobs (use cached snapshots).
HF_HUB_OFFLINE: "1"
# Unify cache dirs so `from_pretrained("org/name")` resolves from the same on-disk cache in offline mode.
HF_HOME: "/private/hf"
HUGGINGFACE_HUB_CACHE: "/private/hf/hub"
TRANSFORMERS_CACHE: "/private/hf/hub"
# v3.0 先不支持 user code 执行
user_code_path: "/private/user/code"

View File

@ -95,6 +95,13 @@ services:
aliases:
- wandb
- argus-wandb
environment:
# W&B Local uses this as its externally-reachable base URL for the web app/backend.
# If you access via SSH port-forwarding, set WANDB_HOST to the exact URL you open in the browser
# (e.g. http://localhost:8090 or http://localhost:8080).
HOST: "${WANDB_HOST:-http://localhost:8090}"
# Optional: provide license at container start to avoid UI activation flakiness.
LICENSE: "${WANDB_LICENSE:-}"
ray_worker_0:
image: argus/argus-ray-node:vllm011.latest

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,6 @@
from __future__ import annotations
from .routes import register_ui_routes
__all__ = ["register_ui_routes"]

View File

@ -0,0 +1,2 @@
from __future__ import annotations

View File

@ -0,0 +1,33 @@
from __future__ import annotations
BASE_CSS = """
:root { --bg:#0b1020; --panel:#111a33; --muted:#95a3c6; --fg:#e8eeff; --accent:#7aa2ff; --danger:#ff6b6b; --ok:#3ddc97; }
* { box-sizing: border-box; }
body { margin:0; font-family: ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Helvetica, Arial; background:var(--bg); color:var(--fg); }
a { color:var(--accent); text-decoration:none; }
.layout { display:flex; min-height:100vh; }
.nav { width: 240px; padding:16px; background: linear-gradient(180deg, #0e1630, #0b1020); border-right: 1px solid rgba(255,255,255,0.06); }
.brand { font-weight: 700; letter-spacing: .2px; margin-bottom: 12px; }
.nav a { display:block; padding:10px 10px; border-radius:10px; color: var(--fg); opacity: .9; }
.nav a.active { background: rgba(122,162,255,0.14); border: 1px solid rgba(122,162,255,0.22); }
.nav a:hover { background: rgba(255,255,255,0.06); }
.main { flex:1; padding: 20px 24px; }
.card { background: rgba(255,255,255,0.04); border: 1px solid rgba(255,255,255,0.08); border-radius: 14px; padding: 16px; }
.row { display:flex; gap: 12px; align-items:center; flex-wrap: wrap; }
.muted { color: var(--muted); }
.btn { border: 1px solid rgba(255,255,255,0.16); background: rgba(255,255,255,0.06); color: var(--fg); padding: 10px 12px; border-radius: 10px; cursor: pointer; }
.btn:hover { background: rgba(255,255,255,0.10); }
.btn.active { background: rgba(122,162,255,0.14); border-color: rgba(122,162,255,0.22); }
.btn.danger { border-color: rgba(255,107,107,0.35); background: rgba(255,107,107,0.10); }
.pill { display:inline-block; padding: 2px 10px; border-radius: 999px; border: 1px solid rgba(255,255,255,0.16); font-size: 12px; }
.pill.ok { border-color: rgba(61,220,151,0.35); background: rgba(61,220,151,0.12); }
.pill.bad { border-color: rgba(255,107,107,0.35); background: rgba(255,107,107,0.12); }
textarea, input { width: 100%; color: var(--fg); background: rgba(255,255,255,0.06); border: 1px solid rgba(255,255,255,0.12); border-radius: 12px; padding: 10px 12px; outline: none; }
select { color: var(--fg); background: rgba(255,255,255,0.06); border: 1px solid rgba(255,255,255,0.12); border-radius: 12px; padding: 10px 12px; outline: none; }
select option { color: var(--fg); background: #0e1630; }
button:disabled { opacity: .45; cursor: not-allowed; }
pre { white-space: pre-wrap; word-break: break-word; }
table { width:100%; border-collapse: collapse; }
th, td { padding: 10px 8px; border-bottom: 1px solid rgba(255,255,255,0.08); text-align:left; }
""".strip()

View File

@ -0,0 +1,70 @@
from __future__ import annotations
BASE_JS = """
function mvpTokenGet() {
return (localStorage.getItem("mvp_token") || "").trim();
}
function mvpTokenSet(v) {
localStorage.setItem("mvp_token", (v || "").trim());
}
function mvpSftpPasswordGet() {
return (localStorage.getItem("mvp_sftp_password") || "").trim();
}
function mvpSftpPasswordSet(v) {
localStorage.setItem("mvp_sftp_password", (v || "").trim());
}
async function apiFetch(path, opts) {
opts = opts || {};
opts.headers = opts.headers || {};
const tok = mvpTokenGet();
if (tok) opts.headers["Authorization"] = "Bearer " + tok;
return fetch(path, opts);
}
async function apiJson(path, opts) {
const resp = await apiFetch(path, opts);
const text = await resp.text();
if (!resp.ok) {
const err = new Error("HTTP " + resp.status);
err.status = resp.status;
err.body = text;
throw err;
}
return JSON.parse(text);
}
function fmtJson(obj) {
try { return JSON.stringify(obj, null, 2); } catch (e) { return String(obj); }
}
function curOriginWithPort(port) {
const proto = window.location.protocol;
const host = window.location.hostname;
return proto + "//" + host + ":" + port;
}
async function copyText(v) {
if (!v) return false;
try {
await navigator.clipboard.writeText(v);
return true;
} catch (e) {
// Fallback for non-secure contexts (http) or older browsers.
try {
const ta = document.createElement("textarea");
ta.value = v;
ta.style.position = "fixed";
ta.style.opacity = "0";
document.body.appendChild(ta);
ta.focus();
ta.select();
const ok = document.execCommand("copy");
document.body.removeChild(ta);
return ok;
} catch (e2) {
return false;
}
}
}
document.addEventListener("DOMContentLoaded", () => {
const el = document.getElementById("nav-ray-dashboard");
if (el) el.href = curOriginWithPort(8265);
});
""".strip()

View File

@ -0,0 +1,2 @@
from __future__ import annotations

View File

@ -0,0 +1,24 @@
from __future__ import annotations
import html
def nav(active: str) -> str:
links = [
("login", "/ui/login", "Login"),
("tasks", "/ui/tasks", "Tasks"),
("serving", "/ui/serving", "Serving"),
("new", "/ui/tasks/new", "New Task"),
("data", "/ui/data", "Data"),
("admin", "/ui/admin", "Admin"),
("ray", "#", "Ray Dashboard"),
]
items: list[str] = []
for key, href, label in links:
cls = "active" if key == active else ""
extra = ""
if key == "ray":
extra = ' id="nav-ray-dashboard" target="_blank" rel="noopener"'
items.append(f'<a class="{cls}" href="{href}"{extra}>{html.escape(label)}</a>')
return "\n".join(items)

View File

@ -0,0 +1,36 @@
from __future__ import annotations
import html
from ..assets.base_css import BASE_CSS
from ..assets.base_js import BASE_JS
from .nav import nav
def page(title: str, active: str, body: str, script: str = "") -> str:
return f"""<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>{html.escape(title)}</title>
<style>{BASE_CSS}</style>
</head>
<body>
<div class="layout">
<nav class="nav">
<div class="brand">Argus MVP</div>
{nav(active)}
<div style="margin-top:12px" class="muted">
Token stored in browser localStorage as <code>mvp_token</code>.
</div>
</nav>
<main class="main">
{body}
</main>
</div>
<script>{BASE_JS}</script>
<script>{script}</script>
</body>
</html>"""

View File

@ -0,0 +1,2 @@
from __future__ import annotations

View File

@ -0,0 +1,124 @@
from __future__ import annotations
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from ..layout.page import page
def register(app: FastAPI) -> None:
@app.get("/ui/admin")
async def ui_admin() -> HTMLResponse:
body = """
<h1>Admin</h1>
<div class="card">
<div class="muted">
This page requires the <code>admin</code> token (set it in <a href="/ui/login">Login</a>).
</div>
<div style="height:14px"></div>
<h3>Create user</h3>
<div class="row">
<input id="new-user-id" placeholder="user_id (e.g. alice)" style="max-width:320px" />
<input id="new-display-name" placeholder="display_name (optional)" style="max-width:320px" />
<button class="btn" id="create-user">Create</button>
</div>
<div style="height:10px"></div>
<pre id="create-msg" class="muted"></pre>
<div style="height:14px"></div>
<div class="row">
<button class="btn" id="refresh">Refresh</button>
</div>
<div style="height:10px"></div>
<div id="out" class="muted">Loading...</div>
</div>
""".strip()
script = """
const out = document.getElementById("out");
const createMsg = document.getElementById("create-msg");
const userIdEl = document.getElementById("new-user-id");
const displayNameEl = document.getElementById("new-display-name");
function esc(s) {
s = String(s || "");
return s.replaceAll("&","&amp;").replaceAll("<","&lt;").replaceAll(">","&gt;");
}
async function refresh() {
out.textContent = "Loading...";
try {
const obj = await apiJson("/api/v2/users?limit=200");
const users = (obj.users || []);
function row(u) {
const uid = u.user_id;
const tok = u.token || "";
const tokShort = tok ? (tok.length > 18 ? (tok.slice(0, 18) + "") : tok) : "";
const created = u.created_at || "";
const updated = u.updated_at || "";
const tCreated = u.token_created_at || "";
const tUsed = u.token_last_used_at || "";
return `<tr>
<td><code>${esc(uid)}</code></td>
<td class="muted">${esc(created)}</td>
<td class="muted">${esc(updated)}</td>
<td>
<div class="row" style="gap:8px">
<code title="${esc(tok)}">${esc(tokShort)}</code>
<button class="btn" data-copy="${esc(tok)}">Copy</button>
<button class="btn" data-issue="${esc(uid)}">Issue token</button>
</div>
<div class="muted" style="margin-top:6px">token_created_at: ${esc(tCreated)}; last_used_at: ${esc(tUsed)}</div>
</td>
</tr>`;
}
out.innerHTML = `
<table>
<thead><tr><th>User</th><th>Created</th><th>Updated</th><th>Token</th></tr></thead>
<tbody>${users.map(row).join("") || "<tr><td colspan=4 class=muted>(none)</td></tr>"}</tbody>
</table>
`;
for (const btn of out.querySelectorAll("button[data-copy]")) {
btn.onclick = async () => { await copyText(btn.getAttribute("data-copy") || ""); };
}
for (const btn of out.querySelectorAll("button[data-issue]")) {
btn.onclick = async () => {
const uid = btn.getAttribute("data-issue");
if (!uid) return;
try {
const r = await apiJson("/api/v2/users/" + encodeURIComponent(uid) + "/tokens", { method: "POST" });
createMsg.textContent = "Issued token:\\n" + fmtJson(r);
await refresh();
} catch (e) {
createMsg.textContent = "Error issuing token: " + (e.status || "") + "\\n" + (e.body || String(e));
}
};
}
} catch (e) {
out.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
}
}
document.getElementById("refresh").onclick = refresh;
document.getElementById("create-user").onclick = async () => {
createMsg.textContent = "Creating...";
const user_id = (userIdEl.value || "").trim();
const display_name = (displayNameEl.value || "").trim();
if (!user_id) { createMsg.textContent = "user_id is required"; return; }
const payload = { user_id: user_id };
if (display_name) payload.display_name = display_name;
try {
const r = await apiJson("/api/v2/users", { method: "POST", headers: {"Content-Type":"application/json"}, body: JSON.stringify(payload) });
createMsg.textContent = "Created:\\n" + fmtJson(r);
userIdEl.value = "";
displayNameEl.value = "";
await refresh();
} catch (e) {
createMsg.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
}
};
refresh();
""".strip()
return HTMLResponse(content=page("Admin", "admin", body, script))

View File

@ -0,0 +1,93 @@
from __future__ import annotations
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from ..layout.page import page
def register(app: FastAPI) -> None:
@app.get("/ui/data")
async def ui_data() -> HTMLResponse:
body = """
<h1>Data</h1>
<div class="card">
<div class="muted">User files live under your home directory. Keep long-term artifacts in <code>models/</code> or <code>datasets/</code>.</div>
<div style="height:14px"></div>
<div class="row">
<div style="flex:1; min-width:260px">
<div class="muted">Username</div>
<div style="height:6px"></div>
<div class="row" style="gap:8px">
<input id="u" readonly />
<button class="btn" id="copy-u">Copy</button>
</div>
</div>
<div style="flex:1; min-width:260px">
<div class="muted">SFTPGo password</div>
<div style="height:6px"></div>
<div class="row" style="gap:8px">
<input id="p" placeholder="Click Reset to generate..." />
<button class="btn" id="copy-p">Copy</button>
<button class="btn" id="reset-p">Reset</button>
</div>
</div>
</div>
<div style="height:12px"></div>
<div class="row">
<a class="btn" id="sftp-web" target="_blank" rel="noopener" href="#">Open SFTPGo Web Client (:8081)</a>
</div>
<div style="height:12px"></div>
<div class="muted">
You can also use an SFTP client (e.g. FileZilla) with the same username/password.
Host: <code id="sftp-host"></code>, Port: <code id="sftp-port"></code>.
</div>
<div style="height:14px"></div>
<div id="msg" class="muted"></div>
</div>
""".strip()
script = """
const msg = document.getElementById("msg");
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const u = document.getElementById("u");
const p = document.getElementById("p");
const sftpWeb = document.getElementById("sftp-web");
const sftpHost = document.getElementById("sftp-host");
const sftpPort = document.getElementById("sftp-port");
document.getElementById("copy-u").onclick = async () => { await copyText(u.value || ""); };
document.getElementById("copy-p").onclick = async () => { await copyText(p.value || ""); };
async function refresh() {
const resp = await apiFetch("/api/v2/me");
const text = await resp.text();
if (!resp.ok) { msg.textContent = "Error: " + resp.status + "\\n" + text; return; }
const obj = JSON.parse(text);
u.value = (obj.user_id || "");
const cached = mvpSftpPasswordGet();
if (cached) p.value = cached;
const host = curOriginWithPort(8081);
sftpWeb.href = host + "/web/client";
sftpHost.textContent = (obj.sftp && obj.sftp.host) ? obj.sftp.host : window.location.hostname;
sftpPort.textContent = (obj.sftp && obj.sftp.port) ? String(obj.sftp.port) : "2022";
msg.textContent = "";
}
document.getElementById("reset-p").onclick = async () => {
p.value = "";
mvpSftpPasswordSet("");
msg.textContent = "Resetting...";
const resp = await apiFetch("/api/v2/me/sftp:reset_password", { method: "POST" });
const text = await resp.text();
if (!resp.ok) { msg.textContent = "Error: " + resp.status + "\\n" + text; return; }
const obj = JSON.parse(text);
p.value = obj.password || "";
mvpSftpPasswordSet(p.value);
msg.textContent = "SFTPGo password rotated.";
};
refresh();
""".strip()
return HTMLResponse(content=page("Data", "data", body, script))

View File

@ -0,0 +1,88 @@
from __future__ import annotations
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from ..layout.page import page
def register(app: FastAPI) -> None:
@app.get("/ui/login")
async def ui_login() -> HTMLResponse:
body = """
<h1>Login</h1>
<div class="card">
<div class="muted">Paste your API token (without the <code>Bearer</code> prefix).</div>
<div style="height:10px"></div>
<input id="tok" placeholder="token..." />
<div style="height:10px"></div>
<div class="row">
<button class="btn" id="save">Save</button>
<button class="btn" id="clear">Clear</button>
<a href="/ui/tasks" class="btn" style="display:inline-block">Go to Tasks</a>
</div>
<div style="height:10px"></div>
<div id="msg" class="muted"></div>
</div>
<div style="height:12px"></div>
<div class="card">
<h3 style="margin-top:0">User Info</h3>
<div class="muted">Shown after login via <code>/api/v2/me</code>.</div>
<div style="height:10px"></div>
<div class="row">
<button class="btn" id="me-refresh">Refresh</button>
</div>
<div style="height:10px"></div>
<pre id="me" class="muted">(not loaded)</pre>
</div>
<div style="height:12px"></div>
<div class="card">
<h3 style="margin-top:0">W&B</h3>
<div class="muted">Weights &amp; Biases local server (v3.6). Metrics are written by training jobs; this UI is for viewing.</div>
<div style="height:10px"></div>
<div class="row">
<a class="btn" id="wandb-open" target="_blank" rel="noopener" href="#">Open W&amp;B (:8090)</a>
<button class="btn" id="wandb-copy-project">Copy project</button>
</div>
<div style="height:10px"></div>
<div class="muted">project: <code id="wandb-project">(unknown)</code></div>
<div class="muted">base_url (job runtime): <code id="wandb-base-url">(unknown)</code></div>
</div>
""".strip()
script = """
const tokEl = document.getElementById("tok");
const msg = document.getElementById("msg");
const me = document.getElementById("me");
const wandbOpen = document.getElementById("wandb-open");
const wandbProject = document.getElementById("wandb-project");
const wandbBaseUrl = document.getElementById("wandb-base-url");
document.getElementById("wandb-copy-project").onclick = async () => { await copyText(wandbProject.textContent || ""); };
wandbOpen.href = curOriginWithPort(8090);
tokEl.value = mvpTokenGet();
async function refreshMe() {
me.textContent = "Loading...";
try {
const obj = await apiJson("/api/v2/me");
me.textContent = fmtJson(obj);
if (obj.wandb && obj.wandb.enabled) {
wandbProject.textContent = obj.wandb.project_name || "(unknown)";
wandbBaseUrl.textContent = obj.wandb.base_url || "(unknown)";
} else {
wandbProject.textContent = "(disabled)";
wandbBaseUrl.textContent = "(disabled)";
}
} catch (e) {
me.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
wandbProject.textContent = "(error)";
wandbBaseUrl.textContent = "(error)";
}
}
document.getElementById("me-refresh").onclick = refreshMe;
document.getElementById("save").onclick = () => { mvpTokenSet(tokEl.value); msg.textContent = "Saved."; refreshMe(); };
document.getElementById("clear").onclick = () => { mvpTokenSet(""); tokEl.value = ""; msg.textContent = "Cleared."; me.textContent = "(not loaded)"; };
if (mvpTokenGet()) refreshMe();
""".strip()
return HTMLResponse(content=page("Login", "login", body, script))

View File

@ -0,0 +1,261 @@
from __future__ import annotations
import html
import json
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from ..layout.page import page
def register(app: FastAPI) -> None:
@app.get("/ui/serving")
async def ui_serving() -> HTMLResponse:
body = """
<h1>Serving</h1>
<div class="card">
<div class="row">
<button class="btn" id="refresh">Refresh</button>
<a class="btn" href="/ui/serving/new" style="display:inline-block">New Model</a>
</div>
<div style="height:10px"></div>
<div id="out" class="muted">Loading...</div>
</div>
""".strip()
script = """
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const out = document.getElementById("out");
function pill(state) {
const s = String(state || "");
if (s === "RUNNING") return `<span class="pill ok">${s}</span>`;
if (s === "FAILED") return `<span class="pill bad">${s}</span>`;
return `<span class="pill">${s}</span>`;
}
async function refresh() {
out.textContent = "Loading...";
try {
const lim = 50;
const off = Number(localStorage.getItem("mvp_serving_offset") || "0") || 0;
const resp = await apiJson("/api/v2/serve/models?limit=" + lim + "&offset=" + off + "&include_deleted=0");
const items = resp.items || [];
const hasMore = !!resp.has_more;
const pageNo = Math.floor(off / lim) + 1;
const prevDisabled = off <= 0;
const nextDisabled = !hasMore;
const baseHint = curOriginWithPort(8000) + "/serve/<model_key>/v1";
function row(m) {
const endpoint = (m.endpoint && m.endpoint.openai_base_url) || (curOriginWithPort(8000) + "/serve/" + encodeURIComponent(m.model_key) + "/v1");
return `<tr>
<td><a href="/ui/serving/${m.model_key}">${m.model_key}</a></td>
<td><code>${m.model_id}</code></td>
<td>${pill(m.state)}</td>
<td>${m.num_replicas} × ${m.gpus_per_replica} GPU</td>
<td><code>${endpoint}</code></td>
<td>${m.updated_at || ""}</td>
</tr>`;
}
const rows = items.map(row).join("");
out.innerHTML = `
<div class="row" style="justify-content: space-between; margin-bottom: 8px;">
<div class="muted">Per-model OpenAI base: <code>${baseHint}</code></div>
<div class="row">
<span class="muted">Page ${pageNo}</span>
<button class="btn" id="prev" ${prevDisabled ? "disabled" : ""}>Prev</button>
<button class="btn" id="next" ${nextDisabled ? "disabled" : ""}>Next</button>
</div>
</div>
<table>
<thead><tr><th>Model Key</th><th>Model ID</th><th>State</th><th>Resources</th><th>Endpoint</th><th>Updated</th></tr></thead>
<tbody>${rows || "<tr><td colspan=6 class=muted>(none)</td></tr>"}</tbody>
</table>
`;
const prevBtn = document.getElementById("prev");
const nextBtn = document.getElementById("next");
if (prevBtn) prevBtn.onclick = () => { localStorage.setItem("mvp_serving_offset", String(Math.max(0, off - lim))); refresh(); };
if (nextBtn) nextBtn.onclick = () => { localStorage.setItem("mvp_serving_offset", String(off + lim)); refresh(); };
} catch (e) {
let text = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
if (e.body && String(e.body).includes("serving is not enabled")) {
text = "Serving is not enabled in server config.\\nAsk admin to enable `serving:` in dev.yaml.";
}
out.textContent = text;
}
}
document.getElementById("refresh").onclick = refresh;
refresh();
""".strip()
return HTMLResponse(content=page("Serving", "serving", body, script))
@app.get("/ui/serving/new")
async def ui_serving_new() -> HTMLResponse:
example = """# ServingSpec (YAML)
# 说明:
# - model_id: 这里是 suffix平台会自动加前缀<user_id>-<YYYYMMDDHHMM>-<suffix>
# - model_source: 本地模型路径(支持 $HOME 宏;推荐使用 $HOME/common/hf 指向共享 HF cache
#
# 常用路径:
# - $HOME/common/hf -> /private/hf
# - $HOME -> /private/users/<user_id>
#
model_id: qwen-0.5b
model_source: $HOME/common/hf/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/<SNAPSHOT_HASH>
num_replicas: 1
gpus_per_replica: 1
# engine_kwargs: # 可选:透传给 vLLM
# gpu_memory_utilization: 0.4
""".strip()
body = f"""
<h1>New Model</h1>
<div class="card">
<div class="muted">Paste ServingSpec YAML and submit to <code>/api/v2/serve/models</code>.</div>
<div style="height:10px"></div>
<textarea id="yaml" rows="14">{html.escape(example)}</textarea>
<div style="height:10px"></div>
<div class="row">
<button class="btn" id="submit">Submit</button>
<a class="btn" href="/ui/serving" style="display:inline-block">Back</a>
</div>
<div style="height:10px"></div>
<pre id="out" class="muted"></pre>
</div>
""".strip()
script = """
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const out = document.getElementById("out");
document.getElementById("submit").onclick = async () => {
out.textContent = "Submitting...";
const yaml = document.getElementById("yaml").value || "";
try {
const resp = await apiJson("/api/v2/serve/models", { method: "POST", headers: { "Content-Type": "application/yaml" }, body: yaml });
out.textContent = "Created: " + resp.model_key + "\\nState: " + resp.state;
if (resp.model_key) window.location.href = "/ui/serving/" + encodeURIComponent(resp.model_key);
} catch (e) {
out.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
}
};
""".strip()
return HTMLResponse(content=page("New Model", "serving", body, script))
@app.get("/ui/serving/{model_key}")
async def ui_serving_detail(model_key: str) -> HTMLResponse:
body = f"""
<h1>Model</h1>
<div class="card">
<div class="row" style="justify-content: space-between;">
<div class="muted">model_key: <code>{html.escape(model_key)}</code></div>
<div class="row">
<a class="btn" href="/ui/serving" style="display:inline-block">Back</a>
<a class="btn" id="openai-models" target="_blank" rel="noopener" href="#">OpenAI /v1/models</a>
</div>
</div>
<div style="height:10px"></div>
<div class="row">
<label class="muted" style="min-width:120px">Scale replicas</label>
<input id="replicas" type="number" min="1" step="1" value="1" style="max-width: 180px" />
<button class="btn" id="scale">Apply</button>
<button class="btn danger" id="delete">Delete</button>
</div>
<div style="height:10px"></div>
<div id="meta" class="muted">Loading...</div>
<div style="height:12px"></div>
<h3 style="margin-top:0">Resolved Spec (YAML)</h3>
<pre id="spec" class="muted">(loading)</pre>
<div style="height:12px"></div>
<h3 style="margin-top:0">Events</h3>
<div id="events" class="muted">(loading)</div>
<div style="height:12px"></div>
<h3 style="margin-top:0">OpenAI Example</h3>
<pre id="example" class="muted">(loading)</pre>
</div>
""".strip()
script = f"""
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const modelKey = {json.dumps(model_key)};
document.getElementById("openai-models").href = curOriginWithPort(8000) + "/serve/" + encodeURIComponent(modelKey) + "/v1/models";
const meta = document.getElementById("meta");
const spec = document.getElementById("spec");
const eventsEl = document.getElementById("events");
const example = document.getElementById("example");
const replicas = document.getElementById("replicas");
function pill(state) {{
const s = String(state || "");
if (s === "RUNNING") return `<span class="pill ok">${{s}}</span>`;
if (s === "FAILED") return `<span class="pill bad">${{s}}</span>`;
return `<span class="pill">${{s}}</span>`;
}}
function renderEvents(events) {{
if (!events || !events.length) return "<div class=muted>(none)</div>";
const rows = events.map(e => {{
const payload = (e.payload_json || "");
const short = String(payload).length > 240 ? String(payload).slice(0, 240) + "..." : String(payload);
return `<tr><td>${{e.created_at || ""}}</td><td><code>${{e.event_type}}</code></td><td><pre class=muted style=\\"margin:0\\">${{short}}</pre></td></tr>`;
}}).join("");
return `<table><thead><tr><th>Time</th><th>Type</th><th>Payload</th></tr></thead><tbody>${{rows}}</tbody></table>`;
}}
async function refresh() {{
meta.textContent = "Loading...";
spec.textContent = "(loading)";
eventsEl.textContent = "(loading)";
example.textContent = "(loading)";
try {{
const obj = await apiJson("/api/v2/serve/models/" + encodeURIComponent(modelKey));
const m = obj.model || {{}};
replicas.value = String(m.num_replicas || 1);
const endpoint = (m.endpoint && m.endpoint.openai_base_url) || (curOriginWithPort(8000) + "/serve/" + encodeURIComponent(modelKey) + "/v1");
meta.innerHTML = `
<div class=row>
<div>state: ${{pill(m.state)}}</div>
<div class=muted>model_id: <code>${{m.model_id || ""}}</code></div>
<div class=muted>source: <code>${{m.model_source || ""}}</code></div>
</div>
<div class=muted>endpoint: <code>${{endpoint}}</code></div>
`;
spec.textContent = obj.resolved_spec_yaml || "";
eventsEl.innerHTML = renderEvents(obj.events || []);
const base = endpoint;
const mid = m.model_id || "";
example.textContent = `curl -sS -H 'Content-Type: application/json' \\\\\\n -X POST ${{base}}/chat/completions \\\\\\n --data-binary '{{\\"model\\":\\"${{mid}}\\",\\"messages\\":[{{\\"role\\":\\"user\\",\\"content\\":\\"hello\\"}}],\\"max_tokens\\":16,\\"stream\\":false}}' | python3 -m json.tool`;
}} catch (e) {{
meta.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
spec.textContent = "";
eventsEl.textContent = "";
example.textContent = "";
}}
}}
document.getElementById("scale").onclick = async () => {{
const n = Number(replicas.value || "1");
if (!Number.isFinite(n) || n < 1) return;
try {{
await apiJson("/api/v2/serve/models/" + encodeURIComponent(modelKey), {{ method: "PATCH", headers: {{ "Content-Type": "application/json" }}, body: JSON.stringify({{ num_replicas: n }}) }});
await refresh();
}} catch (e) {{
meta.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
}}
}};
document.getElementById("delete").onclick = async () => {{
if (!confirm("Delete this model?")) return;
try {{
await apiJson("/api/v2/serve/models/" + encodeURIComponent(modelKey), {{ method: "DELETE" }});
await refresh();
}} catch (e) {{
meta.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
}}
}};
refresh();
""".strip()
return HTMLResponse(content=page("Model", "serving", body, script))

View File

@ -0,0 +1,88 @@
from __future__ import annotations
import html
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from ..layout.page import page
def register(app: FastAPI) -> None:
@app.get("/ui/tasks/{task_id}")
async def ui_task_detail(task_id: str) -> HTMLResponse:
safe_id = html.escape(task_id)
body = f"""
<h1>Task: <code>{safe_id}</code></h1>
<div class="card">
<div class="row">
<a class="btn" href="/ui/tasks/{safe_id}/logs" style="display:inline-block">Logs</a>
<button class="btn" id="refresh">Refresh</button>
<button class="btn danger" id="cancel">Cancel</button>
<a class="btn" href="/ui/tasks" style="display:inline-block">Back</a>
</div>
<div style="height:10px"></div>
<pre id="out" class="muted">Loading...</pre>
</div>
<div style="height:12px"></div>
<div class="card">
<h3 style="margin-top:0">W&amp;B</h3>
<div class="muted">This is a best-effort hint. Run name maps to <code>ray_submission_id</code> (attempt).</div>
<div style="height:10px"></div>
<div class="row">
<a class="btn" id="wandb-open" target="_blank" rel="noopener" href="#">Open W&amp;B (:8090)</a>
<button class="btn" id="wandb-copy-run">Copy run</button>
</div>
<div style="height:10px"></div>
<div class="muted">project: <code id="wandb-project">(unknown)</code></div>
<div class="muted">run: <code id="wandb-run">(unknown)</code></div>
</div>
<div style="height:12px"></div>
<div class="card">
<h3 style="margin-top:0">TaskSpec (YAML)</h3>
<div class="muted">Resolved TaskSpec (includes default values; submission_id reflects latest attempt when available).</div>
<div style="height:10px"></div>
<pre id="spec" class="muted">Loading...</pre>
</div>
""".strip()
script = f"""
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
document.getElementById("wandb-open").href = curOriginWithPort(8090);
document.getElementById("wandb-copy-run").onclick = async () => {{ await copyText((document.getElementById("wandb-run").textContent || \"\").trim()); }};
const out = document.getElementById("out");
const spec = document.getElementById("spec");
async function refresh() {{
out.textContent = "Loading...";
spec.textContent = "Loading...";
const resp = await apiFetch("/api/v2/tasks/{task_id}");
const text = await resp.text();
if (!resp.ok) {{ out.textContent = "Error: " + resp.status + "\\n" + text; return; }}
const obj = JSON.parse(text);
out.textContent = fmtJson(obj);
const p = document.getElementById("wandb-project");
const r = document.getElementById("wandb-run");
const w = (obj.latest_attempt && obj.latest_attempt.wandb) ? obj.latest_attempt.wandb : null;
if (w) {{
p.textContent = w.project_name || "(unknown)";
r.textContent = w.run_name || "(unknown)";
}} else {{
p.textContent = "(not available)";
r.textContent = "(not available)";
}}
const resp2 = await apiFetch("/api/v2/tasks/{task_id}/spec");
const text2 = await resp2.text();
spec.textContent = resp2.ok ? text2 : ("Error: " + resp2.status + "\\n" + text2);
}}
document.getElementById("refresh").onclick = refresh;
document.getElementById("cancel").onclick = async () => {{
if (!confirm("Cancel this task?")) return;
const resp = await apiFetch("/api/v2/tasks/{task_id}:cancel", {{ method: "POST" }});
const text = await resp.text();
out.textContent = (resp.ok ? "Canceled.\\n" : "Error: " + resp.status + "\\n") + text;
setTimeout(refresh, 800);
}};
refresh();
""".strip()
return HTMLResponse(content=page(f"Task {task_id}", "tasks", body, script))

View File

@ -0,0 +1,48 @@
from __future__ import annotations
import html
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from ..layout.page import page
def register(app: FastAPI) -> None:
@app.get("/ui/tasks/{task_id}/logs")
async def ui_task_logs(task_id: str) -> HTMLResponse:
safe_id = html.escape(task_id)
body = f"""
<h1>Logs: <code>{safe_id}</code></h1>
<div class="card">
<div class="row">
<button class="btn" id="refresh">Refresh</button>
<label class="muted">Auto refresh <input type="checkbox" id="auto" /></label>
<a class="btn" href="/ui/tasks/{safe_id}" style="display:inline-block">Back</a>
</div>
<div style="height:10px"></div>
<pre id="out" class="muted">Loading...</pre>
</div>
""".strip()
script = f"""
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const out = document.getElementById("out");
let timer = null;
async function refresh() {{
const resp = await apiFetch("/api/v2/tasks/{task_id}/logs?tail=4000");
const text = await resp.text();
out.textContent = resp.ok ? text : ("Error: " + resp.status + "\\n" + text);
}}
document.getElementById("refresh").onclick = refresh;
document.getElementById("auto").onchange = (e) => {{
if (e.target.checked) {{
timer = setInterval(refresh, 2000);
}} else {{
if (timer) clearInterval(timer);
timer = null;
}}
}};
refresh();
""".strip()
return HTMLResponse(content=page(f"Logs {task_id}", "tasks", body, script))

View File

@ -0,0 +1,538 @@
from __future__ import annotations
import html
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from ..layout.page import page
def register(app: FastAPI) -> None:
@app.get("/ui/tasks/new")
async def ui_new_task() -> HTMLResponse:
ppo = """# PPO TaskSpec (YAML)
workload: ppo # 任务类型必填ppogrposft
code_path: /private/common/code/verl/verl_repo # 代码路径必填v3.0 固定使用 common 下的 verl 快照(不支持用户自定义代码)
model_id: Qwen/Qwen2.5-0.5B-Instruct # 基础模型必填HuggingFace 模型 ID 或 /private/... 本地模型路径
train_file: /private/common/datasets/gsm8k/train.parquet # 训练数据必填parquet 文件路径(支持 /private/common/datasets 或 /private/users/<user>/datasets
val_file: /private/common/datasets/gsm8k/test.parquet # 验证数据必填parquet 文件路径VERL 侧会用来构建 val dataset不能为 null
# nnodes: 2 # 训练节点数可选默认2
# n_gpus_per_node: 4 # 每节点 GPU 数可选默认4
# total_epochs: 1 # 总训练 epoch可选默认1
# total_training_steps: null # 总训练 step可选默认null不传则让 VERL 按 epochs 和数据长度自动推导)
# save_freq: 10 # checkpoint 保存频率step可选默认10
# test_freq: null # 验证频率step可选默认null训练端会当成 -1=不验证)
# submission_id: "" # Ray submission_id可选默认空通常由服务自动生成无需填写
""".strip()
grpo = """# GRPO TaskSpec (YAML)
workload: grpo # 任务类型必填ppogrposftgrpo 会自动启用对应的算法配置)
code_path: /private/common/code/verl/verl_repo # 代码路径必填v3.0 固定使用 common 下的 verl 快照(不支持用户自定义代码)
model_id: Qwen/Qwen2.5-0.5B-Instruct # 基础模型必填HuggingFace 模型 ID 或 /private/... 本地模型路径
train_file: /private/common/datasets/gsm8k/train.parquet # 训练数据必填parquet 文件路径(支持 /private/common/datasets 或 /private/users/<user>/datasets
val_file: /private/common/datasets/gsm8k/test.parquet # 验证数据必填parquet 文件路径VERL 侧会用来构建 val dataset不能为 null
# nnodes: 2 # 训练节点数可选默认2
# n_gpus_per_node: 4 # 每节点 GPU 数可选默认4
# total_epochs: 1 # 总训练 epoch可选默认1
# total_training_steps: null # 总训练 step可选默认null不传则让 VERL 按 epochs 和数据长度自动推导)
# save_freq: 10 # checkpoint 保存频率step可选默认10
# test_freq: null # 验证频率step可选默认null训练端会当成 -1=不验证)
# submission_id: "" # Ray submission_id可选默认空通常由服务自动生成无需填写
""".strip()
sft = """# SFT TaskSpec (YAML)
workload: sft # 任务类型必填ppogrposft
code_path: /private/common/code/verl/verl_repo # 代码路径必填v3.0 固定使用 common 下的 verl 快照(不支持用户自定义代码)
model_id: Qwen/Qwen2.5-0.5B-Instruct # 基础模型必填HuggingFace 模型 ID 或 /private/... 本地模型路径
train_file: /private/common/datasets/gsm8k_sft/train.parquet # 训练数据必填parquet 文件路径(支持 /private/common/datasets 或 /private/users/<user>/datasets
val_file: /private/common/datasets/gsm8k_sft/test.parquet # 验证数据必填parquet 文件路径VERL 侧会用来构建 val dataset不能为 null
# nnodes: 2 # 训练节点数可选默认2单机可设 1
# n_gpus_per_node: 4 # 每节点 GPU 数可选默认4单卡可设 1
# total_epochs: 1 # 总训练 epoch可选默认1
# total_training_steps: null # 总训练 step可选默认null不传则让 VERL 按 epochs 和数据长度自动推导)
# save_freq: 10 # checkpoint 保存频率step可选默认10
# test_freq: null # 验证频率step可选默认null训练端会当成 -1=不验证)
# trainer_device: cpu # 仅 SFT 生效driver 侧 device可选默认cpu
# submission_id: "" # Ray submission_id可选默认空通常由服务自动生成无需填写
""".strip()
adv = """# Advanced TaskSpec (YAML) - v3.6
kind: advanced # 任务类型必填advanced自定义 command
# 说明:平台统一按 "advanced" 做任务分类与 task_id 命名(不按 ppo/grpo/sft 细分)。
#
# 自定义训练命令:平台会做 $HOME 宏替换:
# - $HOME -> /private/users/<user>
# - $HOME/common/datasets -> /private/datasets共享只读数据
# - $HOME/common/hf -> /private/hf共享只读 HF cache
#
# W&Bv3.6
# - 平台会在 runtime_env 注入 WANDB_BASE_URL/WANDB_API_KEY/WANDB_DIR
# - 以及注入以下 env供 Advanced command 使用(无需用户手动修改 project
# - MVP_TRAINER_LOGGER: "console" 或 "[console,wandb]"
# - MVP_WANDB_PROJECT: "<user_id>_project"
# - MVP_WANDB_RUN: "<ray_submission_id>"
#
nnodes: 2 # 训练节点数(必填):用于平台队列调度与资源预检查
n_gpus_per_node: 4 # 每节点 GPU 数(必填):用于平台队列调度与资源预检查
command: |
PYTHONUNBUFFERED=1 \
python3 -m verl.trainer.main_ppo \
data.train_files=$HOME/common/datasets/gsm8k/train.parquet \
data.val_files=$HOME/common/datasets/gsm8k/test.parquet \
data.train_batch_size=256 \
data.max_prompt_length=512 \
data.max_response_length=512 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
critic.optim.lr=1e-5 \
critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
critic.ppo_micro_batch_size_per_gpu=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=${MVP_TRAINER_LOGGER} \
trainer.project_name=${MVP_WANDB_PROJECT} \
trainer.experiment_name=${MVP_WANDB_RUN} \
trainer.val_before_train=False \
trainer.nnodes=2 \
trainer.n_gpus_per_node=4 \
trainer.total_epochs=1 \
trainer.total_training_steps=10 \
trainer.save_freq=10 \
trainer.test_freq=-1 \
trainer.resume_mode=disable \
trainer.default_local_dir=checkpoints \
+ray_kwargs.ray_init.address=auto \
hydra.run.dir=logs/hydra
# 可选:自定义 reward方式 A直接写在 command 里)
# command 里增加如下 overrides
# custom_reward_function.path=$HOME/code/reward.py
# custom_reward_function.name=compute_score
""".strip()
merge = """# Model Merge (YAML) - v3.5 (Advanced command)
# 用途:将 VERL 训练产生的 FSDP 分片 checkpoint 合并为 HuggingFace 格式目录。
#
# 你需要把 <RAY_SUBMISSION_ID> 和 <GLOBAL_STEP> 替换成真实值:
# - submission id在 Tasks 详情页里看到的 `ray_submission_id`(如 mvp2-...--a01
# - global_step对应 checkpoints 下的 global_step_xxx 目录(如 global_step_10
#
# 注意:这里不需要 GPU建议把 n_gpus_per_node 设为 0这样不受训练任务占用 GPU 的影响。
kind: advanced
nnodes: 1
n_gpus_per_node: 0
command: |
python3 -m verl.model_merger merge \
--backend fsdp \
--local_dir $HOME/jobs/<RAY_SUBMISSION_ID>/checkpoints/<GLOBAL_STEP>/actor \
--target_dir $HOME/jobs/<RAY_SUBMISSION_ID>/checkpoints/<GLOBAL_STEP>/actor/huggingface
""".strip()
evaluation = """# Evaluation (YAML) - v3.6 (Advanced command)
# 用途:对“已生成的结果 parquet”做离线评估基于 custom reward function示例使用 VERL 内置 main_eval。
#
# 说明:
# - 你需要准备一个包含生成 responses 的 parquet并替换 <EVAL_PARQUET>(示例放在某个 job 目录下)。
# - main_eval 会在 Ray 上并发计算,因此这里也用 +ray_kwargs 连接已有 Ray 集群。
#
kind: advanced
nnodes: 1
n_gpus_per_node: 0
command: |
PYTHONUNBUFFERED=1 \
python3 -m verl.trainer.main_eval \
data.path=$HOME/jobs/<RAY_SUBMISSION_ID>/outputs/<EVAL_PARQUET>.parquet \
+ray_kwargs.ray_init.address=auto
# 可选:指定 parquet schema按你的 parquet 列名调整)
# data.response_key=responses
# data.data_source_key=data_source
# data.reward_model_key=reward_model
#
# 可选:自定义 reward方式 A用户自行提供 reward.py
# custom_reward_function.path=$HOME/code/reward.py
# custom_reward_function.name=compute_score
""".strip()
body = f"""
<h1>New Task</h1>
<div class="card">
<div class="muted">
Paste TaskSpec YAML and submit to API server.
Basic tasks require <code>code_path</code>; Advanced tasks use <code>kind: advanced</code> with a custom <code>command</code>.
</div>
<div style="height:10px"></div>
<div class="row">
<button class="btn" id="mode-yaml">YAML 模式</button>
<button class="btn" id="mode-form">表单模式</button>
</div>
<div style="height:10px"></div>
<div id="panel-yaml">
<div class="row">
<button class="btn" id="tpl-ppo">PPO example</button>
<button class="btn" id="tpl-grpo">GRPO example</button>
<button class="btn" id="tpl-sft">SFT example</button>
<button class="btn" id="tpl-adv">Advanced example</button>
<button class="btn" id="tpl-eval">Evaluation example</button>
<button class="btn" id="tpl-merge">Model merge example</button>
</div>
<div style="height:10px"></div>
<textarea id="yaml" rows="16">{html.escape(ppo)}</textarea>
</div>
<div id="panel-form" style="display:none">
<div class="row">
<div style="min-width:240px">
<div class="muted">模板</div>
<div style="height:6px"></div>
<select id="f-tpl" style="width:240px; padding:10px 12px; border-radius: 12px; background: rgba(255,255,255,0.06); color: var(--fg); border: 1px solid rgba(255,255,255,0.12);">
<option value="ppo">PPO</option>
<option value="grpo">GRPO</option>
<option value="sft">SFT</option>
<option value="advanced">Advanced</option>
<option value="merge">Model Merge</option>
</select>
</div>
<div class="muted" style="flex:1; min-width:260px">
表单模式会生成 TaskSpec YAML 并用同一个 API 提交可随时切回 YAML 模式查看/编辑生成结果
</div>
</div>
<div style="height:14px"></div>
<div id="f-basic">
<div class="row">
<div style="flex:1; min-width:260px">
<div class="muted">code_path</div>
<div style="height:6px"></div>
<input id="f-code-path" />
</div>
<div style="flex:1; min-width:260px">
<div class="muted">model_id</div>
<div style="height:6px"></div>
<input id="f-model-id" />
</div>
</div>
<div style="height:10px"></div>
<div class="row">
<div style="flex:1; min-width:260px">
<div class="muted">train_file</div>
<div style="height:6px"></div>
<input id="f-train-file" />
</div>
<div style="flex:1; min-width:260px">
<div class="muted">val_file</div>
<div style="height:6px"></div>
<input id="f-val-file" />
</div>
</div>
<div style="height:10px"></div>
<div class="row">
<div style="min-width:220px">
<div class="muted">nnodes</div>
<div style="height:6px"></div>
<div class="row" style="gap:8px">
<button class="btn" id="f-nnodes-dec">-</button>
<input id="f-nnodes" type="number" min="1" step="1" style="width:120px" />
<button class="btn" id="f-nnodes-inc">+</button>
</div>
</div>
<div style="min-width:240px">
<div class="muted">n_gpus_per_node</div>
<div style="height:6px"></div>
<div class="row" style="gap:8px">
<button class="btn" id="f-gpn-dec">-</button>
<input id="f-gpn" type="number" min="0" step="1" style="width:120px" />
<button class="btn" id="f-gpn-inc">+</button>
</div>
</div>
<div style="min-width:220px">
<div class="muted">total_epochs</div>
<div style="height:6px"></div>
<div class="row" style="gap:8px">
<button class="btn" id="f-epochs-dec">-</button>
<input id="f-epochs" type="number" min="1" step="1" style="width:120px" />
<button class="btn" id="f-epochs-inc">+</button>
</div>
</div>
<div style="min-width:260px">
<div class="muted">total_training_steps可选</div>
<div style="height:6px"></div>
<input id="f-steps" type="number" min="1" step="1" placeholder="留空表示不传(让 VERL 自动推导)" />
</div>
</div>
<div style="height:10px"></div>
<div class="row">
<div style="min-width:220px">
<div class="muted">save_freq</div>
<div style="height:6px"></div>
<input id="f-save-freq" type="number" min="1" step="1" style="width:160px" />
</div>
<div style="min-width:260px">
<div class="muted">test_freq可选</div>
<div style="height:6px"></div>
<input id="f-test-freq" type="number" min="1" step="1" placeholder="留空表示不传(训练端会当成 -1=不验证)" />
</div>
<div id="f-sft-device-wrap" style="min-width:260px; display:none">
<div class="muted">trainer_device SFT</div>
<div style="height:6px"></div>
<select id="f-trainer-device" style="width:200px; padding:10px 12px; border-radius: 12px; background: rgba(255,255,255,0.06); color: var(--fg); border: 1px solid rgba(255,255,255,0.12);">
<option value="cpu">cpu</option>
<option value="cuda">cuda</option>
</select>
</div>
</div>
</div>
<div id="f-adv" style="display:none">
<div class="row">
<div style="min-width:220px">
<div class="muted">nnodes</div>
<div style="height:6px"></div>
<div class="row" style="gap:8px">
<button class="btn" id="f-adv-nnodes-dec">-</button>
<input id="f-adv-nnodes" type="number" min="1" step="1" style="width:120px" />
<button class="btn" id="f-adv-nnodes-inc">+</button>
</div>
</div>
<div style="min-width:240px">
<div class="muted">n_gpus_per_node</div>
<div style="height:6px"></div>
<div class="row" style="gap:8px">
<button class="btn" id="f-adv-gpn-dec">-</button>
<input id="f-adv-gpn" type="number" min="0" step="1" style="width:120px" />
<button class="btn" id="f-adv-gpn-inc">+</button>
</div>
</div>
</div>
<div style="height:10px"></div>
<div class="muted">command</div>
<div style="height:6px"></div>
<textarea id="f-command" rows="10"></textarea>
<div style="height:10px"></div>
<div class="muted">
Tips:
Advanced 模式只允许执行 <code>python3 -m verl.*</code>
如果你需要自定义 reward可把 <code>reward.py</code> 上传到 <code>$HOME/code/</code> 并在 command 里引用
</div>
</div>
</div>
<div style="height:10px"></div>
<div class="row">
<button class="btn" id="submit">Submit</button>
<a class="btn" href="/ui/tasks" style="display:inline-block">Back</a>
</div>
<div style="height:10px"></div>
<pre id="msg" class="muted"></pre>
</div>
""".strip()
tpl_ppo = ppo.replace("\n", "\\n").replace('"', '\\"')
tpl_grpo = grpo.replace("\n", "\\n").replace('"', '\\"')
tpl_sft = sft.replace("\n", "\\n").replace('"', '\\"')
tpl_adv = adv.replace("\n", "\\n").replace('"', '\\"')
tpl_eval = evaluation.replace("\n", "\\n").replace('"', '\\"')
tpl_merge = merge.replace("\n", "\\n").replace('"', '\\"')
script = (
"""
document.getElementById("nav-ray-dashboard").href = curOriginWithPort(8265);
const yamlEl = document.getElementById("yaml");
const msg = document.getElementById("msg");
const TPL_PPO = "__TPL_PPO__";
const TPL_GRPO = "__TPL_GRPO__";
const TPL_SFT = "__TPL_SFT__";
const TPL_ADV = "__TPL_ADV__";
const TPL_EVAL = "__TPL_EVAL__";
const TPL_MERGE = "__TPL_MERGE__";
function showMode(mode) {
const a = document.getElementById("mode-yaml");
const b = document.getElementById("mode-form");
const py = document.getElementById("panel-yaml");
const pf = document.getElementById("panel-form");
if (mode === "yaml") {
a.classList.add("active"); b.classList.remove("active");
py.style.display = ""; pf.style.display = "none";
} else {
b.classList.add("active"); a.classList.remove("active");
pf.style.display = ""; py.style.display = "none";
}
}
function yamlQuote(v) {
v = String(v || "");
if (!v) return '""';
if (/^[A-Za-z0-9_./:-]+$/.test(v) && !v.includes("#")) return v;
return '"' + v.replaceAll("\\\\","\\\\\\\\").replaceAll('"','\\\\"') + '"';
}
function updateFormVisibility() {
const tpl = document.getElementById("f-tpl").value;
const basic = document.getElementById("f-basic");
const adv = document.getElementById("f-adv");
const sftDev = document.getElementById("f-sft-device-wrap");
if (tpl === "advanced" || tpl === "merge") {
basic.style.display = "none";
adv.style.display = "";
sftDev.style.display = "none";
} else {
basic.style.display = "";
adv.style.display = "none";
sftDev.style.display = (tpl === "sft") ? "" : "none";
}
}
function updatePreview() {
const tpl = document.getElementById("f-tpl").value;
if (tpl === "advanced" || tpl === "merge") {
const nnodes = Number(document.getElementById("f-adv-nnodes").value || "1") || 1;
const gpn = Number(document.getElementById("f-adv-gpn").value || "0") || 0;
const cmd = (document.getElementById("f-command").value || "").trimEnd();
const out = `kind: advanced\\n` +
`nnodes: ${nnodes}\\n` +
`n_gpus_per_node: ${gpn}\\n\\n` +
`command: |\\n` +
cmd.split("\\n").map(line => " " + line).join("\\n") + "\\n";
return out;
}
const workload = tpl;
const codePath = document.getElementById("f-code-path").value || "";
const modelId = document.getElementById("f-model-id").value || "";
const trainFile = document.getElementById("f-train-file").value || "";
const valFile = document.getElementById("f-val-file").value || "";
const nnodes = Number(document.getElementById("f-nnodes").value || "2") || 2;
const gpn = Number(document.getElementById("f-gpn").value || "4") || 4;
const epochs = Number(document.getElementById("f-epochs").value || "1") || 1;
const steps = (document.getElementById("f-steps").value || "").trim();
const saveFreq = Number(document.getElementById("f-save-freq").value || "10") || 10;
const testFreq = (document.getElementById("f-test-freq").value || "").trim();
const trainerDevice = document.getElementById("f-trainer-device").value || "cpu";
let y = "";
y += `workload: ${workload}\\n`;
y += `code_path: ${yamlQuote(codePath)}\\n`;
y += `model_id: ${yamlQuote(modelId)}\\n`;
y += `train_file: ${yamlQuote(trainFile)}\\n`;
y += `val_file: ${yamlQuote(valFile)}\\n`;
y += `nnodes: ${nnodes}\\n`;
y += `n_gpus_per_node: ${gpn}\\n`;
y += `total_epochs: ${epochs}\\n`;
if (steps) y += `total_training_steps: ${Number(steps)}\\n`;
y += `save_freq: ${saveFreq}\\n`;
if (testFreq) y += `test_freq: ${Number(testFreq)}\\n`;
if (workload === "sft") y += `trainer_device: ${trainerDevice}\\n`;
return y;
}
function bindStepper(inputId, decId, incId, min) {
const inp = document.getElementById(inputId);
document.getElementById(decId).onclick = () => { inp.value = String(Math.max(min, (Number(inp.value || "0") || 0) - 1)); inp.dispatchEvent(new Event("input")); };
document.getElementById(incId).onclick = () => { inp.value = String((Number(inp.value || "0") || 0) + 1); inp.dispatchEvent(new Event("input")); };
}
document.getElementById("tpl-ppo").onclick = () => { yamlEl.value = TPL_PPO; };
document.getElementById("tpl-grpo").onclick = () => { yamlEl.value = TPL_GRPO; };
document.getElementById("tpl-sft").onclick = () => { yamlEl.value = TPL_SFT; };
document.getElementById("tpl-adv").onclick = () => { yamlEl.value = TPL_ADV; };
document.getElementById("tpl-eval").onclick = () => { yamlEl.value = TPL_EVAL; };
document.getElementById("tpl-merge").onclick = () => { yamlEl.value = TPL_MERGE; };
document.getElementById("f-code-path").value = "/private/common/code/verl/verl_repo";
document.getElementById("f-model-id").value = "Qwen/Qwen2.5-0.5B-Instruct";
document.getElementById("f-train-file").value = "/private/common/datasets/gsm8k/train.parquet";
document.getElementById("f-val-file").value = "/private/common/datasets/gsm8k/test.parquet";
document.getElementById("f-nnodes").value = "2";
document.getElementById("f-gpn").value = "4";
document.getElementById("f-epochs").value = "1";
document.getElementById("f-save-freq").value = "10";
document.getElementById("f-trainer-device").value = "cpu";
document.getElementById("f-adv-nnodes").value = "2";
document.getElementById("f-adv-gpn").value = "4";
document.getElementById("f-command").value = `PYTHONUNBUFFERED=1 \\\\\\npython3 -m verl.trainer.main_ppo \\\\\\n data.train_files=$HOME/common/datasets/gsm8k/train.parquet \\\\\\n data.val_files=$HOME/common/datasets/gsm8k/test.parquet \\\\\\n +ray_kwargs.ray_init.address=auto`;
bindStepper("f-nnodes", "f-nnodes-dec", "f-nnodes-inc", 1);
bindStepper("f-gpn", "f-gpn-dec", "f-gpn-inc", 0);
bindStepper("f-epochs", "f-epochs-dec", "f-epochs-inc", 1);
bindStepper("f-adv-nnodes", "f-adv-nnodes-dec", "f-adv-nnodes-inc", 1);
bindStepper("f-adv-gpn", "f-adv-gpn-dec", "f-adv-gpn-inc", 0);
document.getElementById("f-tpl").onchange = () => {
const tpl = document.getElementById("f-tpl").value;
if (tpl === "ppo") {
document.getElementById("f-train-file").value = "/private/common/datasets/gsm8k/train.parquet";
document.getElementById("f-val-file").value = "/private/common/datasets/gsm8k/test.parquet";
} else if (tpl === "grpo") {
document.getElementById("f-train-file").value = "/private/common/datasets/gsm8k/train.parquet";
document.getElementById("f-val-file").value = "/private/common/datasets/gsm8k/test.parquet";
} else if (tpl === "sft") {
document.getElementById("f-train-file").value = "/private/common/datasets/gsm8k_sft/train.parquet";
document.getElementById("f-val-file").value = "/private/common/datasets/gsm8k_sft/test.parquet";
} else if (tpl === "advanced") {
document.getElementById("f-adv-nnodes").value = "2";
document.getElementById("f-adv-gpn").value = "4";
document.getElementById("f-command").value = TPL_ADV.split("command: |")[1]?.trimStart() || document.getElementById("f-command").value;
} else if (tpl === "merge") {
document.getElementById("f-adv-nnodes").value = "1";
document.getElementById("f-adv-gpn").value = "0";
document.getElementById("f-command").value = TPL_MERGE.split("command: |")[1]?.trimStart() || document.getElementById("f-command").value;
}
updateFormVisibility();
updatePreview();
};
for (const id of ["f-code-path","f-model-id","f-train-file","f-val-file","f-nnodes","f-gpn","f-epochs","f-steps","f-save-freq","f-test-freq","f-trainer-device","f-adv-nnodes","f-adv-gpn","f-command"]) {
const el = document.getElementById(id);
if (el) el.oninput = () => updatePreview();
}
document.getElementById("mode-yaml").onclick = () => {
// When switching to YAML, sync from form if form is visible.
const formVisible = document.getElementById("panel-form").style.display !== "none";
if (formVisible) yamlEl.value = updatePreview();
showMode("yaml");
};
document.getElementById("mode-form").onclick = () => {
showMode("form");
};
showMode("yaml");
document.getElementById("submit").onclick = async () => {
msg.textContent = "Submitting...";
const body = (document.getElementById("panel-form").style.display !== "none") ? updatePreview() : yamlEl.value;
const resp = await apiFetch("/api/v2/tasks", { method: "POST", headers: {"Content-Type":"text/plain"}, body });
const text = await resp.text();
if (!resp.ok) { msg.textContent = "Error: " + resp.status + "\\n" + text; return; }
const obj = JSON.parse(text);
msg.textContent = "OK: " + fmtJson(obj);
if (obj.task_id) window.location.href = "/ui/tasks/" + obj.task_id;
};
""".strip()
.replace("__TPL_PPO__", tpl_ppo)
.replace("__TPL_GRPO__", tpl_grpo)
.replace("__TPL_SFT__", tpl_sft)
.replace("__TPL_ADV__", tpl_adv)
.replace("__TPL_EVAL__", tpl_eval)
.replace("__TPL_MERGE__", tpl_merge)
)
return HTMLResponse(content=page("New Task", "new", body, script))

View File

@ -0,0 +1,101 @@
from __future__ import annotations
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from ..layout.page import page
def register(app: FastAPI) -> None:
@app.get("/ui/tasks")
async def ui_tasks() -> HTMLResponse:
body = """
<h1>Tasks</h1>
<div class="card">
<div class="row">
<button class="btn" id="refresh">Refresh</button>
<a class="btn" href="/ui/tasks/new" style="display:inline-block">New Task</a>
</div>
<div style="height:10px"></div>
<div id="out" class="muted">Loading...</div>
</div>
""".strip()
script = """
const out = document.getElementById("out");
async function refresh() {
out.textContent = "Loading...";
try {
const q = await apiJson("/api/v2/queue");
const completedLimit = 25;
const completedOffset = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0;
const done = await apiJson("/api/v2/tasks?limit=" + completedLimit + "&offset=" + completedOffset + "&states=SUCCEEDED,FAILED,CANCELED");
function pill(state) {
const s = String(state || "");
if (s === "SUCCEEDED") return `<span class="pill ok">${s}</span>`;
if (s === "FAILED") return `<span class="pill bad">${s}</span>`;
if (s === "CANCELED") return `<span class="pill">${s}</span>`;
if (s === "RUNNING") return `<span class="pill ok">${s}</span>`;
if (s === "QUEUED" || s === "PENDING_RESOURCES" || s === "SUBMITTING" || s === "SUBMITTED") return `<span class="pill">${s}</span>`;
return `<span class="pill">${s}</span>`;
}
function row(t) {
const id = t.task_id;
return `<tr>
<td><a href="/ui/tasks/${id}">${id}</a></td>
<td>${t.workload}</td>
<td>${pill(t.state)}</td>
<td>${t.nnodes} x ${t.n_gpus_per_node} GPU</td>
<td>${t.updated_at || ""}</td>
</tr>`;
}
const running = (q.running || []).map(row).join("");
const pending = (q.pending || []).map(row).join("");
const doneRows = (done.tasks || []).map(row).join("");
const pageNo = Math.floor(completedOffset / completedLimit) + 1;
const prevDisabled = completedOffset <= 0;
const nextDisabled = !done.has_more;
out.innerHTML = `
<div class="muted">Tip: configure token in <a href="/ui/login">Login</a>.</div>
<div style="height:10px"></div>
<h3>Running</h3>
<table><thead><tr><th>Task</th><th>Workload</th><th>State</th><th>Resources</th><th>Updated</th></tr></thead><tbody>${running || "<tr><td colspan=5 class=muted>(none)</td></tr>"}</tbody></table>
<div style="height:12px"></div>
<h3>Pending</h3>
<table><thead><tr><th>Task</th><th>Workload</th><th>State</th><th>Resources</th><th>Updated</th></tr></thead><tbody>${pending || "<tr><td colspan=5 class=muted>(none)</td></tr>"}</tbody></table>
<div style="height:12px"></div>
<h3>Completed</h3>
<div class="row" style="justify-content: space-between; margin-bottom: 8px;">
<div class="muted">Page ${pageNo}</div>
<div class="row">
<button class="btn" id="done-prev" ${prevDisabled ? "disabled" : ""}>Prev</button>
<button class="btn" id="done-next" ${nextDisabled ? "disabled" : ""}>Next</button>
</div>
</div>
<table><thead><tr><th>Task</th><th>Workload</th><th>State</th><th>Resources</th><th>Updated</th></tr></thead><tbody>${doneRows || "<tr><td colspan=5 class=muted>(none)</td></tr>"}</tbody></table>
`;
const prevBtn = document.getElementById("done-prev");
const nextBtn = document.getElementById("done-next");
if (prevBtn) prevBtn.onclick = () => {
const cur = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0;
const next = Math.max(0, cur - completedLimit);
localStorage.setItem("mvp_completed_offset", String(next));
refresh();
};
if (nextBtn) nextBtn.onclick = () => {
const cur = Number(localStorage.getItem("mvp_completed_offset") || "0") || 0;
const next = cur + completedLimit;
localStorage.setItem("mvp_completed_offset", String(next));
refresh();
};
} catch (e) {
out.textContent = "Error: " + (e.status || "") + "\\n" + (e.body || String(e));
}
}
document.getElementById("refresh").onclick = refresh;
refresh();
""".strip()
return HTMLResponse(content=page("Tasks", "tasks", body, script))

View File

@ -0,0 +1,22 @@
from __future__ import annotations
from fastapi import FastAPI
from fastapi.responses import RedirectResponse
from .pages import admin, data, login, serving, task_detail, task_logs, task_new, tasks
def register_ui_routes(app: FastAPI) -> None:
@app.get("/ui")
async def ui_root() -> RedirectResponse:
return RedirectResponse(url="/ui/tasks")
login.register(app)
tasks.register(app)
task_new.register(app)
task_detail.register(app)
task_logs.register(app)
serving.register(app)
data.register(app)
admin.register(app)

View File

@ -36,7 +36,7 @@ dexec "${HEAD_CONTAINER}" bash -lc "
"
echo "[head] prepare PPO dataset (gsm8k RL parquet) -> ${PPO_DATA_DIR}"
dexec "${HEAD_CONTAINER}" bash -lc "if [[ -f '${PPO_DATA_DIR}/train.parquet' && -f '${PPO_DATA_DIR}/test.parquet' ]]; then echo 'ppo_dataset_exists: skip'; else python3 /workspace/verl/examples/data_preprocess/gsm8k.py --local_save_dir '${PPO_DATA_DIR}'; fi"
dexec "${HEAD_CONTAINER}" bash -lc "if [[ -f '${PPO_DATA_DIR}/train.parquet' && -f '${PPO_DATA_DIR}/test.parquet' ]]; then echo 'ppo_dataset_exists: skip'; else PYTHONPATH=\"/workspace/verl:${PYTHONPATH:-}\" python3 /workspace/verl/examples/data_preprocess/gsm8k.py --local_save_dir '${PPO_DATA_DIR}'; fi"
echo "[head] prepare SFT dataset (gsm8k messages parquet) -> ${SFT_DATA_DIR}"
if dexec "${HEAD_CONTAINER}" bash -lc "test -f '${SFT_DATA_DIR}/train.parquet'"; then
@ -82,6 +82,7 @@ PY_CODE="$(cat <<'PY'
import os
model_id = os.environ["MODEL_ID"]
link_name = model_id.replace("/", "--")
hf_home = os.environ.get("HF_HOME", "/private/hf")
os.environ.setdefault("HF_HOME", hf_home)
@ -91,12 +92,27 @@ os.environ.setdefault("TRANSFORMERS_CACHE", os.path.join(hf_home, "transformers"
from huggingface_hub import snapshot_download
try:
snapshot_download(repo_id=model_id, local_files_only=True)
print("model_cache_exists: skip", model_id)
path = snapshot_download(repo_id=model_id, local_files_only=True)
print("model_cache_exists: skip", model_id, path)
except Exception:
print("model_cache_missing: downloading", model_id)
snapshot_download(repo_id=model_id)
print("model_cached_ok:", model_id)
path = snapshot_download(repo_id=model_id)
print("model_cached_ok:", model_id, path)
# v3.0 path policy: use a stable symlink under /private/common/models/...
common_models_dir = "/private/common/models"
os.makedirs(common_models_dir, exist_ok=True)
dst = os.path.join(common_models_dir, link_name)
try:
if os.path.islink(dst) or os.path.exists(dst):
os.unlink(dst)
except Exception:
pass
try:
os.symlink(path, dst)
print("model_common_link_ok:", dst)
except Exception as e:
print("WARN: model_common_link_failed:", dst, repr(e))
PY
)"

View File

@ -20,7 +20,7 @@ RESET_SFTPGO="${RESET_SFTPGO:-0}"
EXPECTED_RAY_NODES="${EXPECTED_RAY_NODES:-3}" # head + 2 workers
CLUSTER_NAME="${CLUSTER_NAME:-argus-ray}"
CONFIG_IN_CONTAINER="${CONFIG_IN_CONTAINER:-/workspace/mvp/configs/dev_v30.yaml}"
CONFIG_IN_CONTAINER="${CONFIG_IN_CONTAINER:-/workspace/mvp/configs/dev.yaml}"
SFTPGO_ADMIN_PASSWORD="${SFTPGO_ADMIN_PASSWORD:-my-dev-sftpgo-admin}"
export SFTPGO_ADMIN_PASSWORD
@ -105,7 +105,12 @@ submit_taskspec_inline() {
local resp
resp="$(curl -sS -H "Authorization: Bearer ${token}" -H "Content-Type: application/yaml" --data-binary "${yaml_body}" "${API_ADDR}/api/v2/tasks")"
echo "[host] submit_resp: ${resp}" >&2
printf '%s' "${resp}" | python3 -c 'import sys,json; print(json.load(sys.stdin)["task_id"])'
printf '%s' "${resp}" | python3 -c 'import sys,json; o=json.load(sys.stdin); tid=o.get("task_id",""); print(tid) if tid else sys.exit(42)'
local rc=$?
if [[ "${rc}" != "0" ]]; then
echo "ERROR: submit failed: ${resp}" >&2
return 1
fi
}
wait_task() {
@ -154,12 +159,15 @@ ray_wait_ready 60
echo "[host] wait sftpgo ready"
sftpgo_wait_ready 60 "http://127.0.0.1:8081/api/v2/token"
echo "[host] render v3.0 config with SFTPGo container IP (work around docker DNS issues)"
echo "[host] render config with SFTPGo container IP (work around docker DNS issues)"
SFTPGO_IP="$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' argus-sftpgo)"
RENDERED_CFG_HOST_PATH="/tmp/dev_v30_rendered.yaml"
sed -E "s#^(\\s*admin_api_base:) .*#\\1 \"http://${SFTPGO_IP}:8080/api/v2\"#g" "${ROOT_DIR}/configs/dev_v30.yaml" >"${RENDERED_CFG_HOST_PATH}"
docker cp "${RENDERED_CFG_HOST_PATH}" "${HEAD_CONTAINER}:/tmp/dev_v30_rendered.yaml"
CONFIG_IN_CONTAINER="/tmp/dev_v30_rendered.yaml"
RENDERED_CFG_HOST_PATH="/tmp/dev_rendered.yaml"
sed -E \
-e "s#^(\\s*admin_api_base:) .*#\\1 \"http://${SFTPGO_IP}:8080/api/v2\"#g" \
-e "/^ sftpgo:/,/^ retention:/ s#^(\\s*host:) .*#\\1 \"127.0.0.1\"#g" \
"${ROOT_DIR}/configs/dev.yaml" >"${RENDERED_CFG_HOST_PATH}"
docker cp "${RENDERED_CFG_HOST_PATH}" "${HEAD_CONTAINER}:/tmp/dev_rendered.yaml"
CONFIG_IN_CONTAINER="/tmp/dev_rendered.yaml"
echo "[host] verify head discovery record (supervised in-container)"
HEAD_IP_FILE="${SHARED_ROOT}/ray/discovery/${CLUSTER_NAME}/head.json"
@ -224,14 +232,7 @@ dexec "${HEAD_CONTAINER}" bash -lc "set -euo pipefail; \
(cp -f /private/common/datasets/gsm8k_sft/train.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/train.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k_sft/train.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/train.parquet' 2>/dev/null || true); \
(cp -f /private/common/datasets/gsm8k_sft/test.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/test.parquet' 2>/dev/null || cp -f /private/datasets/gsm8k_sft/test.parquet '/private/users/${USER_ID}/datasets/gsm8k_sft/test.parquet' 2>/dev/null || true)"
echo "[host] resolve local model snapshot path (avoid HF mirror 429 for vllm rollout)"
LOCAL_MODEL_PATH="$(dexec "${HEAD_CONTAINER}" bash -lc "python3 - <<'PY'\nimport os\nfrom huggingface_hub import snapshot_download\nmodel_id=os.environ.get('MODEL_ID','Qwen/Qwen2.5-0.5B-Instruct')\nos.environ.setdefault('HF_HOME','/private/hf')\ntry:\n p=snapshot_download(repo_id=model_id, local_files_only=True)\n print(p)\nexcept Exception as e:\n raise SystemExit(f'ERROR: model snapshot not in cache; run 30_prepare_data_and_model.sh first. {e!r}')\nPY\n" MODEL_ID='Qwen/Qwen2.5-0.5B-Instruct' | tail -n 1)"
if [[ -z "${LOCAL_MODEL_PATH}" || "${LOCAL_MODEL_PATH}" != /* ]]; then
echo "ERROR: failed to resolve LOCAL_MODEL_PATH: ${LOCAL_MODEL_PATH}" >&2
exit 1
fi
echo "[host] local_model_path: ${LOCAL_MODEL_PATH}"
LOCAL_MODEL_PATH="Qwen/Qwen2.5-0.5B-Instruct"
echo "[host] submit PPO/GRPO/SFT via API using user dataset paths"
PPO_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: ppo\nnnodes: 2\nn_gpus_per_node: 4\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/test.parquet\nmodel_id: '"${LOCAL_MODEL_PATH}"$'\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')"
GRPO_TASK_ID="$(submit_taskspec_inline "${USER_TOKEN}" $'workload: grpo\nnnodes: 2\nn_gpus_per_node: 4\ncode_path: /private/common/code/verl/verl_repo\ntrain_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/train.parquet\nval_file: /private/users/'"${USER_ID}"$'/datasets/gsm8k/test.parquet\nmodel_id: '"${LOCAL_MODEL_PATH}"$'\ntotal_epochs: 1\ntotal_training_steps: 10\nsave_freq: 10\n')"

27
walkthrough.md Normal file
View File

@ -0,0 +1,27 @@
# 源码走读
## API Server
API Server是平台服务入口提供API和web UI功能。
- 入口src/mvp/py/server.py
- 配置文件src/mvp/configs/dev.yaml解析成V2Config数据结构
- 核心模块:
- service服务层包括API服务任务调度模型调度等后台任务
- core 核心数据结构任务IDsubmission IDmodel serving ID定义
- ray 对接ray cluster的client
## service核心模块
API Server核心模块包括
- app.py: API 入口层提供webui将API请求转化为各类数据库操作以及协调SFTPGo服务
- scheduler.py定时任务调度训练任务队列提交ray job更新数据库中任务状态
- serving_reconciler.py: 定时任务模型推理调度提交ray serve app调整副本数更新数据库中模型实例状态
- db.py: 数据库操作基础工具类
- janitor.py定时清理训练任务中间过程数据
- sftpgo.py: 对接SFTPGo的客户端创建用户、重置密码、初始化用户目录等
## core核心模块
- ray_job_tool.py通过builder模板构建ray job命令提交ray job到ray集群。
- head_publisher.py: Ray head节点自动写head IP 文件到共享存储
- worker_watchdog.py: 自动检查最新head IP文件加入head的ray集群。
- models: 定义RayConfig/JobSepc/AdvancedTaskSpec等核心数据结构