NASP/argus-cluster

yuyr ac9c80ed8c V3.9: refactor ui.py, splitting each web page into its own file

2026-01-12 17:24:34 +08:00

Argus Cluster

The main code in this repository lives in src/mvp/py/, and the unit tests live in src/mvp/py/tests/ (pytest.ini already sets testpaths).

Create and activate a virtual environment with uv, then install dependencies

Prerequisite: uv (Astral's Python package and virtual environment tool) is installed.

uv venv .venv

Activate the virtual environment:

source .venv/bin/activate

Install the dependencies (runtime and test) into the virtual environment:

uv pip install -r src/mvp/py/requirements.txt -r src/mvp/py/requirements-dev.txt

Run the unit tests

From the repository root, run:

pytest

To invoke pytest explicitly as a Python module:

python -m pytest

Remote development (h1): sync code, build images, initialize the shared directory, and bring up the Ray cluster

src/mvp/docker-compose.yaml uses src/mvp/ as its working directory and mounts ../../shared and ../../verl, so the recommended remote directory layout is:

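/home2/argus/infra/mvp/src/mvp/: the contents of this repository's src/mvp/
/home2/argus/infra/mvp/shared/: the shared directory (simulates/mirrors NFS)
/home2/argus/infra/mvp/verl/: the training code repository (the scripts check that this directory exists)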
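
Because the mounts are relative, they resolve against the src/mvp/ working directory. The following is a minimal sketch for checking that layout before running any script. `check_layout` is a hypothetical helper, not part of the repository, and a "missing" result for ../../shared may be expected before the prerequisite script has created it.

```shell
# Hypothetical helper (not part of the repository): verify that the
# directories docker-compose mounts via ../../shared and ../../verl exist
# relative to the compose working directory.
check_layout() {
    workdir="$1"    # e.g. /home2/argus/infra/mvp/src/mvp
    missing=0
    for rel in ../../shared ../../verl; do
        if [ -d "$workdir/$rel" ]; then
            # Resolve the relative path to show where the mount points.
            echo "ok: $rel -> $(cd "$workdir/$rel" && pwd)"
        else
            echo "missing: $rel (relative to $workdir)"
            missing=1
        fi
    done
    return "$missing"
}

# Usage on the remote host:
# check_layout /home2/argus/infra/mvp/src/mvp
```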

Sync the code to the remote host (sync only src/mvp/ so that the relative-path mounts resolve correctly):

ssh argus@h1 "mkdir -p /home2/argus/infra/mvp/src/mvp"
rsync -av --delete src/mvp/ argus@h1:/home2/argus/infra/mvp/src/mvp/

Prepare the verl repository on the remote host (skip if it already exists):

ssh argus@h1 "mkdir -p /home2/argus/infra/mvp && test -d /home2/argus/infra/mvp/verl || echo 'Missing /home2/argus/infra/mvp/verl (run git clone first)'"

# Clone verl
ssh argus@h1 "cd /home2/argus/infra/mvp && git clone https://github.com/volcengine/verl.git"
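
The existence check above only prints a message, and `git clone` fails when the destination already exists and is not empty. A small sketch of combining the two into one re-runnable step follows; `ensure_clone` is a hypothetical helper name, not something the repository provides.

```shell
# Hypothetical helper: clone a repository only when the target directory is
# not already a git checkout, so the step is safe to re-run.
ensure_clone() {
    url="$1"
    dest="$2"
    if [ -d "$dest/.git" ]; then
        echo "exists: $dest"
    else
        git clone "$url" "$dest"
    fi
}

# Usage on the remote host (same URL and path as above):
# ensure_clone https://github.com/volcengine/verl.git /home2/argus/infra/mvp/verl
```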

Log in to the remote host and initialize the shared directory environment (this creates directories such as ../../shared/...):

ssh argus@h1
cd /home2/argus/infra/mvp/src/mvp
./scripts/00_prereq_check.sh

Build the Ray node image and bring up the cluster (forcing a build is recommended on first run or when the image does not exist):

cd /home2/argus/infra/mvp/src/mvp
BUILD=1 ./scripts/01_up.sh

Verify the cluster status:

./scripts/50_status.sh
curl -sS http://127.0.0.1:8265/api/version | head
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}' | grep -E 'argus-ray-|sftpgo|wandb' || true
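
If the dashboard is not reachable immediately after the cluster comes up, a retry loop can stand in for re-running curl by hand. This is a minimal sketch; `wait_until` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary values chosen for illustration.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
wait_until() {
    attempts="$1"   # maximum number of tries
    delay="$2"      # seconds to sleep between tries
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        i=$((i + 1))
        # Skip the sleep after the final failed attempt.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    echo "not ready after $attempts attempt(s)"
    return 1
}

# Usage: poll the dashboard endpoint from above (30 tries, 2 s apart).
# wait_until 30 2 curl -fsS http://127.0.0.1:8265/api/version
```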

Shut down the cluster:

./scripts/02_down.sh

Remote (h1): port forwarding and creating a W&B user

Use the following command to forward h1's ports to your local machine:

 ssh -p 12022 -L 8265:127.0.0.1:8265 -L 8080:127.0.0.1:8080 -L 8081:127.0.0.1:8081 -L 2022:127.0.0.1:2022 -L 8090:127.0.0.1:8090 -L 8000:127.0.0.1:8000 -o ProxyJump=ssh@jump.nasp.fit:36022 nasp@192.168.20.121

Where:

8265: Ray dashboard
8080: web UI and API service port
8081: SFTPGo web client
2022: SFTPGo SFTP protocol port
8090: Weights & Biases web UI port
8000: model serving port (OpenAI API service)
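
Every forward above maps a local port to the same port on 127.0.0.1 of the remote host, so the `-L` options can be generated from a port list. A minimal sketch, assuming that same-port mapping is what you want; `forward_args` is a hypothetical helper.

```shell
# Hypothetical helper: print one "-L port:127.0.0.1:port" option per port.
forward_args() {
    for port in "$@"; do
        printf '%s %s:127.0.0.1:%s ' -L "$port" "$port"
    done
}

# Usage (same ports, jump host, and target as the command above). The
# unquoted substitution is intentional so the options split into words:
# ssh -p 12022 $(forward_args 8265 8080 8081 2022 8090 8000) \
#     -o ProxyJump=ssh@jump.nasp.fit:36022 nasp@192.168.20.121
```

Adding or dropping a service then means editing only the port list.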

Remote (h1): prepare data and models in advance, and start and stop the API server

Run all of the following commands on the remote host:

ssh argus@h1
cd /home2/argus/infra/mvp/src/mvp

Pre-download the dataset and model (written to the shared directory; the step is idempotent, so it is safe to re-run):

# Optional: specify the model to cache (default: Qwen/Qwen2.5-0.5B-Instruct)
MODEL_ID="Qwen/Qwen2.5-0.5B-Instruct" ./scripts/30_prepare_data_and_model.sh

Install the API dependencies (a best-effort install of fastapi, uvicorn, yaml, and so on inside the head container):

./scripts/12_install_api_deps.sh

Start the API server (listens on :8080; an auth token must be set):

export MVP_INTERNAL_TOKEN="your-dev-token"
./scripts/60_start_api.sh
curl -sS http://127.0.0.1:8080/docs | head
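
Since the token is required, a guard that fails early when it is missing can save a confusing startup error. Whether 60_start_api.sh performs such a check itself is not stated here, so treat this as an optional sketch; `require_env` is a hypothetical helper.

```shell
# Hypothetical guard: fail early with a clear message when a required
# environment variable is unset or empty.
require_env() {
    name="$1"
    # eval reads the variable whose name is stored in $name (POSIX sh has
    # no indirect expansion); only pass trusted, literal variable names.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "error: $name is not set" >&2
        return 1
    fi
}

# Usage:
# require_env MVP_INTERNAL_TOKEN && ./scripts/60_start_api.sh
```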

Check the API process status (based on the pid file):

./scripts/62_status_api.sh
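
For readers unfamiliar with pid-file status checks, the general technique can be sketched as follows. This is not the contents of 62_status_api.sh; `pid_alive` is a hypothetical helper, and the pid file path used by the real scripts is not given here.

```shell
# Hypothetical helper: report whether the process recorded in a pid file
# is still alive.
pid_alive() {
    pidfile="$1"
    [ -r "$pidfile" ] || { echo "no pid file: $pidfile"; return 1; }
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only tests that the process exists
    # and that we are allowed to signal it.
    if kill -0 "$pid" 2>/dev/null; then
        echo "running (pid $pid)"
    else
        echo "stale pid file (pid $pid not running)"
        return 1
    fi
}
```

A stale pid file (the file exists but the process is gone) is reported separately from a missing one, since the two usually call for different cleanup.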

Stop the API server:

./scripts/61_stop_api.sh

Remote (h1): W&B Local (wandb) license activation FAQ (port forwarding)

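If you access the W&B UI through SSH port forwarding and clicking "Update" on the "Add license" page reports "Unable to reach the backend api", the usual cause is that the W&B frontend requests the backend API through an absolute URL whose host/port differs from the address you are forwarding (for example, the UI is open at http://localhost:8090 but the frontend sends requests to http://localhost:8080).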

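Choose one of the following:

Option A (recommended, simplest): forward local port 8080 to remote port 8090, then open the UI at http://localhost:8080. Local port 8080 must be free for this, so it cannot be combined with the 8080 forward to the API service shown earlier:
- ssh -L 8080:127.0.0.1:8090 argus@h1
Option B: before starting the container on the remote host, set WANDB_HOST to match the browser address, then recreate the wandb container:
- export WANDB_HOST=http://localhost:8090
- docker compose -f /home2/argus/infra/mvp/src/mvp/docker-compose.yaml up -d --force-recreate wandb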

Alternatively, set WANDB_LICENSE=... to inject the license when the container starts, which bypasses the UI update step.
