|
9fc61ca82b
|
configure GRPO training for groups of fewer than 10 traces
|
2025-07-01 09:50:29 +08:00 |
|
|
4295f30f9a
|
update direct stepwise training; training passes with a mock one-step trace duplicated 52 times; reward OK, curve improves as the step count grows.
|
2025-06-30 15:06:38 +00:00 |
|
|
5e632c53ac
|
temp save; put rejection-sampling CoT on hold for now; try direct RL with raw stepwise data first
|
2025-06-30 10:29:00 +00:00 |
|
|
7f4fc8b05b
|
use --system for system prompt
|
2025-06-26 22:46:09 +00:00 |
|
|
ee08da12c0
|
pass sample web test with custom ORM
|
2025-06-26 18:02:13 +00:00 |
|
|
8f168ecbef
|
test swift for qwen3 math
|
2025-06-25 17:00:47 +08:00 |
|