Unified training framework for OpenClaw on Tinker cloud infrastructure. Supports three training methods through a single entry point:
| Method | Flag | Description |
|---|---|---|
| RL | `--method rl` | GRPO with PRM (Process Reward Model) scoring and at-least-one guarantee |
| OPD | `--method opd` | On-Policy Distillation via hindsight hints + teacher log-probs |
| Combined | `--method combine` | Weighted combination of OPD and RL advantages |

```bash
export TINKER_API_KEY="your-tinker-api-key"

# Combined method
python run.py --method combine --model-name Qwen/Qwen3-8B --prm-m 1 --batch-size 16 --w-opd 1.0 --w-rl 1.0 --train-epochs 2

# RL method
python run.py --method rl --model-name Qwen/Qwen3-8B --prm-m 3 --batch-size 16

# OPD method
python run.py --method opd --model-name Qwen/Qwen3-8B --prm-m 1 --batch-size 16
```

```
run.py                    CLI entry point (--method {rl, opd, combine})
├── config.py             Unified TinkerConfig dataclass
├── trainer.py            Training loop: rollout → score → build datums → forward_backward → optim_step
│   ├── rollout.py        RolloutWorker: launches API proxy, feeds prompts, collects completions
│   │   └── api_server.py OpenAI-compatible proxy with method-specific subclasses
│   ├── scorers.py        PRMScorer / OPDScorer / CombinedScorer
│   └── data_formatter.py TrainingSample → Tinker Datum conversion
```
- **Trainer** (`trainer.py`): Orchestrates the full loop (sketched below). Creates two Tinker clients — a LoRA training client (policy model) and a base sampling client (teacher/judge). Handles checkpoint saving at configurable intervals and graceful shutdown.
- **RolloutWorker** (`rollout.py`): Spins up a local OpenAI-compatible API server that forwards requests to the Tinker policy model. External environments (OpenClaw tasks) connect to this server. Completed sessions are queued for scoring and training.
- **API Server** (`api_server.py`): Base class `_BaseServer` provides shared infrastructure (Tinker forwarding, auth, streaming, tokenization, record management). Three subclasses handle method-specific logic:
  - `OpenClawRLServer` — PRM scoring with at-least-one guarantee
  - `OpenClawOPDServer` — Hint judge + teacher log-probs; drops turns without `next_state`
  - `OpenClawCombineServer` — Three-way dispatch (opd+rl / opd-only / rl-only)
- **Scorers** (`scorers.py`): Each scorer evaluates completed sessions and produces `TrainingSample` objects with rewards and optional teacher log-probs.
- **Data Formatter** (`data_formatter.py`): Converts `TrainingSample` batches into Tinker `Datum` objects for training. RL/OPD use scalar GRPO advantages; Combined computes per-token `w_opd * teacher_adv + w_rl * reward`.
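
The high-level control flow in `trainer.py`, reduced to a schematic sketch. The attribute and helper names (`rollout_worker.collect_batch`, `scorer.score`, `data_formatter.build_datums`, `save_checkpoint`) are illustrative, not the actual interfaces; only the step order and the two Tinker calls come from the description above.

```python
async def training_loop(trainer, max_steps: int):
    """Schematic loop: rollout -> score -> build datums -> forward_backward -> optim_step."""
    for step in range(max_steps):
        sessions = await trainer.rollout_worker.collect_batch()  # completed OpenClaw sessions
        samples = trainer.scorer.score(sessions)                 # TrainingSample objects (rewards, optional teacher log-probs)
        datums = trainer.data_formatter.build_datums(samples)    # Tinker Datum batch

        await trainer.training_client.forward_backward_async(datums, loss_fn=trainer.config.loss_fn)
        await trainer.training_client.optim_step_async(trainer.adam_params)

        if step > 0 and step % trainer.config.save_interval == 0:
            await trainer.save_checkpoint(step)                  # also refreshes the policy sampling client
```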
All parameters can be set via CLI flags or environment variables:
| Flag | Env Var | Default | Description |
|---|---|---|---|
| `--model-name` | `MODEL_NAME` | `Qwen/Qwen3-4B-Instruct-2507` | Policy model (must be Tinker-supported) |
| `--lora-rank` | `LORA_RANK` | `32` | LoRA rank for training |
| `--teacher-model-name` | `TEACHER_MODEL_NAME` | same as policy | Teacher/judge model (base, no LoRA) |

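To illustrate how a flag and its environment variable relate, here is a small hypothetical sketch, assuming the usual precedence of CLI value over env var over default; the actual resolution happens inside `config.py`'s `TinkerConfig`, and `ModelSettings`/`resolve` below are illustrative names only.

```python
import os
from dataclasses import dataclass

def resolve(cli_value, env_var: str, default):
    """Hypothetical precedence: explicit CLI value, then environment variable, then default."""
    if cli_value is not None:
        return cli_value
    return os.environ.get(env_var, default)

@dataclass
class ModelSettings:  # illustrative stand-in for the relevant TinkerConfig fields
    model_name: str
    lora_rank: int

settings = ModelSettings(
    model_name=resolve(None, "MODEL_NAME", "Qwen/Qwen3-4B-Instruct-2507"),
    lora_rank=int(resolve(None, "LORA_RANK", 32)),
)
print(settings)
```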
| Flag | Env Var | Default | Description |
|---|---|---|---|
| `--learning-rate` | `LEARNING_RATE` | `1e-4` | Optimizer learning rate |
| `--batch-size` | `BATCH_SIZE` | `4` | Samples per training step |
| `--max-steps` | `MAX_STEPS` | `1000` | Total training steps |
| `--loss-fn` | `LOSS_FN` | `ppo` | Tinker loss: `ppo`, `importance_sampling`, `cispo` |
| `--kl-loss-coef` | `KL_LOSS_COEF` | `0.0` | KL penalty coefficient |
| `--save-interval` | `SAVE_INTERVAL` | `20` | Save checkpoint every N steps |
| `--resume-from-ckpt` | `RESUME_FROM_CKPT` | (none) | Resume from checkpoint path |

| Flag | Env Var | Default | Method | Description |
|---|---|---|---|---|
| `--w-opd` | `OPENCLAW_COMBINE_W_OPD` | `1.0` | combine | OPD advantage weight |
| `--w-rl` | `OPENCLAW_COMBINE_W_RL` | `1.0` | combine | RL advantage weight |
| `--train-epochs` | `TRAIN_EPOCHS` | `1` | all | Duplicate samples N times per rollout batch (combine typically uses 2) |
| `--eval-mode` | `EVAL_MODE` | `false` | opd | Enable PRM eval scoring alongside OPD |

| Flag | Env Var | Default | Description |
|---|---|---|---|
| `--prm-m` | `PRM_M` | `3` | Number of judge samples (majority voting) |
| `--prm-temperature` | `PRM_TEMPERATURE` | `0.6` | Sampling temperature for judge |
| `--prm-max-tokens` | `PRM_MAX_TOKENS` | `4096` | Max tokens for judge response |

| Flag | Env Var | Default | Description |
|---|---|---|---|
| `--proxy-host` | `PROXY_HOST` | `0.0.0.0` | API server bind host |
| `--proxy-port` | `PROXY_PORT` | `30000` | API server bind port |
| `--served-model-name` | `SERVED_MODEL_NAME` | `qwen3-4b` | Model name in OpenAI API responses |
| `--api-key` | `SGLANG_API_KEY` | (none) | API key for proxy authentication |
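
Because the rollout proxy is OpenAI-compatible, an external environment can talk to it with a stock OpenAI client. A minimal sketch, assuming the default host/port above, that the proxy exposes the standard `/v1` routes, and that a key was configured via `--api-key` / `SGLANG_API_KEY`:

```python
from openai import OpenAI

# Point a standard OpenAI client at the local rollout proxy (defaults assumed from the table above).
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="your-proxy-api-key",  # value passed via --api-key / SGLANG_API_KEY
)

response = client.chat.completions.create(
    model="qwen3-4b",  # must match --served-model-name
    messages=[{"role": "user", "content": "Hello from an OpenClaw task"}],
)
print(response.choices[0].message.content)
```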
Standard GRPO reinforcement learning with Process Reward Model scoring:
- Policy model generates responses via the API proxy
- PRM evaluates each turn by scoring the next state (majority vote over M samples)
- Rewards: `+1` (correct), `-1` (incorrect), `0` (uncertain)
- At-least-one guarantee: if all turns in a session score ≤ 0, the best turn gets reward = +1 (see the sketch below)
- GRPO advantages (scalar reward broadcast) → Tinker Datum → training step
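
A minimal sketch of the reward logic just described; the function names are hypothetical (the real implementation lives in `scorers.py`):

```python
from collections import Counter

def prm_reward(judge_votes: list[int]) -> int:
    """Majority vote over M judge samples, each vote in {+1, -1, 0}."""
    winner, _ = Counter(judge_votes).most_common(1)[0]
    return winner

def apply_at_least_one_guarantee(turn_rewards: list[int]) -> list[int]:
    """If every turn in the session scored <= 0, promote the best-scoring turn to +1."""
    if turn_rewards and max(turn_rewards) <= 0:
        best = max(range(len(turn_rewards)), key=lambda i: turn_rewards[i])
        turn_rewards = list(turn_rewards)
        turn_rewards[best] = 1
    return turn_rewards

# Example: every turn in the session scored 0 or below, so the best turn is promoted.
rewards = apply_at_least_one_guarantee([prm_reward([0, -1, 0]), prm_reward([-1, -1, 1])])
print(rewards)  # [1, -1]
```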
On-Policy Distillation using hindsight hints and teacher knowledge:
- Policy model generates responses; environment provides `next_state` observations
- Hint judge extracts key information from `next_state` into a concise hint
- Teacher model scores the response (with hint context) to get token-level log-probs
- Advantage = per-token distillation: `teacher_lp - student_lp` (sketched below)
- All samples get reward = 1.0 (no explicit reward signal)
- Optional `--eval-mode`: also compute PRM eval scores for monitoring
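
A minimal sketch of the per-token distillation advantage, assuming teacher and student log-probs are already aligned token by token (the function name is illustrative, not the actual `scorers.py` API):

```python
def opd_advantages(teacher_lp: list[float], student_lp: list[float]) -> list[float]:
    """Per-token distillation advantage: teacher log-prob minus student log-prob."""
    assert len(teacher_lp) == len(student_lp)
    return [t - s for t, s in zip(teacher_lp, student_lp)]

# Tokens where the teacher is more confident than the student get a positive advantage.
advs = opd_advantages([-0.2, -1.5, -0.8], [-0.9, -1.1, -2.3])
print(advs)  # approximately [0.7, -0.4, 1.5]
```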
Weighted combination with three-way sample dispatch:
- OPD+RL samples (have both `next_state` and reward): get both advantage components
- OPD-only samples (`next_state` but no reward): only the teacher distillation advantage
- RL-only samples (reward but no `next_state`): only the scalar reward advantage

Combined advantage per token: `combined_adv_i = w_opd * (teacher_lp_i - student_lp_i) + w_rl * reward` (see the sketch below)
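
A minimal sketch of the three-way dispatch and the per-token combination above; the sample fields and helper name are assumptions, and the real logic is split between `api_server.py` and `data_formatter.py`:

```python
def combined_advantages(sample: dict, w_opd: float = 1.0, w_rl: float = 1.0) -> list[float]:
    """Per-token advantage: w_opd * (teacher_lp - student_lp) + w_rl * reward."""
    has_opd = sample.get("teacher_lp") is not None   # sample had a next_state (teacher log-probs exist)
    has_rl = sample.get("reward") is not None        # sample had a PRM reward
    n_tokens = len(sample["student_lp"])

    opd_term = (
        [w_opd * (t - s) for t, s in zip(sample["teacher_lp"], sample["student_lp"])]
        if has_opd else [0.0] * n_tokens
    )
    rl_term = w_rl * sample["reward"] if has_rl else 0.0  # scalar reward, broadcast over tokens
    return [o + rl_term for o in opd_term]

# OPD+RL sample: both components; an RL-only sample would pass teacher_lp=None.
example = {"teacher_lp": [-0.2, -1.5], "student_lp": [-0.9, -1.1], "reward": 1.0}
print(combined_advantages(example))  # approximately [1.7, 0.6]
```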
This project uses the Tinker cloud platform for:
- LoRA Training: `create_lora_training_client_async(base_model=..., rank=...)` for the policy model
- Sampling: `create_sampling_client_async(base_model=...)` for the teacher/judge model
- Training ops: `forward_backward_async()` + `optim_step_async()` per step (see the sketch after this list)
- Checkpointing: `save_weights_and_get_sampling_client_async()` to update the policy sampling client
- Loss functions: supports `ppo`, `importance_sampling`, `cispo`
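
A rough sketch of how these calls fit together in one training step. Only the method names listed above come from this project; the `ServiceClient` entry point, the `AdamParams` optimizer config, the exact argument names, and the omitted checkpoint arguments are assumptions about the Tinker SDK, not the code in `trainer.py`.

```python
import tinker               # Tinker SDK; reads TINKER_API_KEY from the environment (assumed)
from tinker import types    # assumed location of AdamParams / Datum

async def setup_and_step(datums, cfg):
    service = tinker.ServiceClient()  # assumed SDK entry point

    # LoRA training client for the policy model; plain sampling client for the teacher/judge.
    training_client = await service.create_lora_training_client_async(
        base_model=cfg.model_name, rank=cfg.lora_rank
    )
    teacher_client = await service.create_sampling_client_async(base_model=cfg.teacher_model_name)

    # One optimization step on a batch of Datum objects.
    await training_client.forward_backward_async(datums, loss_fn=cfg.loss_fn)  # e.g. "ppo"
    await training_client.optim_step_async(types.AdamParams(learning_rate=cfg.learning_rate))

    # Publish the updated weights as a sampling client for the rollout proxy (arguments omitted here).
    policy_sampler = await training_client.save_weights_and_get_sampling_client_async()
    return policy_sampler, teacher_client
```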