## Summary

When using `PipelineTrainer` with `LocalBackend` in dedicated mode (`trainer_gpu_ids`/`inference_gpu_ids` set), training consistently hangs after completing step 1. Both GPUs drop to 0% utilization and the process never advances to step 2.
## How to reproduce

Branch: `test-forked-v5`

```
uv run dev/yes-no-maybe-fork-pipeline.py
```

Or on a 2-GPU cloud node via SkyPilot:

```
sky launch dev/yes-no-maybe-fork-pipeline.sky.yaml --env WANDB_API_KEY=<key>
```

Requirements: 2 GPUs (H100/H200/A100-80GB), `meta-llama/Llama-3.1-8B-Instruct` access.
## Observed behavior

```
Training base model from step 0 → 10...
Eval at step 0: 70.8%
[23:01:36] INFO service.py:617: [DEDICATED] _train_dedicated: inference weights updated for step 1
<hangs indefinitely — both GPUs at 0% utilization>
```

Step 1 completes successfully (rollouts → train → checkpoint save → adapter reload), but step 2 never starts. Tested across 6 separate job runs; all exhibit the same hang.
## What does NOT hang

- `yes-no-maybe-fork.py` (same yes/no/maybe task and model, same checkpoint forking, but using a plain manual training loop instead of `PipelineTrainer`) — works correctly through 10 steps + fork + 2 more steps. This confirms the underlying model, rewards, and `LocalBackend` dedicated mode are all functional; the hang is specific to `PipelineTrainer`.
- The integration test `tests/integration/test_pipeline_localbackend_dedicated.py` (`max_steps=2`, `min_batch_size=1`, `eval_fn=None`, Qwen3-0.6B) — passes.
## Suspected location

Inside `run_unsloth_rl_training` in `src/art/unsloth/train.py`, one of two places:

1. `await ctx.results_queue.join()` at the start of each step — if `_unfinished_tasks > 0` because GRPOTrainer called `log()` more times than expected for step 1, this would block forever.
2. `asyncio.wait([results_queue.get(), ctx.train_task], ...)` — if neither task completes in the context of the nested event loop (via `nest_asyncio`).
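The first suspect can be reproduced in isolation: `asyncio.Queue.join()` returns only once `task_done()` has been called exactly once per `put()`, so a single surplus `put()` (e.g. one unexpected extra `log()` call) leaves `_unfinished_tasks > 0` and the join blocks forever. A minimal standalone sketch (not ART code) that demonstrates the mechanism:

```python
import asyncio


async def main() -> str:
    queue: asyncio.Queue = asyncio.Queue()

    # Producer enqueues 3 items, but the consumer only marks 2 as done —
    # mimicking log() being called one more time than the step expects.
    for i in range(3):
        queue.put_nowait(i)
    for _ in range(2):
        queue.get_nowait()
        queue.task_done()

    try:
        # join() waits for _unfinished_tasks to reach 0; with one item
        # never marked done, it would block forever without the timeout.
        await asyncio.wait_for(queue.join(), timeout=0.5)
        return "joined"
    except asyncio.TimeoutError:
        return "hang detected"


print(asyncio.run(main()))  # -> hang detected
```

If this is the failure mode, printing `queue._unfinished_tasks` just before the `join()` should show a nonzero count at the start of step 2.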
## Diagnostic logging added

The branch adds print statements around both suspects and a 300-second timeout on `asyncio.wait` (raises `TimeoutError` instead of hanging silently). A new run should pinpoint the exact hang location via the printed queue state.
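The exact diagnostic code on the branch is not shown here; a sketch of what such a timeout guard might look like (function and parameter names are assumptions, not the real ART internals):

```python
import asyncio


async def wait_with_timeout(get_future, train_task, timeout: float = 300.0):
    """Wrap asyncio.wait so a stalled step raises instead of hanging silently."""
    done, pending = await asyncio.wait(
        {asyncio.ensure_future(get_future), asyncio.ensure_future(train_task)},
        return_when=asyncio.FIRST_COMPLETED,
        timeout=timeout,
    )
    if not done:
        # Neither the queue get nor the train task completed within the
        # window — surface the stall loudly with the pending-task state.
        raise TimeoutError(f"no task completed within {timeout}s; pending={pending}")
    return done, pending
```

With `timeout=300.0`, a hang at the second suspect would surface as a `TimeoutError` roughly five minutes into step 2 instead of an indefinite stall.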
Key config differences vs. the passing integration test:

- Model: Llama 3.1 8B (vs. Qwen3-0.6B)
- `eval_fn` present (eval fires at step 0)
- `min_batch_size=3` (vs. 1)
- `ROLLOUTS_PER_SCENARIO=8`