PipelineTrainer + LocalBackend dedicated mode hangs at training step 2 #656

@arcticfly

Summary

When using PipelineTrainer with LocalBackend in dedicated mode (trainer_gpu_ids/inference_gpu_ids set), training consistently hangs after completing step 1. Both GPUs drop to 0% utilization and the process never advances to step 2.

How to reproduce

Branch: test-forked-v5

uv run dev/yes-no-maybe-fork-pipeline.py

Or on a 2-GPU cloud node via SkyPilot:

sky launch dev/yes-no-maybe-fork-pipeline.sky.yaml --env WANDB_API_KEY=<key>

Requirements: 2 GPUs (H100/H200/A100-80GB), meta-llama/Llama-3.1-8B-Instruct access.

Observed behavior

Training base model from step 0 → 10...
  Eval at step 0: 70.8%
[23:01:36] INFO service.py:617: [DEDICATED] _train_dedicated: inference weights updated for step 1
<hangs indefinitely — both GPUs at 0% utilization>

Step 1 completes successfully (rollouts → train → checkpoint save → adapter reload), but step 2 never starts. Tested across 6 separate job runs; all exhibit the same hang.

What does NOT hang

  • yes-no-maybe-fork.py (same yes/no/maybe task and model, same checkpoint forking, but using a plain manual training loop instead of PipelineTrainer) — works correctly through 10 steps + fork + 2 more steps. This confirms the underlying model, rewards, and LocalBackend dedicated mode are all functional; the hang is specific to PipelineTrainer.
  • The integration test tests/integration/test_pipeline_localbackend_dedicated.py (max_steps=2, min_batch_size=1, eval_fn=None, Qwen3-0.6B) — passes.

Suspected location

Inside run_unsloth_rl_training in src/art/unsloth/train.py, one of two places:

  1. await ctx.results_queue.join() at the start of each step — if _unfinished_tasks > 0 because GRPOTrainer called log() more times than expected for step 1, this would block forever.
  2. asyncio.wait([results_queue.get(), ctx.train_task], ...) — if neither task ever completes inside the nested event loop (run via nest_asyncio), the wait blocks indefinitely, since no timeout is set.
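To illustrate suspect 1: this is a minimal standalone sketch (not ART code; `probe_join` and its arguments are hypothetical) of why `queue.join()` blocks forever when `put()` calls outnumber `task_done()` calls — the failure mode suspected if `log()` fires an extra time for step 1.

```python
import asyncio


async def probe_join(extra_puts: int) -> str:
    """Show that queue.join() only returns once task_done() balances put()."""
    q: asyncio.Queue = asyncio.Queue()
    q.put_nowait("result-a")
    for _ in range(extra_puts):
        q.put_nowait("extra-result")  # e.g. an unexpected extra log() call
    q.task_done()  # the consumer acknowledges exactly one item
    try:
        # join() would block forever when puts outnumber task_done() calls;
        # bound it here so the sketch terminates.
        await asyncio.wait_for(q.join(), timeout=0.1)
        return "join returned"
    except asyncio.TimeoutError:
        # _unfinished_tasks is a private attribute, used only for illustration.
        return f"join blocked; {q._unfinished_tasks} unfinished task(s)"


print(asyncio.run(probe_join(0)))  # balanced: join returns
print(asyncio.run(probe_join(1)))  # one extra put(): join blocks
```

With one extra `put()`, the unfinished-task counter stays above zero and `join()` never returns, matching the hypothesized hang.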

Diagnostic logging added

The branch adds print statements around both suspects and a 300-second timeout on asyncio.wait (raises TimeoutError instead of hanging silently). A new run should pinpoint the exact hang location via the printed queue state.
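For reference, a sketch of the diagnostic pattern (names here are hypothetical, not the branch's actual code): `asyncio.wait` does not raise on timeout — it returns `(done, pending)` — so surfacing the hang means checking for an empty `done` set and raising `TimeoutError` explicitly.

```python
import asyncio


async def bounded_wait(timeout: float) -> str:
    # Stand-in for a task that never finishes (e.g. a stuck results_queue.get()).
    stuck = asyncio.ensure_future(asyncio.Event().wait())
    done, pending = await asyncio.wait(
        {stuck}, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    if not done:
        # Fail loudly instead of hanging silently.
        raise TimeoutError(f"neither task completed; {len(pending)} still pending")
    return "completed"


try:
    asyncio.run(bounded_wait(0.1))
except TimeoutError as exc:
    print(exc)
```

Cancelling the pending tasks before raising avoids "Task was destroyed but it is pending!" warnings at loop shutdown.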

Key config differences vs. the passing integration test:

  • Model: Llama 3.1 8B (vs. Qwen3-0.6B)
  • eval_fn present (eval fires at step 0)
  • min_batch_size=3 (vs. 1)
  • ROLLOUTS_PER_SCENARIO=8
