PipelineTrainer + LocalBackend dedicated mode hangs at training step 2 #656

@arcticfly

Summary

When using PipelineTrainer with LocalBackend in dedicated mode (trainer_gpu_ids/inference_gpu_ids set), training consistently hangs after completing step 1. Both GPUs drop to 0% utilization and the process never advances to step 2.

How to reproduce

Branch: test-forked-v5

uv run dev/yes-no-maybe-fork-pipeline.py

Or on a 2-GPU cloud node via SkyPilot:

sky launch dev/yes-no-maybe-fork-pipeline.sky.yaml --env WANDB_API_KEY=<key>

Requirements: 2 GPUs (H100/H200/A100-80GB), meta-llama/Llama-3.1-8B-Instruct access.

Observed behavior

Training base model from step 0 → 10...
  Eval at step 0: 70.8%
[23:01:36] INFO service.py:617: [DEDICATED] _train_dedicated: inference weights updated for step 1
<hangs indefinitely — both GPUs at 0% utilization>

Step 1 completes successfully (rollouts → train → checkpoint save → adapter reload), but step 2 never starts. Tested across 6 separate job runs; all exhibit the same hang.

What does NOT hang

  • yes-no-maybe-fork.py (same yes/no/maybe task and model, same checkpoint forking, but using a plain manual training loop instead of PipelineTrainer) — works correctly through 10 steps + fork + 2 more steps. This confirms the underlying model, rewards, and LocalBackend dedicated mode are all functional; the hang is specific to PipelineTrainer.
  • The integration test tests/integration/test_pipeline_localbackend_dedicated.py (max_steps=2, min_batch_size=1, eval_fn=None, Qwen3-0.6B) — passes.

Suspected location

Inside run_unsloth_rl_training in src/art/unsloth/train.py, one of two places:

  1. await ctx.results_queue.join() at the start of each step — if _unfinished_tasks > 0 because GRPOTrainer called log() more times than expected for step 1, this would block forever.
  2. asyncio.wait([results_queue.get(), ctx.train_task], ...) — if neither task ever completes inside the nested event loop (run via nest_asyncio), the wait blocks indefinitely, since no timeout is set.
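To illustrate suspect 1: this is a minimal standalone sketch (not ART code; `probe_join` and its arguments are hypothetical) of why `queue.join()` blocks forever when `put()` calls outnumber `task_done()` calls — the failure mode suspected if `log()` fires an extra time for step 1.

```python
import asyncio


async def probe_join(extra_puts: int) -> str:
    """Show that queue.join() only returns once task_done() balances put()."""
    q: asyncio.Queue = asyncio.Queue()
    q.put_nowait("result-a")
    for _ in range(extra_puts):
        q.put_nowait("extra-result")  # e.g. an unexpected extra log() call
    q.task_done()  # the consumer acknowledges exactly one item
    try:
        # join() would block forever when puts outnumber task_done() calls;
        # bound it here so the sketch terminates.
        await asyncio.wait_for(q.join(), timeout=0.1)
        return "join returned"
    except asyncio.TimeoutError:
        # _unfinished_tasks is a private attribute, used only for illustration.
        return f"join blocked; {q._unfinished_tasks} unfinished task(s)"


print(asyncio.run(probe_join(0)))  # balanced: join returns
print(asyncio.run(probe_join(1)))  # one extra put(): join blocks
```

With one extra `put()`, the unfinished-task counter stays above zero and `join()` never returns, matching the hypothesized hang.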

Diagnostic logging added

The branch adds print statements around both suspects and a 300-second timeout on asyncio.wait (raises TimeoutError instead of hanging silently). A new run should pinpoint the exact hang location via the printed queue state.
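For reference, a sketch of the diagnostic pattern (names here are hypothetical, not the branch's actual code): `asyncio.wait` does not raise on timeout — it returns `(done, pending)` — so surfacing the hang means checking for an empty `done` set and raising `TimeoutError` explicitly.

```python
import asyncio


async def bounded_wait(timeout: float) -> str:
    # Stand-in for a task that never finishes (e.g. a stuck results_queue.get()).
    stuck = asyncio.ensure_future(asyncio.Event().wait())
    done, pending = await asyncio.wait(
        {stuck}, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    if not done:
        # Fail loudly instead of hanging silently.
        raise TimeoutError(f"neither task completed; {len(pending)} still pending")
    return "completed"


try:
    asyncio.run(bounded_wait(0.1))
except TimeoutError as exc:
    print(exc)
```

Cancelling the pending tasks before raising avoids "Task was destroyed but it is pending!" warnings at loop shutdown.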

Key config differences vs. the passing integration test:

  • Model: Llama 3.1 8B (vs. Qwen3-0.6B)
  • eval_fn present (eval fires at step 0)
  • min_batch_size=3 (vs. 1)
  • ROLLOUTS_PER_SCENARIO=8
