Fix HYBRID_SHARD failure when world_size < available GPUs #682

Merged
RobotSail merged 1 commit into instructlab:main from rtj1:fix-hybrid-shard-world-size
Feb 3, 2026

Conversation

@rtj1 (Contributor) commented Feb 3, 2026

Summary

Fixes FSDP training failure when using HYBRID_SHARD sharding strategy on systems where world_size is less than the number of available GPUs per node.

Problem

When running with --fsdp_sharding_strategy=HYBRID_SHARD on a system with 8 GPUs but only using 2 processes (e.g., nproc_per_node=2), FSDP fails with:

ValueError: The arg 'group_size' (8) must not exceed the world size (2)

This happens because FSDP1 auto-detects num_devices_per_node from torch.cuda.device_count() and tries to create intra-node process groups of that size.
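The failing constraint can be sketched in plain Python (the function name and structure here are illustrative, not PyTorch internals; in the real code the per-node device count comes from torch.cuda.device_count()):

```python
# Illustrative sketch of why HYBRID_SHARD group creation fails:
# FSDP1 uses the per-node device count as the intra-node group size,
# and PyTorch rejects any process group larger than the world size.
def hybrid_shard_group_size(world_size: int, device_count: int) -> int:
    group_size = device_count  # auto-detected, not capped by world_size
    if group_size > world_size:
        raise ValueError(
            f"The arg 'group_size' ({group_size}) must not exceed "
            f"the world size ({world_size})"
        )
    return group_size

# world_size=2 (nproc_per_node=2) on an 8-GPU node reproduces the error:
try:
    hybrid_shard_group_size(world_size=2, device_count=8)
except ValueError as err:
    print(err)  # The arg 'group_size' (8) must not exceed the world size (2)
```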

Solution

Detect when HYBRID_SHARD would fail due to world_size < num_devices_per_node and automatically fall back to FULL_SHARD with a warning message.

Changes

  • Added check in get_fsdp_config() to compare world_size with torch.cuda.device_count()
  • Falls back to FULL_SHARD when HYBRID_SHARD would fail
  • Logs a warning to inform the user of the fallback
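A minimal sketch of the guard described above, assuming the fallback logic reads as follows (the function and variable names here are ours for illustration; the actual change lives in get_fsdp_config in src/instructlab/training/accelerator.py, where the device count comes from torch.cuda.device_count()):

```python
import logging

logger = logging.getLogger(__name__)

def resolve_sharding_strategy(requested: str, world_size: int, device_count: int) -> str:
    """Fall back to FULL_SHARD when HYBRID_SHARD cannot form intra-node groups."""
    if requested == "HYBRID_SHARD" and world_size < device_count:
        # HYBRID_SHARD would ask PyTorch for an intra-node group of
        # size `device_count`, which must not exceed `world_size`.
        logger.warning(
            "HYBRID_SHARD requires world_size (%d) >= devices per node (%d); "
            "falling back to FULL_SHARD.",
            world_size,
            device_count,
        )
        return "FULL_SHARD"
    return requested
```

The adjusted value is then used when initializing the FSDP plugin, so the rest of the configuration path is unchanged.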

Testing

Tested the fix in the scenario where world_size=2 and device_count=8:

  • Before: ValueError: The arg 'group_size' (8) must not exceed the world size (2)
  • After: Training proceeds with FULL_SHARD and warning logged
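The before/after behavior above can be simulated without GPUs (hypothetical helper names, not the repository's actual test code):

```python
def make_process_group(world_size: int, group_size: int) -> str:
    # Mirrors the PyTorch check that produced the original error.
    if group_size > world_size:
        raise ValueError(
            f"The arg 'group_size' ({group_size}) must not exceed "
            f"the world size ({world_size})"
        )
    return "intra-node group"

def pick_strategy(world_size: int, device_count: int) -> str:
    # The fix: choose FULL_SHARD before group creation can fail.
    return "FULL_SHARD" if device_count > world_size else "HYBRID_SHARD"

# Before: creating the intra-node group of size 8 fails with world_size=2.
try:
    make_process_group(world_size=2, group_size=8)
except ValueError as err:
    print(f"before: {err}")

# After: the guard selects FULL_SHARD and training proceeds.
print(f"after: {pick_strategy(world_size=2, device_count=8)}")  # after: FULL_SHARD
```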

Fixes #678

Summary by CodeRabbit

Bug Fixes

  • When the requested HYBRID_SHARD sharding strategy cannot be satisfied (world_size is smaller than the per-node GPU count), FSDP configuration now falls back to FULL_SHARD and logs a warning instead of raising a ValueError, preventing training failures across different hardware configurations.

When using FSDP with HYBRID_SHARD sharding strategy, FSDP1 auto-detects
num_devices_per_node from torch.cuda.device_count(). It then tries to
create intra-node process groups of that size, which fails when
world_size < num_devices_per_node with:

  ValueError: The arg 'group_size' (8) must not exceed the world size (2)

This fix detects when HYBRID_SHARD would fail due to this constraint
and falls back to FULL_SHARD with a warning, allowing training to
proceed when fewer processes are launched than there are GPUs on the
node.

Fixes instructlab#678
@coderabbitai (Bot) commented Feb 3, 2026

📝 Walkthrough

This change adds a runtime guard to the FSDP configuration that detects when world_size is insufficient for HYBRID_SHARD sharding strategy. When this condition is met, it logs a warning and automatically falls back to FULL_SHARD to prevent a PyTorch distributed error.

Changes

Cohort / File(s): FSDP Sharding Strategy Guard (src/instructlab/training/accelerator.py)
Summary: Adds runtime validation in get_fsdp_config to detect an incompatible HYBRID_SHARD configuration (world_size < num_devices_per_node), logs a warning, and falls back to FULL_SHARD. Refactors plugin initialization to use the adjusted sharding strategy value.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A bunny hops through HYBRID paths,
But guards prevent the world's collapse,
When shards grow too, we softly fall,
To FULL_SHARD's safety—that's all! 🎯

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately describes the main change: adding a runtime guard to prevent HYBRID_SHARD failures when world_size is less than available GPUs.
  • Linked Issues Check ✅ Passed: The PR implements the core requirement from #678: detecting insufficient world_size for HYBRID_SHARD and falling back to FULL_SHARD with a warning.
  • Out of Scope Changes Check ✅ Passed: All changes are directly related to fixing the HYBRID_SHARD failure. The modifications to accelerator.py are scoped to the get_fsdp_config function and address the problem described in issue #678.




@mergify mergify Bot added the ci-failure label Feb 3, 2026
@RobotSail (Member) left a comment


Thank you so much for your contribution @rtj1 , LGTM!

@mergify mergify Bot added the one-approval label Feb 3, 2026
@RobotSail RobotSail merged commit aa9d705 into instructlab:main Feb 3, 2026
14 of 18 checks passed


Development

Successfully merging this pull request may close these issues.

HYBRID_SHARD fails when world_size < available GPUs

2 participants