Fix HYBRID_SHARD failure when world_size < available GPUs #682

Merged
RobotSail merged 1 commit into instructlab:main from rtj1:fix-hybrid-shard-world-size
Feb 3, 2026

Conversation

@rtj1 (Contributor) commented Feb 3, 2026

Summary

Fixes FSDP training failure when using HYBRID_SHARD sharding strategy on systems where world_size is less than the number of available GPUs per node.

Problem

When running with --fsdp_sharding_strategy=HYBRID_SHARD on a system with 8 GPUs but only using 2 processes (e.g., nproc_per_node=2), FSDP fails with:

ValueError: The arg 'group_size' (8) must not exceed the world size (2)

This happens because FSDP1 auto-detects num_devices_per_node from torch.cuda.device_count() and tries to create intra-node process groups of that size.
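The failing constraint can be sketched in plain Python (the function name and structure here are illustrative, not PyTorch internals; in the real code the per-node device count comes from torch.cuda.device_count()):

```python
# Illustrative sketch of why HYBRID_SHARD group creation fails:
# FSDP1 uses the per-node device count as the intra-node group size,
# and PyTorch rejects any process group larger than the world size.
def hybrid_shard_group_size(world_size: int, device_count: int) -> int:
    group_size = device_count  # auto-detected, not capped by world_size
    if group_size > world_size:
        raise ValueError(
            f"The arg 'group_size' ({group_size}) must not exceed "
            f"the world size ({world_size})"
        )
    return group_size

# world_size=2 (nproc_per_node=2) on an 8-GPU node reproduces the error:
try:
    hybrid_shard_group_size(world_size=2, device_count=8)
except ValueError as err:
    print(err)  # The arg 'group_size' (8) must not exceed the world size (2)
```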

Solution

Detect when HYBRID_SHARD would fail due to world_size < num_devices_per_node and automatically fall back to FULL_SHARD with a warning message.

Changes

  • Added check in get_fsdp_config() to compare world_size with torch.cuda.device_count()
  • Falls back to FULL_SHARD when HYBRID_SHARD would fail
  • Logs a warning to inform the user of the fallback
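A minimal sketch of the guard described above, assuming the fallback logic reads as follows (the function and variable names here are ours for illustration; the actual change lives in get_fsdp_config in src/instructlab/training/accelerator.py, where the device count comes from torch.cuda.device_count()):

```python
import logging

logger = logging.getLogger(__name__)

def resolve_sharding_strategy(requested: str, world_size: int, device_count: int) -> str:
    """Fall back to FULL_SHARD when HYBRID_SHARD cannot form intra-node groups."""
    if requested == "HYBRID_SHARD" and world_size < device_count:
        # HYBRID_SHARD would ask PyTorch for an intra-node group of
        # size `device_count`, which must not exceed `world_size`.
        logger.warning(
            "HYBRID_SHARD requires world_size (%d) >= devices per node (%d); "
            "falling back to FULL_SHARD.",
            world_size,
            device_count,
        )
        return "FULL_SHARD"
    return requested
```

The adjusted value is then used when initializing the FSDP plugin, so the rest of the configuration path is unchanged.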

Testing

Tested the fix in the scenario where world_size=2 and device_count=8:

  • Before: ValueError: The arg 'group_size' (8) must not exceed the world size (2)
  • After: Training proceeds with FULL_SHARD and warning logged
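The before/after behavior above can be simulated without GPUs (hypothetical helper names, not the repository's actual test code):

```python
def make_process_group(world_size: int, group_size: int) -> str:
    # Mirrors the PyTorch check that produced the original error.
    if group_size > world_size:
        raise ValueError(
            f"The arg 'group_size' ({group_size}) must not exceed "
            f"the world size ({world_size})"
        )
    return "intra-node group"

def pick_strategy(world_size: int, device_count: int) -> str:
    # The fix: choose FULL_SHARD before group creation can fail.
    return "FULL_SHARD" if device_count > world_size else "HYBRID_SHARD"

# Before: creating the intra-node group of size 8 fails with world_size=2.
try:
    make_process_group(world_size=2, group_size=8)
except ValueError as err:
    print(f"before: {err}")

# After: the guard selects FULL_SHARD and training proceeds.
print(f"after: {pick_strategy(world_size=2, device_count=8)}")  # after: FULL_SHARD
```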

Fixes #678

Summary by CodeRabbit

Bug Fixes

  • When the requested HYBRID_SHARD sharding strategy cannot be satisfied (world_size is smaller than the per-node GPU count), FSDP configuration now falls back to FULL_SHARD and logs a warning instead of raising a ValueError, preventing training failures across different hardware configurations.

When using FSDP with HYBRID_SHARD sharding strategy, FSDP1 auto-detects
num_devices_per_node from torch.cuda.device_count(). It then tries to
create intra-node process groups of that size, which fails when
world_size < num_devices_per_node with:

  ValueError: The arg 'group_size' (8) must not exceed the world size (2)

This fix detects when HYBRID_SHARD would fail due to this constraint
and falls back to FULL_SHARD with a warning, allowing training to
proceed when fewer processes are launched than there are GPUs on the
node.

Fixes instructlab#678
@coderabbitai (Bot) commented Feb 3, 2026

📝 Walkthrough

This change adds a runtime guard to the FSDP configuration that detects when world_size is insufficient for HYBRID_SHARD sharding strategy. When this condition is met, it logs a warning and automatically falls back to FULL_SHARD to prevent a PyTorch distributed error.

Changes

Cohort / File(s): FSDP Sharding Strategy Guard (src/instructlab/training/accelerator.py)
Summary: Adds runtime validation in get_fsdp_config to detect an incompatible HYBRID_SHARD configuration (world_size < num_devices_per_node), logs a warning, and falls back to FULL_SHARD. Refactors plugin initialization to use the adjusted sharding strategy value.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A bunny hops through HYBRID paths,
But guards prevent the world's collapse,
When shards grow too, we softly fall,
To FULL_SHARD's safety—that's all! 🎯

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately describes the main change: adding a runtime guard to prevent HYBRID_SHARD failures when world_size is less than available GPUs.
  • Linked Issues Check ✅ Passed: The PR implements the core requirement from #678: detecting insufficient world_size for HYBRID_SHARD and falling back to FULL_SHARD with a warning.
  • Out of Scope Changes Check ✅ Passed: All changes are directly related to fixing the HYBRID_SHARD failure. The modifications to accelerator.py are scoped to the get_fsdp_config function and address the problem described in issue #678.




@mergify mergify Bot added the ci-failure label Feb 3, 2026
@RobotSail (Member) left a comment


Thank you so much for your contribution @rtj1 , LGTM!

@mergify mergify Bot added the one-approval label Feb 3, 2026
@RobotSail RobotSail merged commit aa9d705 into instructlab:main Feb 3, 2026
14 of 18 checks passed


Development

Successfully merging this pull request may close these issues.

HYBRID_SHARD fails when world_size < available GPUs

2 participants