Releases: instructlab/training
v0.16.0 - On-Demand Checkpointing
instructlab-training 0.16.0 Release Notes
On-Demand Async Checkpointing for Preemption Recovery (#686)
Adds signal-driven checkpoint-and-exit for distributed training jobs running on OpenShift AI, KubeFlow, or multi-node bare metal, where jobs can be preempted with limited grace periods before SIGKILL.
How it works: When on_demand_checkpointing=True is set in TrainingArgs, the training process installs signal handlers for SIGTERM, SIGINT, SIGUSR1, and other termination signals. On signal receipt, a trigger file is written to /dev/shm. Worker processes check for this trigger at five synchronization points per training step (before/after forward, backward, and optimizer steps), ensuring response within one fwd+bwd cycle (~1-2s). When detected, all ranks collectively save a full-state distributed checkpoint (model + optimizer + LR scheduler) and exit gracefully.
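The signal-to-trigger-file handoff described above can be sketched as follows. All names here (trigger path, function names) are illustrative stand-ins rather than the library's actual API, and the system temp dir replaces the real /dev/shm location so the sketch runs anywhere:

```python
import os
import signal
import tempfile

# Illustrative trigger location; the release notes say the real
# implementation writes under /dev/shm.
TRIGGER_PATH = os.path.join(tempfile.gettempdir(), "ckpt_trigger")

def install_handlers() -> None:
    """On SIGTERM/SIGINT/SIGUSR1, write a trigger file instead of exiting."""
    def _handler(signum, frame):
        # Keep the handler trivial: just touch the trigger file.
        with open(TRIGGER_PATH, "w") as f:
            f.write(str(signum))
    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGUSR1):
        signal.signal(sig, _handler)

def should_checkpoint() -> bool:
    """Polled at each per-step synchronization point; a cheap file stat."""
    return os.path.exists(TRIGGER_PATH)
```

Keeping the handler to a single file write and letting the training loop poll for the file means every rank observes the request at its next synchronization point, which is what bounds the response time to roughly one forward+backward cycle.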
Numba Compatibility Fix (#702)
Relaxes the numba version pin from >=0.62.0 to >=0.61.2, resolving dependency conflicts when installing alongside vLLM (which pins numba==0.61.2). No 0.62-specific APIs are used — all numba usage (@njit, int64) is compatible with 0.61.x.
What's Changed
- Add on-demand full-state checkpointing for OpenShift AI / KubeFlow preemption by @RobotSail in #686
- build(deps): Bump astral-sh/setup-uv from 4 to 7 by @dependabot[bot] in #692
- build(deps): Bump aws-actions/configure-aws-credentials from 4.2.1 to 6.0.0 by @dependabot[bot] in #689
- Relax numba pin from >=0.62.0 to >=0.61.2 by @Maxusmusti in #702
Full Changelog: v0.15.2...v0.16.0
v0.15.2 - Further Expanding VLM Compatibility
What's Changed
- Fix trust_remote_code and gradient checkpointing for custom models by @RobotSail and @Maxusmusti in #696
Full Changelog: v0.15.1...v0.15.2
v0.15.1 - Expanded Text-Data VLM / Multi-Modal Training Support
What's Changed
- Fix Gemma 3 SFT training by detecting dual-registered VLM configs by @RobotSail in #695
Full Changelog: v0.15.0...v0.15.1
v0.15.0 - Qwen3.5 VL Model Support
What's New
Features
- Vision-Language Model (VLM) Support for Text-Only Training (#693)
  - Added automatic detection and loading of vision-language models for text-only training
  - New `vlm_utils.py` module with utilities for identifying and extracting CausalLM text backbones from VLM wrappers
  - Support for two VLM loading strategies: extracting the text backbone when a CausalLM sub-model exists, or direct VLM loading when no CausalLM variant is available
  - Improved tokenizer/text-config reconciliation for VLMs where `vocab_size` lives under `text_config`
- Mixed Attention Handling for VLMs (#693)
  - Models with `timm` vision towers now use per-component attention: `eager` for vision, `flash_attention_2` or `sdpa` for text
  - Automatic SDPA fallback for M-RoPE models (e.g. Qwen3.5 VL), which are incompatible with Flash Attention 2
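The two loading strategies above amount to a simple dispatch: use the CausalLM backbone when the wrapper exposes one, otherwise fall back to the VLM itself. A minimal sketch with stand-in classes — the attribute name `language_model` is an assumption for illustration, and real HF wrappers vary per architecture:

```python
class TextBackbone:
    """Stand-in for a CausalLM text model."""

class VLMWrapper:
    """Stand-in for a VLM wrapper that exposes a text sub-model.
    The `language_model` attribute name is assumed for this sketch."""
    def __init__(self):
        self.language_model = TextBackbone()

def load_for_text_training(model):
    """Strategy 1: extract the CausalLM backbone when one exists.
    Strategy 2: otherwise, train through the VLM wrapper directly."""
    backbone = getattr(model, "language_model", None)
    return backbone if backbone is not None else model
```

The duck-typed `getattr` probe is what makes the fallback automatic: models without a text sub-model simply pass through unchanged.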
Bug Fixes
- FSDP Wrap Policy Robustness (#693)
  - Fixed `_no_split_modules` resolution to handle models that declare module names for architectures that are not loaded (e.g. vision blocks when loading only the CausalLM)
  - FSDP wrap policy now resolves all declared module names against both the wrapper and the underlying HF model, filtering out unresolvable entries
- GPT-OSS Attention Capability Detection (#693)
  - `vllm-flash-attn3` is now gated behind a Hopper (SM 9.0+) GPU capability check, falling back to `eager` on older hardware
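The `_no_split_modules` fix boils down to resolving declared class names against what is actually loaded and dropping the rest. A hedged, torch-free sketch, where `modules` stands in for iterating `model.modules()` over both the wrapper and the underlying HF model:

```python
def resolve_no_split_modules(declared_names, modules):
    """Resolve declared `_no_split_modules` class names against the
    classes actually instantiated in the loaded model(s). Names that
    match nothing -- such as vision blocks when only the text backbone
    was loaded -- are filtered out rather than raising."""
    available = {type(m).__name__: type(m) for m in modules}
    resolved = {available[n] for n in declared_names if n in available}
    skipped = [n for n in declared_names if n not in available]
    return resolved, skipped

class TextBlock:
    """Demo class standing in for a loaded decoder layer."""
```

Filtering instead of raising is the key behavior: a name list written for the full VLM no longer breaks wrap-policy construction when only part of the architecture is instantiated.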
Improvements
- Local Mamba Kernel Preference (#693)
  - GraniteMoeHybrid models now pre-populate the Hub kernel cache with locally installed `mamba_ssm` and `causal_conv1d` to avoid PyTorch/CUDA ABI mismatches with Hub-provided kernel builds
What's Changed
- add support for qwen3.5 vl model by @RobotSail in #693
Full Changelog: v0.14.2...v0.15.0
v0.14.2 - Validation Loss and Transformers V4 Backwards Compatibility
What's Changed
- Add backwards compatibility for transformers v4.57 by @Maxusmusti in #684
- Adds validation loss and exposes it in the API by @RobotSail in #685
Full Changelog: v0.14.1...v0.14.2
v0.14.1 - Correct FSDP Config Behavior for Transformers v5
What's Changed
- fix _no_split_modules subscript error for transformers v5 by @Maxusmusti in #683
Full Changelog: v0.14.0...v0.14.1
v0.14.0 - MLflow Support & Transformers v5 Compatibility
What's New
Features
- MLflow Logging Backend (#680)
  - Added `MLflowHandler` class for logging training metrics to MLflow
  - New `TrainingArgs` fields: `mlflow_tracking_uri`, `mlflow_experiment_name`, `mlflow_run_name`
  - Added `wandb_project`, `wandb_entity`, `wandb_run_name` fields for W&B configuration
  - Added `tensorboard_log_dir` field for configurable TensorBoard log directory
  - New optional install targets: `requirements-mlflow.txt`, `requirements-wandb.txt`, `requirements-tensorboard.txt`
- Transformers v5 Compatibility (#681)
  - Updated tokenizer API calls to use `extra_special_tokens` instead of `additional_special_tokens`
  - Suppressed verbose httpx HTTP request logs from huggingface_hub
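One way to picture the new logging configuration is that each backend activates only when its key field is set, which is why each backend's dependencies can stay an optional install. A sketch using an illustrative stand-in for the relevant `TrainingArgs` fields (the real fields live on instructlab-training's `TrainingArgs`, which has many more):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LoggingConfig:
    """Illustrative stand-in mirroring the new TrainingArgs logging
    fields named in the release notes above."""
    mlflow_tracking_uri: Optional[str] = None
    mlflow_experiment_name: Optional[str] = None
    wandb_project: Optional[str] = None
    tensorboard_log_dir: Optional[str] = None

def active_backends(cfg: LoggingConfig) -> List[str]:
    """Activate a backend only when its key field is set."""
    backends = []
    if cfg.mlflow_tracking_uri:
        backends.append("mlflow")
    if cfg.wandb_project:
        backends.append("wandb")
    if cfg.tensorboard_log_dir:
        backends.append("tensorboard")
    return backends
```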
Bug Fixes
- HYBRID_SHARD Failure Fix (#682)
  - Added detection for when `world_size < num_devices_per_node` in FSDP configuration
  - Automatically falls back to `FULL_SHARD` with a warning when `HYBRID_SHARD` would fail
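The HYBRID_SHARD guard can be sketched as a pure function of the two sizes. HYBRID_SHARD shards within each node and replicates across nodes, so it assumes at least one full node of ranks; a 4-rank job on an 8-GPU node would fail. Strategy names here mirror torch FSDP's `ShardingStrategy` members, but the function itself is illustrative:

```python
import warnings

def pick_sharding_strategy(world_size: int, num_devices_per_node: int,
                           requested: str = "HYBRID_SHARD") -> str:
    """Fall back to FULL_SHARD when there are fewer ranks than devices
    on a single node, since HYBRID_SHARD's intra-node sharding group
    cannot be formed. Illustrative sketch, not the library's API."""
    if requested == "HYBRID_SHARD" and world_size < num_devices_per_node:
        warnings.warn(
            f"world_size ({world_size}) < num_devices_per_node "
            f"({num_devices_per_node}); falling back to FULL_SHARD"
        )
        return "FULL_SHARD"
    return requested
```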
Development
- Tox-UV Integration (#676)
  - Added `tox-uv` as a tox requirement with `uv-venv-runner`
  - Updated GitHub workflows to use `uv` for package installation
  - Replaced `pip install` with `uv pip install` in CI workflows
What's Changed
- adds integration for tox-uv and updates workflows to use tox-uv by @RobotSail in #676
- Add transformers v5 compatibility by @Maxusmusti in #681
- Fix HYBRID_SHARD failure when world_size < available GPUs by @rtj1 in #682
- Add MLflow support and expose logging configuration in TrainingArgs by @RobotSail in #680
New Contributors
Files Changed
18 files changed with 482 insertions and 83 deletions:
- Core training modules: `logger.py`, `config.py`, `accelerator.py`, `data_process.py`, `tokenizer_utils.py`, `main_ds.py`
- New requirements files for optional logging backends
- Updated CI workflows and tox configuration
Full Changelog: v0.13.0...v0.14.0
v0.13.0 - Pretraining Support & Optimizer Configuration
What's New
Features
- Pretraining Data Processing API (#672)
  - Added a new API for processing pretraining-style datasets
  - Documents are now chunked by a configurable `block_size`
  - Chunks are treated as independent, fully-unmasked samples
  - Updated the training loop to ingest pretraining-style datasets
  - Includes comprehensive test coverage (`test_pretraining_data_process.py`, `test_pretraining_mode.py`, `test_pretraining_sampler.py`)
- AdamW Optimizer Configuration (#674)
  - Exposed `weight_decay`, `betas`, and `eps` parameters in `TrainingArgs`
  - Users can now tune AdamW hyperparameters through the `run_training()` API
  - Provides more control over optimizer behavior
- Granite 4 Model Support (#669)
  - Added support for Granite 4 models as Mixture of Experts (MoE) models in training
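The pretraining chunking behavior can be sketched as follows: concatenate tokenized documents, then cut fixed-size blocks, each of which becomes an independent, fully-unmasked sample. Dropping the trailing partial block is an assumption of this sketch, not necessarily what the library does:

```python
from typing import List

def chunk_for_pretraining(docs: List[List[int]], block_size: int) -> List[List[int]]:
    """Concatenate tokenized documents and split into fixed-size blocks.
    Every token in a block contributes to the loss (fully unmasked);
    remainder handling here (drop the tail) is illustrative only."""
    flat = [tok for doc in docs for tok in doc]
    n_blocks = len(flat) // block_size
    return [flat[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
```

Because blocks ignore document boundaries, a block may span two documents; this is the standard trade-off pretraining pipelines make for dense, uniform-length batches.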
Bug Fixes
- Process Timing Fix (#675)
  - Fixed a race condition where a subprocess had not finished by the time its output was read
- Variable Access Fix (#668)
  - Fixed a stray invalid variable access bug
Dependencies
- Build Dependency Update (#670)
  - Updated the hynek build dependency
Files Changed
17 files changed with 1,642 insertions and 52 deletions:
- Core training modules: `data_process.py`, `main_ds.py`, `sampler.py`, `model.py`, `config.py`
- New test suites for pretraining functionality
- Updated README with new capabilities
Full Changelog
All Changes:
- 574f946 Exposes API for processing pretraining data (#672)
- 638a753 fixes bug where process isn't completed by the time the process gets read (#675)
- c495035 Expose AdamW optimizer parameters in training API (#674)
- 3d05302 Handle granite 4 as MoE models in training (#669)
- 781c36f fixes stray invalid variable access bug (#668)
- 529c2f7 bumps hynek build dep (#670)
Full Diff: v0.12.1...v0.13.0
v0.12.1 - Granite 4 support, plus extended env var and torchrun arg support
What's Changed
- Update requirements-cuda.txt to increase liger-kernel minimum by @Maxusmusti in #659
- Adds mamba-ssm[causal-conv1d] to CUDA requirements by @RobotSail in #663
- Removes Numpy version cap by @RobotSail in #664
- fix(torchrun): Omit empty arguments and correct nproc_per_node type by @szaher in #661
New Contributors
Full Changelog: v0.12.0...v0.12.1
v0.12.0 - GPT-OSS Support
Full fine-tuning now supports gpt-oss models, along with minor bug fixes that ensure correct loss calculation at higher gradient accumulation settings.
What's Changed
- Disable workflow runs on forks by default by @fynnsu in #632
- Adding GPT OSS Support by @Maxusmusti in #646
- Update numpy from <2.0 to <2.3 by @Maxusmusti in #656
- Add kernels>0.9.0 to CUDA requirements by @Maxusmusti in #658
Full Changelog: v0.11.1...v0.12.0