Releases: instructlab/training
v0.16.0 - On-Demand Checkpointing
instructlab-training 0.16.0 Release Notes
On-Demand Async Checkpointing for Preemption Recovery (#686)
Adds signal-driven checkpoint-and-exit for distributed training jobs running on OpenShift AI, KubeFlow, or multi-node bare metal, where jobs can be preempted with limited grace periods before SIGKILL.
How it works: When on_demand_checkpointing=True is set in TrainingArgs, the training process installs signal handlers for SIGTERM, SIGINT, SIGUSR1, and other termination signals. On signal receipt, a trigger file is written to /dev/shm. Worker processes check for this trigger at five synchronization points per training step (before/after forward, backward, and optimizer steps), ensuring response within one fwd+bwd cycle (~1-2s). When detected, all ranks collectively save a full-state distributed checkpoint (model + optimizer + LR scheduler) and exit gracefully.
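The signal-to-trigger-file handoff described above can be sketched as follows. All names here (trigger path, function names) are illustrative stand-ins rather than the library's actual API, and the system temp dir replaces the real /dev/shm location so the sketch runs anywhere:

```python
import os
import signal
import tempfile

# Illustrative trigger location; the release notes say the real
# implementation writes under /dev/shm.
TRIGGER_PATH = os.path.join(tempfile.gettempdir(), "ckpt_trigger")

def install_handlers() -> None:
    """On SIGTERM/SIGINT/SIGUSR1, write a trigger file instead of exiting."""
    def _handler(signum, frame):
        # Keep the handler trivial: just touch the trigger file.
        with open(TRIGGER_PATH, "w") as f:
            f.write(str(signum))
    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGUSR1):
        signal.signal(sig, _handler)

def should_checkpoint() -> bool:
    """Polled at each per-step synchronization point; a cheap file stat."""
    return os.path.exists(TRIGGER_PATH)
```

Keeping the handler to a single file write and letting the training loop poll for the file means every rank observes the request at its next synchronization point, which is what bounds the response time to roughly one forward+backward cycle.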
Numba Compatibility Fix (#702)
Relaxes the numba version pin from >=0.62.0 to >=0.61.2, resolving dependency conflicts when installing alongside vLLM (which pins numba==0.61.2). No 0.62-specific APIs are used — all numba usage (@njit, int64) is compatible with 0.61.x.
What's Changed
- Add on-demand full-state checkpointing for OpenShift AI / KubeFlow preemption by @RobotSail in #686
- build(deps): Bump astral-sh/setup-uv from 4 to 7 by @dependabot[bot] in #692
- build(deps): Bump aws-actions/configure-aws-credentials from 4.2.1 to 6.0.0 by @dependabot[bot] in #689
- Relax numba pin from >=0.62.0 to >=0.61.2 by @Maxusmusti in #702
Full Changelog: v0.15.2...v0.16.0
v0.15.2 - Further Expanding VLM Compatibility
What's Changed
- Fix trust_remote_code and gradient checkpointing for custom models by @RobotSail and @Maxusmusti in #696
Full Changelog: v0.15.1...v0.15.2
v0.15.1 - Expanded Text-Data VLM / Multi-Modal Training Support
What's Changed
- Fix Gemma 3 SFT training by detecting dual-registered VLM configs by @RobotSail in #695
Full Changelog: v0.15.0...v0.15.1
v0.15.0 - Qwen3.5 VL Model Support
What's New
Features
- Vision-Language Model (VLM) Support for Text-Only Training (#693)
  - Added automatic detection and loading of vision-language models for text-only training
  - New `vlm_utils.py` module with utilities for identifying and extracting CausalLM text backbones from VLM wrappers
  - Support for two VLM loading strategies: extracting the text backbone when a CausalLM sub-model exists, or direct VLM loading when no CausalLM variant is available
  - Improved tokenizer/text-config reconciliation for VLMs where `vocab_size` lives under `text_config`
- Mixed Attention Handling for VLMs (#693)
  - Models with `timm` vision towers now use per-component attention: `eager` for vision, `flash_attention_2` or `sdpa` for text
  - Automatic SDPA fallback for M-RoPE models (e.g. Qwen3.5 VL), which are incompatible with Flash Attention 2
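The two loading strategies above amount to a simple dispatch: use the CausalLM backbone when the wrapper exposes one, otherwise fall back to the VLM itself. A minimal sketch with stand-in classes — the attribute name `language_model` is an assumption for illustration, and real HF wrappers vary per architecture:

```python
class TextBackbone:
    """Stand-in for a CausalLM text model."""

class VLMWrapper:
    """Stand-in for a VLM wrapper that exposes a text sub-model.
    The `language_model` attribute name is assumed for this sketch."""
    def __init__(self):
        self.language_model = TextBackbone()

def load_for_text_training(model):
    """Strategy 1: extract the CausalLM backbone when one exists.
    Strategy 2: otherwise, train through the VLM wrapper directly."""
    backbone = getattr(model, "language_model", None)
    return backbone if backbone is not None else model
```

The duck-typed `getattr` probe is what makes the fallback automatic: models without a text sub-model simply pass through unchanged.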
Bug Fixes
- FSDP Wrap Policy Robustness (#693)
  - Fixed `_no_split_modules` resolution to handle models that declare module names for architectures that are not loaded (e.g. vision blocks when loading only the CausalLM)
  - FSDP wrap policy now resolves all declared module names against both the wrapper and the underlying HF model, filtering out unresolvable entries
- GPT-OSS Attention Capability Detection (#693)
  - `vllm-flash-attn3` is now gated behind a Hopper (SM 9.0+) GPU capability check, falling back to `eager` on older hardware
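The `_no_split_modules` fix boils down to resolving declared class names against what is actually loaded and dropping the rest. A hedged, torch-free sketch, where `modules` stands in for iterating `model.modules()` over both the wrapper and the underlying HF model:

```python
def resolve_no_split_modules(declared_names, modules):
    """Resolve declared `_no_split_modules` class names against the
    classes actually instantiated in the loaded model(s). Names that
    match nothing -- such as vision blocks when only the text backbone
    was loaded -- are filtered out rather than raising."""
    available = {type(m).__name__: type(m) for m in modules}
    resolved = {available[n] for n in declared_names if n in available}
    skipped = [n for n in declared_names if n not in available]
    return resolved, skipped

class TextBlock:
    """Demo class standing in for a loaded decoder layer."""
```

Filtering instead of raising is the key behavior: a name list written for the full VLM no longer breaks wrap-policy construction when only part of the architecture is instantiated.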
Improvements
- Local Mamba Kernel Preference (#693)
  - GraniteMoeHybrid models now pre-populate the Hub kernel cache with locally installed `mamba_ssm` and `causal_conv1d` to avoid PyTorch/CUDA ABI mismatches with Hub-provided kernel builds
What's Changed
- add support for qwen3.5 vl model by @RobotSail in #693
Full Changelog: v0.14.2...v0.15.0
v0.14.2 - Validation Loss and Transformers V4 Backwards Compatibility
What's Changed
- Add backwards compatibility for transformers v4.57 by @Maxusmusti in #684
- Adds validation loss and exposes it in the API by @RobotSail in #685
Full Changelog: v0.14.1...v0.14.2
v0.14.1 - Correct FSDP Config Behavior for Transformers v5
What's Changed
- fix _no_split_modules subscript error for transformers v5 by @Maxusmusti in #683
Full Changelog: v0.14.0...v0.14.1
v0.14.0 - MLflow Support & Transformers v5 Compatibility
What's New
Features
- MLflow Logging Backend (#680)
  - Added `MLflowHandler` class for logging training metrics to MLflow
  - New `TrainingArgs` fields: `mlflow_tracking_uri`, `mlflow_experiment_name`, `mlflow_run_name`
  - Added `wandb_project`, `wandb_entity`, `wandb_run_name` fields for W&B configuration
  - Added `tensorboard_log_dir` field for configurable TensorBoard log directory
  - New optional install targets: `requirements-mlflow.txt`, `requirements-wandb.txt`, `requirements-tensorboard.txt`
- Transformers v5 Compatibility (#681)
  - Updated tokenizer API calls to use `extra_special_tokens` instead of `additional_special_tokens`
  - Suppressed verbose httpx HTTP request logs from huggingface_hub
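One way to picture the new logging configuration is that each backend activates only when its key field is set, which is why each backend's dependencies can stay an optional install. A sketch using an illustrative stand-in for the relevant `TrainingArgs` fields (the real fields live on instructlab-training's `TrainingArgs`, which has many more):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LoggingConfig:
    """Illustrative stand-in mirroring the new TrainingArgs logging
    fields named in the release notes above."""
    mlflow_tracking_uri: Optional[str] = None
    mlflow_experiment_name: Optional[str] = None
    wandb_project: Optional[str] = None
    tensorboard_log_dir: Optional[str] = None

def active_backends(cfg: LoggingConfig) -> List[str]:
    """Activate a backend only when its key field is set."""
    backends = []
    if cfg.mlflow_tracking_uri:
        backends.append("mlflow")
    if cfg.wandb_project:
        backends.append("wandb")
    if cfg.tensorboard_log_dir:
        backends.append("tensorboard")
    return backends
```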
Bug Fixes
- HYBRID_SHARD Failure Fix (#682)
  - Added detection for when `world_size < num_devices_per_node` in FSDP configuration
  - Automatically falls back to `FULL_SHARD` with a warning when `HYBRID_SHARD` would fail
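The HYBRID_SHARD guard can be sketched as a pure function of the two sizes. HYBRID_SHARD shards within each node and replicates across nodes, so it assumes at least one full node of ranks; a 4-rank job on an 8-GPU node would fail. Strategy names here mirror torch FSDP's `ShardingStrategy` members, but the function itself is illustrative:

```python
import warnings

def pick_sharding_strategy(world_size: int, num_devices_per_node: int,
                           requested: str = "HYBRID_SHARD") -> str:
    """Fall back to FULL_SHARD when there are fewer ranks than devices
    on a single node, since HYBRID_SHARD's intra-node sharding group
    cannot be formed. Illustrative sketch, not the library's API."""
    if requested == "HYBRID_SHARD" and world_size < num_devices_per_node:
        warnings.warn(
            f"world_size ({world_size}) < num_devices_per_node "
            f"({num_devices_per_node}); falling back to FULL_SHARD"
        )
        return "FULL_SHARD"
    return requested
```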
Development
- Tox-UV Integration (#676)
  - Added `tox-uv` as a tox requirement with `uv-venv-runner`
  - Updated GitHub workflows to use `uv` for package installation
  - Replaced `pip install` with `uv pip install` in CI workflows
What's Changed
- adds integration for tox-uv and updates workflows to use tox-uv by @RobotSail in #676
- Add transformers v5 compatibility by @Maxusmusti in #681
- Fix HYBRID_SHARD failure when world_size < available GPUs by @rtj1 in #682
- Add MLflow support and expose logging configuration in TrainingArgs by @RobotSail in #680
New Contributors
Files Changed
18 files changed with 482 insertions and 83 deletions:
- Core training modules: `logger.py`, `config.py`, `accelerator.py`, `data_process.py`, `tokenizer_utils.py`, `main_ds.py`
- New requirements files for optional logging backends
- Updated CI workflows and tox configuration
Full Changelog: v0.13.0...v0.14.0
v0.13.0 - Pretraining Support & Optimizer Configuration
What's New
Features
- Pretraining Data Processing API (#672)
  - Added a new API for processing pretraining-style datasets
  - Documents are now chunked by a configurable `block_size`
  - Chunks are treated as independent, fully-unmasked samples
  - Updated the training loop to ingest pretraining-style datasets
  - Includes comprehensive test coverage (`test_pretraining_data_process.py`, `test_pretraining_mode.py`, `test_pretraining_sampler.py`)
- AdamW Optimizer Configuration (#674)
  - Exposed `weight_decay`, `betas`, and `eps` parameters in `TrainingArgs`
  - Users can now tune AdamW hyperparameters through the `run_training()` API
  - Provides more control over optimizer behavior
- Granite 4 Model Support (#669)
  - Added support for Granite 4 models as Mixture of Experts (MoE) models in training
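The pretraining chunking behavior can be sketched as follows: concatenate tokenized documents, then cut fixed-size blocks, each of which becomes an independent, fully-unmasked sample. Dropping the trailing partial block is an assumption of this sketch, not necessarily what the library does:

```python
from typing import List

def chunk_for_pretraining(docs: List[List[int]], block_size: int) -> List[List[int]]:
    """Concatenate tokenized documents and split into fixed-size blocks.
    Every token in a block contributes to the loss (fully unmasked);
    remainder handling here (drop the tail) is illustrative only."""
    flat = [tok for doc in docs for tok in doc]
    n_blocks = len(flat) // block_size
    return [flat[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
```

Because blocks ignore document boundaries, a block may span two documents; this is the standard trade-off pretraining pipelines make for dense, uniform-length batches.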
Bug Fixes
- Process Timing Fix (#675)
  - Fixed a race condition where a subprocess had not finished by the time its output was read
- Variable Access Fix (#668)
  - Fixed a stray invalid variable access bug
Dependencies
- Build Dependency Update (#670)
  - Updated the hynek build dependency
Files Changed
17 files changed with 1,642 insertions and 52 deletions:
- Core training modules: `data_process.py`, `main_ds.py`, `sampler.py`, `model.py`, `config.py`
- New test suites for pretraining functionality
- Updated README with new capabilities
Full Changelog
All Changes:
- 574f946 Exposes API for processing pretraining data (#672)
- 638a753 fixes bug where process isn't completed by the time the process gets read (#675)
- c495035 Expose AdamW optimizer parameters in training API (#674)
- 3d05302 Handle granite 4 as MoE models in training (#669)
- 781c36f fixes stray invalid variable access bug (#668)
- 529c2f7 bumps hynek build dep (#670)
Full Diff: v0.12.1...v0.13.0
v0.12.1 - Granite 4 support, plus extended env var and torchrun arg support
What's Changed
- Update requirements-cuda.txt to increase liger-kernel minimum by @Maxusmusti in #659
- Adds mamba-ssm[causal-conv1d] to CUDA requirements by @RobotSail in #663
- Removes Numpy version cap by @RobotSail in #664
- fix(torchrun): Omit empty arguments and correct nproc_per_node type by @szaher in #661
New Contributors
Full Changelog: v0.12.0...v0.12.1
v0.12.0 - GPT-OSS Support
Full fine-tuning now supports gpt-oss models, along with minor bug fixes that ensure correct loss calculation at higher gradient accumulation settings.
What's Changed
- Disable workflow runs on forks by default by @fynnsu in #632
- Adding GPT OSS Support by @Maxusmusti in #646
- Update numpy from <2.0 to <2.3 by @Maxusmusti in #656
- Add kernels>0.9.0 to CUDA requirements by @Maxusmusti in #658
Full Changelog: v0.11.1...v0.12.0