Tags: instructlab/training

v0.16.0

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Relax numba pin from >=0.62.0 to >=0.61.2 (#702)

Resolves dependency conflict with vllm which pins numba==0.61.2.
No 0.62-specific APIs are used.

v0.15.2

Fix trust_remote_code and gradient checkpointing for custom models (#696)

* Fix: Add trust_remote_code=True for models with custom code

- Add trust_remote_code=True to all AutoConfig/AutoTokenizer.from_pretrained() calls
- Add torchrun path resolution (shutil.which with sys.executable fallback)
- Pass trust_remote_code=True to base_model_args and VLM helper functions

This fixes training failures for models like Nemotron that use remote code.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* Fix: Handle models without gradient checkpointing support

Wrap gradient_checkpointing_enable() in try/except to handle models
like NemotronH that don't support gradient checkpointing.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
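
The try/except pattern described above can be sketched as follows. This is an illustrative stand-in (the function name here is hypothetical); the repo's actual helper is `_enable_gradient_checkpointing_if_supported` on the Model base class, introduced in a later commit of this PR.

```python
import logging

logger = logging.getLogger(__name__)

def enable_gradient_checkpointing_if_supported(model) -> bool:
    """Try to enable gradient checkpointing, tolerating unsupported models."""
    try:
        model.gradient_checkpointing_enable()
        return True
    except (ValueError, NotImplementedError, AttributeError) as exc:
        # Models like NemotronH raise here instead of silently no-op'ing.
        logger.warning("gradient checkpointing unsupported: %s", exc)
        return False
```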

* Address reviewer feedback and fix ruff formatting

- Narrow exception handling in is_gpt_oss_model to catch specific
  exceptions (OSError, ValueError) instead of bare Exception, and
  log the failure details for debugging
- Add trust_remote_code=True to process_documents_for_pretraining()
  tokenizer loading for consistency with configure_tokenizer()
- Replace invalid fast_tokenizer kwarg with use_fast in
  tokenizer_utils.py setup_tokenizer()
- Create shared _enable_gradient_checkpointing_if_supported() helper
  on Model base class, catching ValueError, NotImplementedError, and
  AttributeError; use it in both LigerModel and CausalLMModel
- Improve torchrun fallback to use sys.executable -m
  torch.distributed.run instead of assuming a sibling script exists
- Fix ruff formatting for AutoConfig.from_pretrained call

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* Fix unit tests to expect trust_remote_code=True

Update test assertions to expect trust_remote_code=True parameter
in AutoTokenizer.from_pretrained calls after adding this parameter
to process_documents_for_pretraining.

* Fix ruff formatting: break long assertion line

* Make trust_remote_code configurable via flag and environment variable

Instead of hardcoding trust_remote_code=True everywhere:

1. Add trust_remote_code field to TrainingArgs (default: False)
2. Add --trust_remote_code argparse flag to subprocess CLI
3. Support TRUST_REMOTE_CODE=1 environment variable
4. Thread the setting through Model, tokenizer, and config calls
5. Remove torchrun fallback — error if torchrun is not found
6. Remove unnecessary try/except in is_gpt_oss_model
7. Remove redundant use_fast=True from tokenizer_utils

The env var is exported by main() when the flag is set, so
downstream calls (data_process, tokenizer_utils, gpt_oss_utils)
automatically pick it up without needing explicit parameter threading.
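
The flag/env-var precedence described above can be sketched like this. The function name and the exact set of accepted env values are assumptions for illustration; the commit only documents `TRUST_REMOTE_CODE=1`.

```python
import os

def trust_remote_code_enabled(cli_flag: bool = False) -> bool:
    """Resolve the effective trust_remote_code setting.

    The explicit flag wins; otherwise the TRUST_REMOTE_CODE env var
    (exported by main() when the flag is set) is consulted.
    """
    env_value = os.environ.get("TRUST_REMOTE_CODE", "").strip()
    return cli_flag or env_value == "1"
```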

* Document trust_remote_code in README

Add trust_remote_code to the TrainingArgs table and document the
TRUST_REMOTE_CODE environment variable in the environment variables
section.

* Enable local mamba kernel pre-population for NemotronH models

NemotronH has Mamba layers just like GraniteMoeHybrid and needs
the same _use_local_mamba_kernels() call to avoid causal_conv1d_cuda
import failures in torchrun subprocesses.

* Fix lint: revert torchrun shutil.which, remove unused imports, ruff format

The torchrun-not-found issue was caused by the venv not being
activated, not an installation problem. Revert to plain 'torchrun'
command. Remove now-unused shutil and sys imports. Run ruff format
on all modified files.

* Clarify trust_remote_code docs with security warning

* Add FP8 dequantization and requantization for Ministral VLM training

Ministral-3-3B ships with FP8 quantized weights that include scalar
parameters (weight_scale_inv, activation_scale) which FSDP rejects.
This change dequantizes FP8 weights to bf16 after VLM extraction for
training compatibility, preserves the original scales, and requantizes
back to FP8 at checkpoint save time so saved checkpoints match the
original FP8 format.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
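
The dequantize/requantize round trip described above amounts to applying the preserved scale and its inverse. A pure-Python sketch under the assumption of a single scalar scale per weight; the real code operates on torch tensors and converts to bf16:

```python
def dequantize_fp8(quantized, weight_scale_inv):
    # Recover higher-precision values by applying the stored inverse scale.
    return [q * weight_scale_inv for q in quantized]

def requantize_fp8(weights, weight_scale_inv):
    # Inverse transform at checkpoint-save time, reusing the preserved
    # scale so the saved checkpoint matches the original FP8 format.
    return [w / weight_scale_inv for w in weights]
```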

* Ruff formatting fixes

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Code <[email protected]>
Co-authored-by: Claude Sonnet 4.5 <[email protected]>
Co-authored-by: Mustafa Eyceoz <[email protected]>

v0.15.1

Fix Gemma 3 SFT training by detecting dual-registered VLM configs (#695)

Gemma 3 (google/gemma-3-4b-it) is dual-registered in transformers:
Gemma3Config maps to both MODEL_FOR_CAUSAL_LM_MAPPING and
MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING. This caused is_vlm_with_causal_lm()
to return False (because the config IS in the CausalLM mapping), so the
model was loaded via AutoModelForCausalLM — which resolves to the full
Gemma3ForConditionalGeneration VLM class, not a text-only CausalLM.

The VLM forward pass then crashed during FSDP-wrapped distributed training
because the text-only SFT training loop doesn't handle the vision tower.

The fix checks what class MODEL_FOR_CAUSAL_LM_MAPPING actually resolves to.
If it's a ForConditionalGeneration class (a VLM), the model is treated as
needing backbone extraction, same as Ministral/Mistral3 models.

Tested with model_validation.py: both gemma-3-4b-it and gemma-3n-E4B-it
now train to loss 0.0000 on a 1000-sample overfit dataset across 8x A100s.

Signed-off-by: Oleg Silkin <[email protected]>
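
The dual-registration check described above can be sketched as a class-name inspection. This is a simplified stand-in that takes the already-resolved class as input; the actual fix looks the class up in transformers' MODEL_FOR_CAUSAL_LM_MAPPING.

```python
def resolves_to_vlm(causal_lm_cls) -> bool:
    """Detect a dual-registered VLM from the resolved CausalLM class.

    If the class MODEL_FOR_CAUSAL_LM_MAPPING resolves to is a
    ForConditionalGeneration class, the model is really a VLM and
    needs backbone extraction, same as Ministral/Mistral3.
    """
    return (
        causal_lm_cls is not None
        and causal_lm_cls.__name__.endswith("ForConditionalGeneration")
    )
```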

v0.15.0

add support for qwen3.5 vl model (#693)

* add support for qwen3.5 vl model

* enable detection of VLM models and allow using non-Hopper GPUs for GPT-OSS

* add broader vlm support

* add general vlm support

* support gemma3n

* address coderabbit review comments

- Fix eos_token_id truthiness check (0 is valid)
- Add isinstance guard for RopeParameters in mrope detection
- Add hasattr fallback for non-dict rope objects

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
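
The eos_token_id fix in the list above boils down to comparing against None instead of relying on truthiness. A minimal sketch (the function name is illustrative):

```python
def resolve_eos_token_id(config_eos_token_id, tokenizer_eos_token_id):
    # Token id 0 is a valid eos_token_id, so test against None explicitly;
    # a truthiness check ("if config_eos_token_id:") would wrongly
    # discard it and fall through to the fallback.
    if config_eos_token_id is not None:
        return config_eos_token_id
    return tokenizer_eos_token_id
```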

* fix CI: import sorting, pylint, and test mocks

- Fix isort ordering in vlm_utils.py and model.py
- Fix pylint: use 'from torch import nn', mark unused-argument
- Mock needs_sdpa and get_module_class_from_name in unit tests

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix ruff formatting for CI version (0.12.11)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* address remaining review comments

- accelerator.py: fall back to warning + default wrap policy instead of
  ValueError when no _no_split_modules resolve; try underlying HF model
  as secondary target
- model.py: use torch.cuda.current_device() instead of hardcoded 0
- vlm_utils.py: add trust_remote_code param (default False) to all
  config-loading functions; use init_empty_weights for CausalLM shell;
  copy quantization metadata from VLM

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix mamba kernel comments and exception handling

- Remove fabricated claim about C API incompatibility
- Accurately describe the issue as PyTorch/CUDA ABI mismatch
- Broaden exception handling to catch AttributeError

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

v0.14.2

Adds validation loss + exposes it in the API (#685)

* adds validation loss

* linting + testing

* more linting

* more linting

* more testing

v0.14.1

fix _no_split_modules subscript error for transformers v5 (#683)

v0.14.0

Add MLflow support and expose logging configuration in TrainingArgs (#680)

* add support for mlflow

* fix formatting changes

* Add tensorboard_log_dir to TrainingArgs for configurable TensorBoard logging

- Add tensorboard_log_dir field to TrainingArgs in config.py
- Update setup_metric_logger to use tensorboard_log_dir when provided
- Add CLI argument for tensorboard_log_dir
- Wire tensorboard_log_dir through run_training() to subprocess command

This allows users to specify a custom directory for TensorBoard logs,
defaulting to output_dir if not specified.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
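
The fallback behavior described above is a simple default resolution. A minimal sketch; the parameter names mirror the TrainingArgs fields but the function itself is hypothetical:

```python
def resolve_tensorboard_log_dir(tensorboard_log_dir, output_dir):
    # Use the custom TensorBoard directory when provided; otherwise
    # default to output_dir, as described above.
    return tensorboard_log_dir if tensorboard_log_dir else output_dir
```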

* Address PR review feedback

- Replace defensive getattr() with direct attribute access in main_ds.py
  since args are guaranteed to exist from argparse defaults
- Remove unused log_dir parameter from MLflowHandler
- Add debug logging for non-numeric metrics skipped by MLflowHandler

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* removes generic `run_name` and `logger_type` kwargs

* review comments

* something something mlflow active runs

* review comments

* coderabbit

* adds install targets for logging backends

* add targets for loggers

* messaging

* comments

* interim changes

---------

Co-authored-by: Claude Opus 4.5 <[email protected]>

v0.13.0

Exposes API for processing pretraining data (#672)

This commit enables the data processing code to create pre-training style datasets. The training loop is also updated to ingest pretraining-style datasets, where documents are chunked by some `block_size` and the chunks are then treated as independent and fully-unmasked samples.
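
The chunking scheme described above can be sketched as follows. A simplified stand-in, assuming documents arrive pre-tokenized; the real pipeline additionally handles labels and masking for the fully-unmasked samples.

```python
def chunk_documents(token_ids_per_doc, block_size):
    """Chunk tokenized documents into fixed-size pretraining samples.

    Each chunk is treated as an independent, fully-unmasked sample.
    Ragged tails shorter than block_size are dropped for simplicity.
    """
    chunks = []
    for ids in token_ids_per_doc:
        for start in range(0, len(ids), block_size):
            chunk = ids[start:start + block_size]
            if len(chunk) == block_size:
                chunks.append(chunk)
    return chunks
```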

v0.12.1

fix(torchrun): Omit empty arguments and correct nproc_per_node type (#661)

* fix(torchrun): Omit empty arguments and correct nproc_per_node type

The command generation logic is updated to dynamically
build the torchrun command, excluding arguments that
are empty or None. This prevents them from overriding
environment variables, ensuring that torchrun can
correctly inherit its configuration. An exception is
made for integer arguments where 0 is a valid value.

Additionally, the nproc_per_node argument type has been
changed from int to str to support special values
accepted by PyTorch, such as 'auto', 'gpu', and 'cpu'.

Reference: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L77-L88

Signed-off-by: Saad Zaher <[email protected]>
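
The filtering logic described above can be sketched like this. An illustrative helper (name and signature are assumptions): None and empty strings are skipped so torchrun can inherit from the environment, integer 0 is kept as a valid value, and underscores become dashes per a later commit in this PR.

```python
def build_torchrun_args(options: dict) -> list[str]:
    """Build torchrun CLI args, omitting empty or unset values."""
    args = []
    for name, value in options.items():
        if value is None or value == "":
            continue  # note: 0 == "" is False, so integer 0 survives
        args.append(f"--{name.replace('_', '-')}={value}")
    return args
```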

* only dynamically add torchrun args & change rdzv_id type to str

Signed-off-by: Saad Zaher <[email protected]>

* fix smoke tests

Signed-off-by: Saad Zaher <[email protected]>

* Enable both dtypes str, int for nproc_per_node, rdzv_id

Signed-off-by: Saad Zaher <[email protected]>

* Use Python 3.11 style for pydantic model

Signed-off-by: Saad Zaher <[email protected]>

* add all torchrun args and validate them

Signed-off-by: Saad Zaher <[email protected]>

* Remove non-required dependencies

Signed-off-by: Saad Zaher <[email protected]>

* update datatypes only

Signed-off-by: Saad Zaher <[email protected]>

* replace _ with - when passing torchrun args

Signed-off-by: Saad Zaher <[email protected]>

* make nproc_per_node to only accept gpu or int

Signed-off-by: Saad Zaher <[email protected]>

* add master_{addr, port} validate args

Signed-off-by: Saad Zaher <[email protected]>

* check for not set or empty rdzv endpoint

Signed-off-by: Saad Zaher <[email protected]>

* fix formatting error

Signed-off-by: Saad Zaher <[email protected]>

* Update src/instructlab/training/config.py

Signed-off-by: Saad Zaher <[email protected]>

* Update tests/smoke/test_train.py

Signed-off-by: Saad Zaher <[email protected]>

* Update src/instructlab/training/main_ds.py

Signed-off-by: Saad Zaher <[email protected]>

* fixes indentation

Signed-off-by: Oleg Silkin <[email protected]>

* formatting

* add standalone as the fallback when neither master_addr nor rdzv_endpoint are provided

Signed-off-by: Oleg Silkin <[email protected]>

* clarify rdzv-backend arg

---------

Signed-off-by: Saad Zaher <[email protected]>
Signed-off-by: Oleg Silkin <[email protected]>
Co-authored-by: Oleg Silkin <[email protected]>

v0.12.0

Add kernels>0.9.0 to CUDA requirements (#658)

Signed-off-by: Mustafa Eyceoz <[email protected]>