Tags: instructlab/training
Fix trust_remote_code and gradient checkpointing for custom models (#696)

* Fix: Add trust_remote_code=True for models with custom code

  - Add trust_remote_code=True to all AutoConfig/AutoTokenizer.from_pretrained() calls
  - Add torchrun path resolution (shutil.which with sys.executable fallback)
  - Pass trust_remote_code=True to base_model_args and VLM helper functions

  This fixes training failures for models like Nemotron that use remote code.

* Fix: Handle models without gradient checkpointing support

  Wrap gradient_checkpointing_enable() in try/except to handle models like NemotronH that don't support gradient checkpointing.

* Address reviewer feedback and fix ruff formatting

  - Narrow exception handling in is_gpt_oss_model to catch specific exceptions (OSError, ValueError) instead of bare Exception, and log the failure details for debugging
  - Add trust_remote_code=True to process_documents_for_pretraining() tokenizer loading for consistency with configure_tokenizer()
  - Replace the invalid fast_tokenizer kwarg with use_fast in tokenizer_utils.py setup_tokenizer()
  - Create a shared _enable_gradient_checkpointing_if_supported() helper on the Model base class, catching ValueError, NotImplementedError, and AttributeError; use it in both LigerModel and CausalLMModel
  - Improve the torchrun fallback to use sys.executable -m torch.distributed.run instead of assuming a sibling script exists
  - Fix ruff formatting for the AutoConfig.from_pretrained call

* Fix unit tests to expect trust_remote_code=True

  Update test assertions to expect the trust_remote_code=True parameter in AutoTokenizer.from_pretrained calls after adding this parameter to process_documents_for_pretraining.

* Fix ruff formatting: break long assertion line

* Make trust_remote_code configurable via flag and environment variable

  Instead of hardcoding trust_remote_code=True everywhere:

  1. Add a trust_remote_code field to TrainingArgs (default: False)
  2. Add a --trust_remote_code argparse flag to the subprocess CLI
  3. Support the TRUST_REMOTE_CODE=1 environment variable
  4. Thread the setting through Model, tokenizer, and config calls
  5. Remove the torchrun fallback: error out if torchrun is not found
  6. Remove the unnecessary try/except in is_gpt_oss_model
  7. Remove the redundant use_fast=True from tokenizer_utils

  The env var is exported by main() when the flag is set, so downstream calls (data_process, tokenizer_utils, gpt_oss_utils) automatically pick it up without needing explicit parameter threading.

* Document trust_remote_code in README

  Add trust_remote_code to the TrainingArgs table and document the TRUST_REMOTE_CODE environment variable in the environment variables section.

* Enable local mamba kernel pre-population for NemotronH models

  NemotronH has Mamba layers just like GraniteMoeHybrid and needs the same _use_local_mamba_kernels() call to avoid causal_conv1d_cuda import failures in torchrun subprocesses.

* Fix lint: revert torchrun shutil.which, remove unused imports, ruff format

  The torchrun-not-found issue was caused by the venv not being activated, not an installation problem. Revert to the plain 'torchrun' command. Remove the now-unused shutil and sys imports. Run ruff format on all modified files.

* Clarify trust_remote_code docs with security warning

* Add FP8 dequantization and requantization for Ministral VLM training

  Ministral-3-3B ships with FP8-quantized weights that include scalar parameters (weight_scale_inv, activation_scale) which FSDP rejects. This change dequantizes FP8 weights to bf16 after VLM extraction for training compatibility, preserves the original scales, and requantizes back to FP8 at checkpoint save time so saved checkpoints match the original FP8 format.

* Ruff formatting fixes

Co-authored-by: Claude Code <[email protected]>
Co-authored-by: Claude Sonnet 4.5 <[email protected]>
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: Mustafa Eyceoz <[email protected]>
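The shared gradient-checkpointing helper described in the commits above can be sketched roughly like this (the function name comes from the commit message; the boolean return convention and the warning text are assumptions):

```python
import logging

logger = logging.getLogger(__name__)


def enable_gradient_checkpointing_if_supported(model) -> bool:
    """Try to enable gradient checkpointing, tolerating models that lack it.

    Some architectures (the commit names NemotronH) raise when
    gradient_checkpointing_enable() is called. Per the review feedback,
    only the narrow set of expected exceptions is caught, not a bare
    Exception.
    """
    try:
        model.gradient_checkpointing_enable()
        return True
    except (ValueError, NotImplementedError, AttributeError) as err:
        logger.warning("Model does not support gradient checkpointing: %s", err)
        return False
```

Callers can then branch on the return value instead of letting an unsupported model abort training.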
Fix Gemma 3 SFT training by detecting dual-registered VLM configs (#695)

Gemma 3 (google/gemma-3-4b-it) is dual-registered in transformers: Gemma3Config maps to both MODEL_FOR_CAUSAL_LM_MAPPING and MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING. This caused is_vlm_with_causal_lm() to return False (because the config IS in the CausalLM mapping), so the model was loaded via AutoModelForCausalLM, which resolves to the full Gemma3ForConditionalGeneration VLM class, not a text-only CausalLM. The VLM forward pass then crashed during FSDP-wrapped distributed training because the text-only SFT training loop doesn't handle the vision tower.

The fix checks what class MODEL_FOR_CAUSAL_LM_MAPPING actually resolves to. If it is a ForConditionalGeneration class (a VLM), the model is treated as needing backbone extraction, the same as the Ministral/Mistral3 models.

Tested with model_validation.py: both gemma-3-4b-it and gemma-3n-E4B-it now train to loss 0.0000 on a 1000-sample overfit dataset across 8x A100s.

Signed-off-by: Oleg Silkin <[email protected]>
add support for qwen3.5 vl model (#693)

* add support for qwen3.5 vl model
* enable detection of VLM models and allow using non-Hopper GPUs for GPT-OSS
* add broader vlm support
* add general vlm support
* support gemma3n
* address coderabbit review comments

  - Fix eos_token_id truthiness check (0 is a valid token id)
  - Add isinstance guard for RopeParameters in mrope detection
  - Add hasattr fallback for non-dict rope objects

* fix CI: import sorting, pylint, and test mocks

  - Fix isort ordering in vlm_utils.py and model.py
  - Fix pylint: use 'from torch import nn', mark unused-argument
  - Mock needs_sdpa and get_module_class_from_name in unit tests

* fix ruff formatting for CI version (0.12.11)

* address remaining review comments

  - accelerator.py: fall back to a warning plus the default wrap policy instead of raising ValueError when no _no_split_modules resolve; try the underlying HF model as a secondary target
  - model.py: use torch.cuda.current_device() instead of a hardcoded 0
  - vlm_utils.py: add a trust_remote_code param (default False) to all config-loading functions; use init_empty_weights for the CausalLM shell; copy quantization metadata from the VLM

* fix mamba kernel comments and exception handling

  - Remove fabricated claim about C API incompatibility
  - Accurately describe the issue as a PyTorch/CUDA ABI mismatch
  - Broaden exception handling to catch AttributeError

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
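The mrope-detection fixes above (isinstance guard for dicts, hasattr/getattr fallback for rope objects) can be sketched as follows. The function name and the "rope_type"/"mrope" key and value are assumptions for illustration, not the library's exact API:

```python
def uses_mrope(rope_params) -> bool:
    """Detect multimodal RoPE from a model's rope parameters.

    Mirrors the review fixes above: handle plain dicts behind an
    isinstance guard, and fall back to attribute access for non-dict
    rope objects (e.g. a RopeParameters instance).
    """
    if rope_params is None:
        return False
    if isinstance(rope_params, dict):
        return rope_params.get("rope_type") == "mrope"
    # Non-dict rope objects: getattr fallback instead of key access
    return getattr(rope_params, "rope_type", None) == "mrope"
```

The same pattern applies to the eos_token_id fix: compare against None rather than relying on truthiness, since token id 0 is valid.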
Add MLflow support and expose logging configuration in TrainingArgs (#680)

* add support for mlflow
* fix formatting changes
* Add tensorboard_log_dir to TrainingArgs for configurable TensorBoard logging

  - Add tensorboard_log_dir field to TrainingArgs in config.py
  - Update setup_metric_logger to use tensorboard_log_dir when provided
  - Add a CLI argument for tensorboard_log_dir
  - Wire tensorboard_log_dir through run_training() to the subprocess command

  This allows users to specify a custom directory for TensorBoard logs, defaulting to output_dir if not specified.

* Address PR review feedback

  - Replace defensive getattr() with direct attribute access in main_ds.py, since args are guaranteed to exist from argparse defaults
  - Remove the unused log_dir parameter from MLflowHandler
  - Add debug logging for non-numeric metrics skipped by MLflowHandler

* removes generic `run_name` and `logger_type` kwargs
* review comments
* something something mlflow active runs
* review comments
* coderabbit
* adds install targets for logging backends
* add targets for loggers
* messaging
* comments
* interim changes

Co-authored-by: Claude Opus 4.5 <[email protected]>
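The metric filtering that the review feedback asked of MLflowHandler can be sketched like this. The function name is hypothetical; only the behavior (forward numeric values, debug-log skipped ones) comes from the commit:

```python
import logging
import numbers

logger = logging.getLogger(__name__)


def filter_numeric_metrics(metrics: dict) -> dict:
    """Keep only numeric metric values (MLflow's log_metrics accepts
    floats); skipped entries are debug-logged, as the review feedback
    above asked the MLflowHandler to do."""
    numeric = {}
    for name, value in metrics.items():
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, numbers.Number) and not isinstance(value, bool):
            numeric[name] = float(value)
        else:
            logger.debug("Skipping non-numeric metric %r: %r", name, value)
    return numeric
```

The filtered dict can then be passed to mlflow.log_metrics() without tripping over string- or bool-valued entries.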
Exposes API for processing pretraining data (#672) This commit enables the data processing code to create pre-training style datasets. The training loop is also updated to ingest pretraining-style datasets, where documents are chunked by some `block_size` and the chunks are then treated as independent and fully-unmasked samples.
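The chunking scheme described above can be sketched in a few lines. This is a simplified illustration: the function name is hypothetical, and dropping the trailing partial block is an assumption here, not necessarily how the library handles remainders:

```python
def chunk_for_pretraining(token_ids: list, block_size: int) -> list:
    """Chunk a flat token stream into fixed-size blocks of block_size
    tokens; each block is then treated as an independent, fully-unmasked
    sample. The trailing partial block is dropped (simplifying
    assumption)."""
    n_blocks = len(token_ids) // block_size
    return [
        token_ids[i * block_size : (i + 1) * block_size] for i in range(n_blocks)
    ]
```

Because every position in a block contributes to the loss (fully unmasked), no per-sample label masking is needed for these datasets.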
fix(torchrun): Omit empty arguments and correct nproc_per_node type (#661)

* fix(torchrun): Omit empty arguments and correct nproc_per_node type

  The command generation logic is updated to dynamically build the torchrun command, excluding arguments that are empty or None. This prevents them from overriding environment variables, ensuring that torchrun can correctly inherit its configuration. An exception is made for integer arguments where 0 is a valid value.

  Additionally, the nproc_per_node argument type has been changed from int to str to support special values accepted by PyTorch, such as 'auto', 'gpu', and 'cpu'.

  Reference: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L77-L88

* only dynamically add torchrun args & change rdzv_id type to str
* fix smoke tests
* Enable both dtypes str, int for nproc_per_node, rdzv_id
* Use Python 3.11 style for the pydantic model
* add all torchrun args and validate them
* Remove non-required dependencies
* update datatypes only
* replace _ with - when passing torchrun args
* make nproc_per_node only accept 'gpu' or an int
* add master_{addr, port} validate args
* check for not-set or empty rdzv endpoint
* fix formatting error
* Update src/instructlab/training/config.py
* Update tests/smoke/test_train.py
* Update src/instructlab/training/main_ds.py
* fixes indentation
* formatting
* add standalone as the fallback when neither master_addr nor rdzv_endpoint is provided
* clarify rdzv-backend arg

Signed-off-by: Saad Zaher <[email protected]>
Signed-off-by: Oleg Silkin <[email protected]>
Co-authored-by: Oleg Silkin <[email protected]>
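The argument-building rules described above (skip None and empty strings so torchrun inherits from the environment, keep integer 0, and turn underscores into hyphens) can be sketched as a small helper. The function name is hypothetical:

```python
def build_torchrun_args(options: dict) -> list:
    """Build torchrun CLI arguments from an options mapping.

    None and empty-string values are omitted so torchrun can inherit
    those settings from the environment; integer 0 survives the check
    (0 == "" is False in Python); underscores in option names become
    hyphens on the command line.
    """
    args = []
    for name, value in options.items():
        if value is None or value == "":
            continue
        args.extend([f"--{name.replace('_', '-')}", str(value)])
    return args
```

With this scheme, passing nproc_per_node="gpu" works alongside plain integer values, matching the str-or-int typing the PR settles on.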