Changelog
Track all notable changes, bug fixes, and improvements to AITraining.
2026-01-12 (v0.0.26)
Bug Fixes
- Fix tool_calls content duplication in training data
- Fix tokenizer settings and turn marker validation
- Fix pre-tokenization for TRL 0.26 compatibility
- Fix completion_mask generation during preprocessing
2026-01-11 (v0.0.25)
Feature: Response-Only Training (SFT Label Masking)
Major change for proper SFT behavior. Models now see the full conversation context in attention but only compute loss on assistant responses. This is the expected behavior for supervised fine-tuning and post-training.
Why this matters:
- SFT/Post-training: Train the model to generate good responses given context. The model should attend to user messages and system prompts but only be trained to predict assistant outputs.
- Pre-training: Different goal - maximize generalization and memorization across all tokens.
How it works:
- Full attention mask: Model sees the entire conversation (system + user + assistant)
- Label masking: Loss computed only on assistant/completion tokens
- Result: Model learns response patterns without memorizing prompts
New CLI param: --response-only-loss (default: true)
Supported models: Gemma, Qwen, Llama, Phi, Mistral (auto-detects response templates)
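A minimal sketch of what label masking looks like at the token level (not AITraining's implementation; the model ID and the Gemma-style <start_of_turn>model marker are only examples, and attention is left untouched):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")  # placeholder model
messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
ids = tokenizer(text, add_special_tokens=False, return_tensors="pt").input_ids[0]
labels = ids.clone()

# Loss is computed only on tokens after the assistant turn marker; everything
# before it (system + user context) gets the ignore index -100.
marker = tokenizer("<start_of_turn>model\n", add_special_tokens=False).input_ids
for start in range(len(ids) - len(marker) + 1):
    if ids[start : start + len(marker)].tolist() == marker:
        labels[: start + len(marker)] = -100
        break
```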
Commit: 87a87c1
2026-01-10 (v0.0.24)
Change: OpenAI Format for Tool Calls Serialization
Change: Tool calls are now serialized in the full OpenAI format instead of the simplified format introduced in v0.0.23. This matches the format used in system prompt instructions for better model learning.
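Before/after sketch (values are illustrative; the two formats match those described elsewhere in this changelog):

```python
# Simplified format used in v0.0.23 (illustrative values):
before = {"tool": "search", "arguments": {"query": "weather"}}

# Full OpenAI tool_calls format used from v0.0.24 on (illustrative values):
after = {
    "id": "call_123",
    "type": "function",
    "function": {"name": "search", "arguments": '{"query": "weather"}'},
}
```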
2026-01-10 (v0.0.23)
Change: Plain JSON for Tool Calls Serialization
Change: Removed the [Tool Call] prefix from serialized tool calls. Tool calls are now output as plain JSON for cleaner training data.
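A small sketch of the change (example values only):

```python
import json

call = {"tool": "search", "arguments": {"query": "weather"}}
old_style = "[Tool Call] " + json.dumps(call)   # serialized form up to v0.0.22
new_style = json.dumps(call)                    # v0.0.23: plain JSON, no prefix
```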
2026-01-10 (v0.0.22)
Feature: Tools Definitions Injection for Non-Native Models
New: Models that don't natively support the tools parameter (like Gemma) can now train on function calling data with tool definitions.
How it works:
- Detects if the tokenizer supports the tools parameter natively
- If not supported, injects tool definitions as formatted text into the system prompt (or first user message)
- Models learn to understand and respond to tool definitions
New helpers:
- check_tools_support() - Detects native tools parameter support
- format_tools_as_text() - Formats tool definitions as readable text
- inject_tools_into_messages() - Injects tools into the system/user message
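A rough sketch of the injection idea (a hypothetical helper, not the real check_tools_support()/format_tools_as_text()/inject_tools_into_messages() implementations; the prompt wording is an assumption):

```python
import json

def inject_tools_into_system(messages: list[dict], tools: list[dict]) -> list[dict]:
    """Prepend tool definitions as text to the system prompt, or to the first user message."""
    tools_text = "You have access to these tools:\n" + "\n".join(json.dumps(t) for t in tools)
    messages = [dict(m) for m in messages]
    if messages and messages[0]["role"] == "system":
        messages[0]["content"] = tools_text + "\n\n" + messages[0]["content"]
    else:
        for m in messages:
            if m["role"] == "user":
                m["content"] = tools_text + "\n\n" + m["content"]
                break
    return messages
```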
2026-01-08 (v0.0.21)
Fix: SFTTrainer Using Wrong Column After Chat Template Processing
Issue: When chat template processing converted messages to text, SFTTrainer was still trying to use the original messages column. This caused tokenization errors because it tried to tokenize a list instead of the processed string.
Fix: Now correctly sets dataset_text_field='text' when chat template is applied.
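In TRL terms, the corrected wiring is roughly the following (a sketch, not AITraining's exact code):

```python
from trl import SFTConfig

# After chat-template processing produces a "text" column, the trainer must be
# pointed at that column instead of the original "messages" column.
args = SFTConfig(output_dir="out", dataset_text_field="text")
```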
Commit: c2bdf05
2026-01-08 (v0.0.20)
Fix: Double BOS Token Issue
Issue: When training with pre-processed datasets or using chat templates, models would get duplicate BOS tokens (e.g., <bos><bos> or <|begin_of_text|><|begin_of_text|>). This happened because the chat template added BOS, and then the tokenizer added another one during training.
Fix: BOS tokens are now stripped from rendered text before saving to processed datasets. This allows the tokenizer to add BOS correctly during training, preventing duplicates. Works universally for all tokenizers:
- Gemma: <bos>
- Llama 3: <|begin_of_text|>
- Llama 2/Mistral: <s>
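A minimal sketch of the stripping step, assuming the tokenizer exposes bos_token (the real implementation lives in the dataset processing code):

```python
def strip_leading_bos(text: str, tokenizer) -> str:
    """Remove a leading BOS token from rendered text so the tokenizer adds it exactly once."""
    bos = getattr(tokenizer, "bos_token", None)
    if bos and text.startswith(bos):
        return text[len(bos):]
    return text
```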
Fix: BOS Stripping for Already-Formatted Data
Issue: When loading datasets that were previously processed with chat templates, Llama 3 (which lacks the add_bos_token attribute) would always get double BOS tokens.
Fix: BOS tokens are now stripped directly from text data when loading already-formatted datasets. This works for any tokenizer with a bos_token defined.
Commit: 24a3af9
Feature: Preserve Original Messages Column
Issue: Processing overwrote the original messages column, making it impossible to inspect the source data. Other tools could also auto-detect and incorrectly use the unprocessed column.
Fix: Processing now:
- Creates a text column with the formatted output
- Renames original columns with an _original_* prefix (e.g., _original_messages)
- Prevents auto-detection conflicts with other frameworks
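For illustration, roughly what the column handling amounts to with the datasets library (values are placeholders):

```python
from datasets import Dataset

ds = Dataset.from_dict({"messages": [[{"role": "user", "content": "Hi"}]]})
ds = ds.rename_column("messages", "_original_messages")  # keep the source data inspectable
ds = ds.add_column("text", ["<rendered conversation>"])   # formatted training text
```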
Feature: Processed Dataset Saving and Model Card Improvements
New: Processed training data is now automatically saved:
- Locally to {project}/data_processed/
- Optionally to Hub as a private dataset
- New CLI param: --save-processed-data (auto|local|hub|both|none)
Model card improvements:
- Training details table (base model, trainer, dataset, epochs, LR, etc.)
- Extra params section (LoRA rank/alpha, quantization, chat template)
- Updated links to AITraining GitHub repo
Fix: Clean Tool Call Serialization and Legacy Function Role Support
Issue: Tool calls were serialized using the raw OpenAI format with a nested "function" key, making training data verbose and format-specific. Additionally, the older OpenAI "function" role (used for tool responses before the "tool" role existed) was not handled.
Fix:
- Tool calls are now serialized to a clean format:
  - Before: [Tool Calls] [{"id": "call_123", "type": "function", "function": {"name": "search", "arguments": "..."}}]
  - After: [Tool Call] {"tool": "search", "arguments": {"query": "weather"}}
- The "function" role (older OpenAI format) is now handled the same as the "tool" role - converted to "user" with a [Tool Result] prefix for models that don't support it natively.
Fix: Complete tool_calls Preservation Across All Code Paths
Issue: The v0.0.18 fix for tool_calls was incomplete - several code paths still dropped tool_calls:
- render_conversation() in the message renderer blindly serialized without checking tokenizer support
- Fallback functions in project.py and preprocessor/llm.py dropped tool_calls
- format_chat_prompt() and build_supervised_example() in the rendering utils dropped tool_calls
Fix:
- Added _check_tool_calls_support() to TokenizerNativeRenderer to detect native support
- render_conversation() now:
  - Passes tool_calls through natively for models that support it (Qwen, Llama 3.1+)
  - Only serializes to JSON for models that don't (Gemma)
- All code paths now preserve tool_calls when creating Message objects
- Fallback functions preserve tool_calls in content
2026-01-07
Fix: tool_calls Field Being Dropped in Training Data
Issue: When training data contains a tool_calls field (from function calling conversations), the field was silently dropped. Models never learned to make tool calls.
Root Cause: The Message class only extracted role and content from messages:
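A sketch of the shape of the bug and the fix (not the actual class definitions; the names MessageBefore/MessageAfter and the field types are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MessageBefore:          # old behavior: tool_calls silently lost
    role: str
    content: str

@dataclass
class MessageAfter:           # new behavior: tool_calls carried through
    role: str
    content: str
    tool_calls: Optional[list] = None
```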
Fix: Message objects now preserve tool_calls, and the renderer:
- Detects if the tokenizer supports tool_calls natively (Qwen, Llama 3.1+)
- Preserves the native format for models that support it
- Serializes to JSON in content for models that don't (Gemma, older models)
System prompt instructions explain the convention to the model: emit the [Tool Call] JSON, execute the tool, and don't show the JSON to the user.
Fix: Message Alternation Errors with Strict Models
Issue: Training data with consecutive same-role messages or system → assistant patterns (without a user message in between) failed on strict-alternation models like Gemma:
- Consecutive assistant messages (e.g., multi-part responses)
- System message followed directly by assistant (no user prompt)
- Multiple user messages in a row
Fix:
- Merges consecutive same-role messages (preserving content)
- Inserts placeholder [Continued] user messages when an assistant turn follows a system or assistant turn
- Only applies when the tokenizer rejects the format (dynamic detection)
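A sketch of the alternation repair (a hypothetical helper; the real logic only runs when the tokenizer rejects the original message sequence):

```python
def enforce_alternation(messages: list[dict]) -> list[dict]:
    fixed = []
    for msg in messages:
        if fixed and fixed[-1]["role"] == msg["role"]:
            # Merge consecutive same-role messages, preserving content.
            fixed[-1]["content"] += "\n" + msg["content"]
            continue
        if msg["role"] == "assistant" and (not fixed or fixed[-1]["role"] in ("system", "assistant")):
            # Insert a placeholder user turn so strict templates accept the sequence.
            fixed.append({"role": "user", "content": "[Continued]"})
        fixed.append(dict(msg))
    return fixed
```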
Fix: Tool Role Messages Breaking Native Tokenizer Rendering
Issue: When training data contains tool role messages (from function calling), models that require strict user/assistant alternation (like Gemma) would fail.
Root Cause: TokenizerNativeRenderer passed messages directly to tokenizer.apply_chat_template() without preprocessing, and tokenizers like Gemma don't support the tool role.
Fix: Added smart tool role handling that:
- Detects if the tokenizer supports the tool role by testing with a sample message (result is cached)
- Only converts tool → user with a [Tool Result] prefix when the tokenizer doesn't support it
- Preserves native tool handling for models that support it (Llama 3.1+, Mistral, etc.)
- Merges consecutive same-role messages to maintain strict alternation when needed
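A sketch of the capability probe (a hypothetical function; AITraining caches the result per tokenizer):

```python
def supports_tool_role(tokenizer) -> bool:
    """Return True if apply_chat_template accepts a message with role="tool"."""
    sample = [
        {"role": "user", "content": "ping"},
        {"role": "tool", "content": '{"result": "pong"}'},
    ]
    try:
        tokenizer.apply_chat_template(sample, tokenize=False)
        return True
    except Exception:
        return False
```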
Fix: Chat Template “tokenizer” Incorrectly Using ChatML Format
Issue: When using --chat-template tokenizer (the default for SFT training), the system incorrectly used ChatML format instead of the model's native chat template. This caused ChatML tokens (<|im_start|>, <|im_end|>) to be added as literal text in training data.
Impact: Models trained with this bug learned to output ChatML tokens as regular text. For example, a Gemma model would emit literal <|im_start|> / <|im_end|> markers in its responses.
Root Cause: In clm/utils.py, the chat format mapping pointed the tokenizer option at the ChatML renderer. This caused ChatMLRenderer to be used (which adds ChatML tokens via string concatenation) instead of TokenizerNativeRenderer (which correctly uses tokenizer.apply_chat_template()).
Fix: Changed the mapping so that the tokenizer option resolves to TokenizerNativeRenderer. This affects anyone using --chat-template tokenizer or the SFT trainer default.
Retraining required: Models trained before this fix that exhibit ChatML token output need to be retrained.
Fix: HuggingFace Push Using Full Path as Repo Name (All Trainers)
Issue: When project_name was a full path like /workspace/trainings/my-model, pushing to HuggingFace Hub created an invalid repo ID like username//workspace/trainings/my-model.
Fix: Now uses basename(project_name) to extract just the folder name, creating valid repo IDs like username/my-model.
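A quick illustration of the basename behavior:

```python
import os

os.path.basename("/workspace/trainings/my-model")  # -> "my-model", so the repo ID becomes "username/my-model"
```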
Affected trainers (all fixed):
- CLM (LLM fine-tuning)
- VLM (Vision-Language Models)
- Text Classification
- Text Regression
- Token Classification
- Sentence Transformers
- Image Classification
- Image Regression
- Object Detection
- Seq2Seq
- Extractive QA
- Tabular
Feature: --repo-id Parameter for Custom HuggingFace Destination
Added --repo-id CLI parameter to specify a custom HuggingFace repository destination. Useful for:
- Pushing to an organization instead of your personal account
- Using a different repo name than your local project_name
When --repo-id is set, --username is not required since the repo ID already specifies the destination.
Feature: Post-Trial Actions for Hyperparameter Sweeps
Added the ability to execute custom actions after each sweep trial completes. Each action receives trial context via environment variables:
- TRIAL_NUMBER - Trial index (0-based)
- TRIAL_METRIC_VALUE - Metric value for this trial
- TRIAL_IS_BEST - Whether this is the best trial so far (true/false)
- TRIAL_OUTPUT_DIR - Output directory for the trial
- TRIAL_PARAMS - Trial parameters as a string
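For example, a post-trial script (illustrative) can read the exported variables like this:

```python
import os

trial = int(os.environ["TRIAL_NUMBER"])
metric = float(os.environ["TRIAL_METRIC_VALUE"])
if os.environ.get("TRIAL_IS_BEST") == "true":
    print(f"Trial {trial} is the best so far ({metric:.4f}); "
          f"artifacts in {os.environ['TRIAL_OUTPUT_DIR']}")
```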
2026-01-06
Feature: --wandb-run-id Parameter for Run Resumption
Added --wandb-run-id CLI parameter to resume an existing W&B run instead of creating a new one. Useful when running AITraining from external W&B sweep agents.
Usage: pass --wandb-run-id <run_id>; the trainer is then configured with WANDB_RESUME=allow so it resumes the specified run instead of creating a duplicate.
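On the W&B side, resumption boils down to something like this (illustrative values):

```python
import wandb

run = wandb.init(project="my-project", id="abc123xyz", resume="allow")
```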
Fix: Duplicate W&B Runs in Sweeps
Issue: Each sweep trial was creating two W&B runs - one from the sweep code and one from the trainer.
Root Cause: The sweep code called wandb.init(), then the trainer also called wandb.init() internally, creating a duplicate run.
Fix: After sweep’s wandb.init(), set WANDB_RUN_ID and WANDB_RESUME=allow env vars so the trainer resumes the same run instead of creating a new one.
Improvement: Better Error Message for Missing Text Column
When a dataset has a messages column but training expects text, the error message now suggests the fix.
Fix: WANDB_PROJECT Using Path Instead of Name
Issue: Running sweeps with W&B logging failed because WANDB_PROJECT fell back to config.project_name (the output path) instead of just the project name.
Fix: Use os.path.basename(config.project_name) to extract just the project name from the path.
Fix: Model Loaded in float32 Instead of bf16/fp16 on CUDA
Issue: When using mixed_precision=bf16 or fp16 on CUDA, the model was loaded in float32, causing 2x VRAM usage.
Root Cause: The torch_dtype parameter wasn’t being passed to from_pretrained() in the CUDA code path. Only MPS had dtype conversion.
Impact:
- Model weights used 2x more VRAM than necessary
- Training still worked (trainer used bf16 for compute), but was suboptimal
Fix: Now passes torch_dtype in model_kwargs when CUDA is available.
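Roughly the kind of change involved (a sketch; the model ID is a placeholder and the dtype follows mixed_precision):

```python
import torch
from transformers import AutoModelForCausalLM

model_kwargs = {"torch_dtype": torch.bfloat16}  # or torch.float16 for fp16
model = AutoModelForCausalLM.from_pretrained("gpt2", **model_kwargs)
```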
Fix: W&B Sweep Logs to Wrong Project
Issue: During sweeps with W&B logging, trainer runs were logged to the default "huggingface" project instead of the configured sweep project.
Root Cause: The sweep called wandb.init() with the correct project, but the trainer's internal wandb.init() didn't know about it.
Fix: Set WANDB_PROJECT and WANDB_ENTITY environment variables before calling the trainer, so any subsequent wandb.init() uses the correct project.
Fix: bitsandbytes CUDA 12.x Compatibility
Issue: Training with LoRA failed on CUDA 12.8 environments because bitsandbytes 0.42.0 lacks binaries for newer CUDA releases.
Fix: Updated the bitsandbytes requirement from ==0.42.0 to >=0.45.0. Version 0.45.0+ uses a new multi-backend system that doesn't require version-specific CUDA binaries.
Commit: f13a068
2026-01-05
Feature: W&B Native Sweep Integration
Added native Weights & Biases sweep support for hyperparameter optimization. When enabled, sweep runs are grouped in W&B's native sweep dashboard, providing aggregated views and parallel coordinates plots.
New Parameters:
- wandb_sweep: Enable W&B native sweep dashboard (default: false)
- wandb_sweep_project: W&B project name for the sweep (defaults to project_name)
- wandb_sweep_entity: W&B entity (team/username) for the sweep
- wandb_sweep_id: Existing sweep ID to continue (skips creating a new sweep)
When wandb_sweep is enabled, each trial run is linked to the sweep via wandb.init(group=sweep_id), creating an aggregated view in W&B.
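Roughly how a trial run gets attached to the sweep (illustrative values):

```python
import wandb

run = wandb.init(
    project="my-sweep-project",   # wandb_sweep_project
    entity="my-team",             # wandb_sweep_entity
    group="sweep_abc123",         # the sweep ID, so trials aggregate in one view
)
```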
Commit: e49abc9
Fix: CLI Missing FIELD_SCOPES for W&B Sweep Parameters
Issue: Running autotrain llm --wandb-sweep via the CLI failed.
Root Cause: The W&B sweep fields were added to LLMTrainingParams but not to FIELD_SCOPES in the CLI argument parser.
Fix: Added the missing fields to FIELD_SCOPES and added a test to prevent this regression.
Note: This only affected the CLI (autotrain llm ...). The Python API and TUI were not affected.
Commit: 7994989
Fix: Sweep Parameters Accept Dict Format
Fixed sweep_params to accept both list and dict formats. Previously only the list format worked; now both are supported.
Supported parameter types: categorical, loguniform, uniform, int.
Commit: 15aa38a
Fix: Auto-detect model_max_length from Model Config
Previously model_max_length defaulted to 2048 regardless of model capability, causing block_size to be silently capped even when the model supports longer sequences.
The Problem:
- Gemma 3 supports 32K-128K context (depending on variant), but block_size was capped to 2048
- Users had to manually set --model-max-length to use longer sequences
The Fix:
- Auto-detect max_position_embeddings from the model config
- Handles VLMs (reads from text_config) and regular LLMs
- Falls back to 2048 with a warning if auto-detect fails
- Users can still override with --model-max-length
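A sketch of the detection (the model ID is a placeholder; the actual fallback and warning live in the trainer):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("google/gemma-3-4b-it")
text_cfg = getattr(cfg, "text_config", cfg)                             # VLMs keep LLM settings here
model_max_length = getattr(text_cfg, "max_position_embeddings", 2048)   # fall back to 2048
```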
Dependency Update: Gemma 3n Support
Updated dependencies to support Gemma 3n and other new models:
- transformers: 4.57.1 → 4.57.3
- timm: 1.0.12 → 1.0.22 (adds mobilenetv5_300m_enc for the Gemma 3n vision tower)
- huggingface_hub: ==0.34.4 → >=0.34.0 (flexible constraint)
2025-12-02
Bug Fix: ORPO Training Beta Parameter Not Applied
Issue: The dpo_beta parameter was not being passed to TRL's ORPOConfig during ORPO training, causing user-specified beta values to be silently ignored.
Impact: Users setting dpo_beta for ORPO training (e.g., dpo_beta=0.5) would have their setting ignored. ORPO would always use TRL’s default value of 0.1 regardless of user configuration.
Root Cause: In train_clm_orpo.py, the code was missing the line that passes the beta parameter to ORPOConfig.
Fix: Added training_args["beta"] = config.dpo_beta so the user's beta value is passed to ORPO training.
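A sketch of where the one-line fix lands (simplified; _Config stands in for AITraining's training params object):

```python
from trl import ORPOConfig

class _Config:                 # stand-in for AITraining's config object
    dpo_beta = 0.5

config = _Config()
training_args = {"output_dir": "out", "per_device_train_batch_size": 1}
training_args["beta"] = config.dpo_beta   # the previously missing line
orpo_args = ORPOConfig(**training_args)
```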
Test Added: New test test_orpo_beta_parameter verifies that different beta values (0.01, 0.1, 0.5) are correctly applied during ORPO training.
Commit: a37e288
For questions or issues, please open an issue on GitHub.