Changelog

Track all notable changes, bug fixes, and improvements to AITraining.

2026-01-12 (v0.0.26)

Bug Fixes

  • Fix tool_calls content duplication in training data
  • Fix tokenizer settings and turn marker validation
  • Fix pre-tokenization for TRL 0.26 compatibility
  • Fix completion_mask generation during preprocessing

2026-01-11 (v0.0.25)

Feature: Response-Only Training (SFT Label Masking)

Major change for proper SFT behavior. Models now see the full conversation context in attention but only compute loss on assistant responses. This is the expected behavior for supervised fine-tuning and post-training. Why this matters:
  • SFT/Post-training: Train the model to generate good responses given context. The model should attend to user messages and system prompts but only be trained to predict assistant outputs.
  • Pre-training: Different goal - maximize generalization and memorization across all tokens.
How it works with TRL 0.26:
  1. Full attention mask: Model sees entire conversation (system + user + assistant)
  2. Label masking: Loss computed only on assistant/completion tokens
  3. Result: Model learns response patterns without memorizing prompts
New parameter: --response-only-loss (default: true). Supported models: Gemma, Qwen, Llama, Phi, Mistral (response templates are auto-detected). Commit: 87a87c1
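For illustration, a minimal sketch of the label-masking step described above (assumes a 0/1 completion_mask per token; not the actual implementation):

# Illustrative sketch: the model attends to every token, but labels for
# non-assistant tokens are set to -100 so cross-entropy ignores them.
def mask_labels(input_ids: list[int], completion_mask: list[int]) -> list[int]:
    # -100 is the ignore_index used by PyTorch's CrossEntropyLoss.
    return [tok if keep else -100 for tok, keep in zip(input_ids, completion_mask)]

# Only the last three tokens (the assistant reply) contribute to the loss.
labels = mask_labels([101, 7592, 2088, 102, 3449, 999, 102],
                     [0, 0, 0, 0, 1, 1, 1])
assert labels[:4] == [-100] * 4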

2026-01-10 (v0.0.24)

Change: OpenAI Format for Tool Calls Serialization

Change: Tool calls are now serialized in full OpenAI format instead of the simplified format. This matches the format used in system prompt instructions for better model learning. Before (v0.0.23):
{"tool": "get_weather", "arguments": {"location": "Paris"}}
After (v0.0.24):
{"content": "Let me check the weather.", "tool_calls": [{"id": "call_001", "type": "function", "function": {"name": "get_weather", "arguments": "{\"location\": \"Paris\"}"}}]}
Commit: 3f6bc15
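A hedged sketch of the conversion (the helper and the call_001 id are hypothetical, shown only to illustrate the structure; note that arguments becomes a JSON string):

import json

# Hypothetical helper: build the full OpenAI-style tool_calls entry from the
# old simplified form. "arguments" is serialized to a JSON string.
def to_openai_tool_call(simple: dict, call_id: str = "call_001") -> dict:
    return {
        "id": call_id,
        "type": "function",
        "function": {
            "name": simple["tool"],
            "arguments": json.dumps(simple["arguments"]),
        },
    }

print(to_openai_tool_call({"tool": "get_weather", "arguments": {"location": "Paris"}}))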

2026-01-10 (v0.0.23)

Change: Plain JSON for Tool Calls Serialization

Change: Removed the [Tool Call] prefix from serialized tool calls. Tool calls are now output as plain JSON for cleaner training data. Before:
[Tool Call] {"tool": "get_weather", "arguments": {"location": "Paris"}}
After:
{"tool": "get_weather", "arguments": {"location": "Paris"}}
Also removed: The format instruction footer from tool definitions injection (models learn the format from examples). Commit: cde1948

2026-01-10 (v0.0.22)

Feature: Tools Definitions Injection for Non-Native Models

New: Models that don’t natively support the tools parameter (like Gemma) can now be trained on function calling data with tool definitions. How it works:
  1. Detects if tokenizer supports tools parameter natively
  2. If not supported, injects tool definitions as formatted text into the system prompt (or first user message)
  3. Models learn to understand and respond to tool definitions
Functions added:
  • check_tools_support() - Detects native tools parameter support
  • format_tools_as_text() - Formats tool definitions as readable text
  • inject_tools_into_messages() - Injects tools into system/user message
Example injection:
You have access to the following tools:

1. get_weather
   Description: Get current weather for a location
   Parameters:
   - location (string, required): City name
   - units (string, optional): celsius or fahrenheit
Commit: a4af6fe
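A rough sketch of the detect-then-inject flow (illustrative only; the real check_tools_support() and inject_tools_into_messages() signatures may differ):

# Heuristic check for native `tools` support plus text injection for models
# (like Gemma) whose chat template has no tools slot.
def supports_tools(tokenizer) -> bool:
    # Heuristic: templates that handle the `tools` argument reference it by name.
    return "tools" in (tokenizer.chat_template or "")

def inject_tools(messages: list[dict], tools_text: str) -> list[dict]:
    # Prepend the formatted tool definitions to the system prompt if present,
    # otherwise to the first user message.
    out = [dict(m) for m in messages]
    target = out[0] if out and out[0]["role"] == "system" else next(
        m for m in out if m["role"] == "user"
    )
    target["content"] = f"{tools_text}\n\n{target['content']}"
    return out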

2026-01-08 (v0.0.21)

Fix: SFTTrainer Using Wrong Column After Chat Template Processing

Issue: When chat template processing converted messages to text, SFTTrainer was still trying to use the original messages column. This caused tokenization errors because it tried to tokenize a list instead of the processed string. Fix: Now correctly sets dataset_text_field='text' when chat template is applied. Commit: c2bdf05
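For reference, the relevant setting in TRL (a sketch, assuming TRL's SFTConfig):

from trl import SFTConfig

# After chat-template processing produces a "text" column, the trainer is
# pointed at it explicitly instead of the original "messages" column.
args = SFTConfig(output_dir="out", dataset_text_field="text")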

2026-01-08 (v0.0.20)

Fix: Double BOS Token Issue

Issue: When training with pre-processed datasets or using chat templates, models would get duplicate BOS tokens (e.g., <bos><bos> or <|begin_of_text|><|begin_of_text|>). This happened because the chat template added BOS, and then the tokenizer added another one during training. Fix: BOS tokens are now stripped from rendered text before saving to processed datasets. This allows the tokenizer to add BOS correctly during training, preventing duplicates. Works universally for all tokenizers:
  • Gemma: <bos>
  • Llama 3: <|begin_of_text|>
  • Llama 2/Mistral: <s>
Commit: b124223
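A minimal sketch of the universal stripping step (illustrative, not the exact implementation):

# Strip a leading BOS token from chat-template output so the tokenizer can add
# it exactly once at training time. Works for any tokenizer that defines
# bos_token (<bos>, <|begin_of_text|>, <s>, ...).
def strip_leading_bos(text: str, tokenizer) -> str:
    bos = getattr(tokenizer, "bos_token", None)
    if bos and text.startswith(bos):
        return text[len(bos):]
    return text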

Fix: BOS Stripping for Already-Formatted Data

Issue: When loading datasets that were previously processed with chat templates, Llama 3 (which lacks the add_bos_token attribute) would always get double BOS tokens. Fix: BOS tokens are now stripped directly from text data when loading already-formatted datasets. This works for any tokenizer with a bos_token defined. Commit: 24a3af9

Feature: Preserve Original Messages Column

Issue: Processing overwrote the original messages column, making it impossible to inspect the source data. Other tools could also auto-detect and incorrectly use the unprocessed column. Fix: Processing now:
  1. Creates a text column with formatted output
  2. Renames original columns with an _original_ prefix (e.g., _original_messages)
  3. Prevents auto-detection conflicts with other frameworks
Commit: f73a7e3, bb146bb
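A sketch of the column handling with the datasets library (illustrative; the placeholder text value stands in for the rendered chat):

from datasets import Dataset

ds = Dataset.from_dict({"messages": [[{"role": "user", "content": "hi"}]]})

# Keep the source data inspectable but out of the way of column auto-detection.
ds = ds.add_column("text", ["<formatted chat text>"])    # formatted output
ds = ds.rename_column("messages", "_original_messages")  # preserve the original
print(ds.column_names)  # ['_original_messages', 'text']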

Feature: Processed Dataset Saving and Model Card Improvements

New: Processed training data is now automatically saved:
  • Locally to {project}/data_processed/
  • Optionally to Hub as private dataset
  • New CLI param: --save-processed-data (auto|local|hub|both|none)
Model Card Improvements:
  • Training details table (base model, trainer, dataset, epochs, LR, etc.)
  • Extra params section (LoRA rank/alpha, quantization, chat template)
  • Updated links to AITraining GitHub repo
Commit: 299b873

Fix: Clean Tool Call Serialization and Legacy Function Role Support

Issue: Tool calls were serialized using the raw OpenAI format with nested "function" key, making training data verbose and format-specific. Additionally, the older OpenAI "function" role (used for tool responses before the "tool" role existed) was not handled. Fix:
  1. Tool calls are now serialized to a clean format:
    • Before: [Tool Calls] [{"id": "call_123", "type": "function", "function": {"name": "search", "arguments": "..."}}]
    • After: [Tool Call] {"tool": "search", "arguments": {"query": "weather"}}
  2. The "function" role (older OpenAI format) is now handled the same as "tool" role - converted to "user" with [Tool Result] prefix for models that don’t support it natively.
Example:
# Input with OpenAI format
{
    "role": "assistant",
    "tool_calls": [{"id": "call_123", "type": "function", "function": {"name": "search", "arguments": "{\"q\": \"test\"}"}}]
}

# Output (clean format)
{
    "role": "assistant",
    "content": "[Tool Call] {\"tool\": \"search\", \"arguments\": {\"q\": \"test\"}}"
}
Commit: 5bbbdd8

Fix: Complete tool_calls Preservation Across All Code Paths

Issue: The v0.0.18 fix for tool_calls was incomplete - several code paths still dropped tool_calls:
  • render_conversation() in message renderer blindly serialized without checking tokenizer support
  • Fallback functions in project.py and preprocessor/llm.py dropped tool_calls
  • format_chat_prompt() and build_supervised_example() in rendering utils dropped tool_calls
Fix:
  1. Added _check_tool_calls_support() to TokenizerNativeRenderer to detect native support
  2. render_conversation() now:
    • Passes tool_calls through natively for models that support it (Qwen, Llama 3.1+)
    • Only serializes to JSON for models that don’t (Gemma)
  3. All code paths now preserve tool_calls when creating Message objects
  4. Fallback functions preserve tool_calls in content
Pattern: All main code paths now check tokenizer support before converting. This matches the existing pattern for tool role detection.

2026-01-07

Fix: tool_calls Field Being Dropped in Training Data

Issue: When training data contains tool_calls field (from function calling conversations), the field was silently dropped. Models never learned to make tool calls. Root Cause: The Message class only extracted role and content from messages:
Message(role=m["role"], content=m["content"])  # tool_calls ignored!
Fix: Added smart tool_calls handling that:
  1. Detects if the tokenizer supports tool_calls natively (Qwen, Llama 3.1+)
  2. Preserves native format for models that support it
  3. Serializes to JSON in content for models that don’t (Gemma, older models)
Example for models without native support:
# Input with tool_calls
{
    "role": "assistant",
    "content": "Let me check.",
    "tool_calls": [{"function": {"name": "weather", "arguments": "{\"city\": \"Paris\"}"}}]
}

# Output (auto-serialized for Gemma)
{
    "role": "assistant",
    "content": "Let me check.\n[Tool Call] {\"tool\": \"weather\", \"arguments\": {\"city\": \"Paris\"}}"
}
Note: At inference, parse the [Tool Call] JSON, execute the tool, and don’t show the JSON to the user.
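A sketch of the inference-side parsing the note refers to (illustrative; marker handling is up to the caller):

import json

# Pull serialized tool calls out of generated text so they can be executed
# instead of being shown to the user.
def extract_tool_calls(generated: str) -> list[dict]:
    calls = []
    for line in generated.splitlines():
        line = line.strip()
        if line.startswith("[Tool Call]"):
            calls.append(json.loads(line[len("[Tool Call]"):].strip()))
    return calls

text = 'Let me check.\n[Tool Call] {"tool": "weather", "arguments": {"city": "Paris"}}'
print(extract_tool_calls(text))  # [{'tool': 'weather', 'arguments': {'city': 'Paris'}}]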

Fix: Message Alternation Errors with Strict Models

Issue: Training data with consecutive same-role messages or system → assistant patterns (without a user message in between) failed on strict-alternation models like Gemma:
Conversation roles must alternate user/assistant/user/assistant/...
Root Cause: Some datasets have:
  • Consecutive assistant messages (e.g., multi-part responses)
  • System message followed directly by assistant (no user prompt)
  • Multiple user messages in a row
Fix: Added automatic message alternation fix that:
  1. Merges consecutive same-role messages (preserving content)
  2. Inserts placeholder [Continued] user messages when assistant follows system/assistant
  3. Only applies when the tokenizer rejects the format (dynamic detection)
Example transformation:
# Input with consecutive assistants
[
    {"role": "system", "content": "You are helpful"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "assistant", "content": "How can I help?"}
]

# Output (auto-fixed)
[
    {"role": "system", "content": "You are helpful"},
    {"role": "user", "content": "[Continued]"},
    {"role": "assistant", "content": "Hello!\nHow can I help?"}
]
Note: This fix combines with the tool role fix below - both are applied automatically as needed.
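A simplified sketch of the merge-and-pad logic (illustrative; the real fix only runs when the tokenizer rejects the original format):

# Merge consecutive same-role messages and insert a placeholder user turn when
# an assistant message would otherwise follow system/assistant.
def fix_alternation(messages: list[dict]) -> list[dict]:
    fixed: list[dict] = []
    for msg in messages:
        if fixed and fixed[-1]["role"] == msg["role"]:
            fixed[-1]["content"] += "\n" + msg["content"]  # merge same-role turns
            continue
        if msg["role"] == "assistant" and (not fixed or fixed[-1]["role"] != "user"):
            fixed.append({"role": "user", "content": "[Continued]"})
        fixed.append(dict(msg))
    return fixed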

Fix: Tool Role Messages Breaking Native Tokenizer Rendering

Issue: When training data contains tool role messages (from function calling), models that require strict user/assistant alternation (like Gemma) would fail with:
Conversation roles must alternate user/assistant/user/assistant/...
Root Cause: The TokenizerNativeRenderer passed messages directly to tokenizer.apply_chat_template() without preprocessing. Tokenizers like Gemma don’t support the tool role. Fix: Added smart tool role handling that:
  1. Detects if the tokenizer supports tool role by testing with a sample message (result is cached)
  2. Only converts tool → user with the [Tool Result] prefix when the tokenizer doesn’t support it
  3. Preserves native tool handling for models that support it (Llama 3.1+, Mistral, etc.)
  4. Merges consecutive same-role messages to maintain strict alternation when needed
Example transformation (only for non-supporting models like Gemma):
# Input with tool role
[
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "Let me calculate"},
    {"role": "tool", "content": "4"},
    {"role": "assistant", "content": "The answer is 4"}
]

# Output for Gemma (auto-converted)
[
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "Let me calculate"},
    {"role": "user", "content": "[Tool Result] 4"},
    {"role": "assistant", "content": "The answer is 4"}
]

# Output for Llama 3.1+ (preserved as-is)
# Same as input - native tool support used
Affected models: Gemma 2, Gemma 3, Gemma 3n, and any model with strict alternation requirements. Models with native tool support are unaffected.
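A sketch of the cached support probe (illustrative; the actual detection in TokenizerNativeRenderer may differ):

# Probe whether a chat template accepts the "tool" role by rendering a tiny
# conversation that contains one; cache the answer per tokenizer.
_TOOL_ROLE_CACHE: dict[str, bool] = {}

def supports_tool_role(tokenizer) -> bool:
    key = getattr(tokenizer, "name_or_path", repr(tokenizer))
    if key not in _TOOL_ROLE_CACHE:
        probe = [
            {"role": "user", "content": "ping"},
            {"role": "assistant", "content": "calling a tool"},
            {"role": "tool", "content": "pong"},
        ]
        try:
            tokenizer.apply_chat_template(probe, tokenize=False)
            _TOOL_ROLE_CACHE[key] = True
        except Exception:
            _TOOL_ROLE_CACHE[key] = False
    return _TOOL_ROLE_CACHE[key]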

Fix: Chat Template “tokenizer” Incorrectly Using ChatML Format

Issue: When using --chat-template tokenizer (the default for SFT training), the system incorrectly used ChatML format instead of the model’s native chat template. This caused ChatML tokens (<|im_start|>, <|im_end|>) to be added as literal text in training data. Impact: Models trained with this bug learned to output ChatML tokens as regular text. For example, a Gemma model would output:
Response text<|im_end|><end_of_turn>
Instead of just:
Response text<end_of_turn>
Root Cause: In clm/utils.py, the chat format mapping had:
"tokenizer": "chatml",  # BUG - should be "native"
This caused ChatMLRenderer to be used (which adds ChatML tokens via string concatenation) instead of TokenizerNativeRenderer (which correctly uses tokenizer.apply_chat_template()). Fix: Changed the mapping to:
"tokenizer": "native",  # Use tokenizer's native apply_chat_template
Affected models: Any non-ChatML model trained with --chat-template tokenizer or the SFT trainer default. Retraining required: Models trained before this fix that exhibit ChatML token output need to be retrained.

Fix: HuggingFace Push Using Full Path as Repo Name (All Trainers)

Issue: When project_name was a full path like /workspace/trainings/my-model, pushing to HuggingFace Hub created an invalid repo ID like username//workspace/trainings/my-model. Fix: Now uses basename(project_name) to extract just the folder name, creating valid repo IDs like username/my-model. Affected trainers (all fixed):
  • CLM (LLM fine-tuning)
  • VLM (Vision-Language Models)
  • Text Classification
  • Text Regression
  • Token Classification
  • Sentence Transformers
  • Image Classification
  • Image Regression
  • Object Detection
  • Seq2Seq
  • Extractive QA
  • Tabular
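The core of the path fix, sketched (illustrative):

import os

# Derive a valid Hub repo ID even when project_name is a full path.
project_name = "/workspace/trainings/my-model"
repo_id = f"username/{os.path.basename(project_name)}"
print(repo_id)  # username/my-model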

Feature: --repo-id Parameter for Custom HuggingFace Destination

Added --repo-id CLI parameter to specify a custom HuggingFace repository destination. Useful for:
  • Pushing to an organization instead of your personal account
  • Using a different repo name than your local project_name
Usage:
# Push to organization
aitraining llm --train \
  --push-to-hub \
  --repo-id my-organization/my-model \
  --token $HF_TOKEN

# Push with custom name
aitraining llm --train \
  --push-to-hub \
  --repo-id username/production-model \
  --token $HF_TOKEN
When --repo-id is set, --username is not required since the repo ID already specifies the destination.

Feature: Post-Trial Actions for Hyperparameter Sweeps

Added ability to execute custom actions after each sweep trial completes. CLI Usage:
aitraining llm --train \
  --use-sweep \
  --post-trial-script 'if [ "$TRIAL_IS_BEST" = "true" ]; then git add . && git commit -m "Best model"; fi'
Environment Variables Available:
  • TRIAL_NUMBER - Trial index (0-based)
  • TRIAL_METRIC_VALUE - Metric value for this trial
  • TRIAL_IS_BEST - Whether this is the best trial so far (true/false)
  • TRIAL_OUTPUT_DIR - Output directory for the trial
  • TRIAL_PARAMS - Trial parameters as string
Python API:
from autotrain.utils import HyperparameterSweep, SweepConfig, TrialInfo

def on_trial_complete(trial_info: TrialInfo):
    if trial_info.is_best:
        save_checkpoint(trial_info.output_dir)

config = SweepConfig(
    parameters={"lr": (1e-5, 1e-3, "log_uniform")},
    post_trial_callback=on_trial_complete,
)

2026-01-06

Feature: --wandb-run-id Parameter for Run Resumption

Added --wandb-run-id CLI parameter to resume an existing W&B run instead of creating a new one. Useful when running AITraining from external W&B sweep agents. Usage:
autotrain llm --wandb-run-id abc123xyz ...
When set, AITraining automatically sets WANDB_RESUME=allow so the trainer resumes the specified run instead of creating a duplicate.

Fix: Duplicate W&B Runs in Sweeps

Issue: Each sweep trial was creating 2 W&B runs - one from the sweep code and one from the trainer. Root Cause: Sweep code called wandb.init(), then trainer also called wandb.init() internally, creating a duplicate run. Fix: After sweep’s wandb.init(), set WANDB_RUN_ID and WANDB_RESUME=allow env vars so the trainer resumes the same run instead of creating a new one.
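A sketch of the handoff (illustrative; assumes the wandb Python client, with a placeholder project name):

import os
import wandb

# The sweep code starts the run, then hands its ID to the trainer via
# environment variables so the trainer's own wandb.init() resumes it instead
# of opening a second run.
run = wandb.init(project="my-sweep-project")
os.environ["WANDB_RUN_ID"] = run.id
os.environ["WANDB_RESUME"] = "allow"
# ... launch the trainer; its internal wandb.init() picks up the same run ...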

Improvement: Better Error Message for Missing Text Column

When a dataset has a messages column but training expects a text column, the error now suggests the fix:
Hint: Your dataset has a 'messages' column. Use --text-column messages for chat format data.

Fix: WANDB_PROJECT Using Path Instead of Name

Issue: Running sweeps with W&B logging failed with:
wandb.errors.UsageError: Invalid project name '/workspace/trainings/hotel-sft-optuna-v2': cannot contain characters '/,\\,#,?,%,:', found '/'
Root Cause: The fix in 0.0.10 for W&B sweep logging was using config.project_name (the output path) instead of just the project name when falling back. Fix: Use os.path.basename(config.project_name) to extract just the project name from the path.

Fix: Model Loaded in float32 Instead of bf16/fp16 on CUDA

Issue: When using mixed_precision=bf16 or fp16 on CUDA, the model was loaded in float32, causing 2x VRAM usage. Root Cause: The torch_dtype parameter wasn’t being passed to from_pretrained() in the CUDA code path. Only MPS had dtype conversion. Impact:
  • Model weights used 2x more VRAM than necessary
  • Training still worked (trainer used bf16 for compute), but was suboptimal
Fix: Added torch_dtype to model_kwargs when CUDA is available:
if torch.cuda.is_available():
    model_kwargs["device_map"] = "auto"
    if config.mixed_precision == "bf16":
        model_kwargs["torch_dtype"] = torch.bfloat16
    elif config.mixed_precision == "fp16":
        model_kwargs["torch_dtype"] = torch.float16

Fix: W&B Sweep Logs to Wrong Project

Issue: During sweeps with W&B logging, trainer runs were logged to the default “huggingface” project instead of the configured sweep project. Root Cause: The sweep created wandb.init() with the correct project, but the trainer’s internal wandb.init() didn’t know about it. Fix: Set WANDB_PROJECT and WANDB_ENTITY environment variables before calling the trainer, so any subsequent wandb.init() uses the correct project.

Fix: bitsandbytes CUDA 12.x Compatibility

Issue: Training with LoRA failed on CUDA 12.8 environments with:
CUDA SETUP: Required library version not found: libbitsandbytes_cuda128.so
RuntimeError: CUDA Setup failed despite GPU being available.
Root Cause: bitsandbytes 0.42.0 doesn’t have pre-compiled binaries for CUDA 12.8. Fix: Upgraded bitsandbytes from ==0.42.0 to >=0.45.0. Version 0.45.0+ uses a new multi-backend system that doesn’t require version-specific CUDA binaries. Commit: f13a068

2026-01-05

Feature: W&B Native Sweep Integration

Added native Weights & Biases sweep support for hyperparameter optimization. When enabled, sweep runs are grouped in W&B’s native sweep dashboard, providing aggregated views and parallel coordinates plots. New Parameters:
  • wandb_sweep: Enable W&B native sweep dashboard (default: false)
  • wandb_sweep_project: W&B project name for sweep (defaults to project_name)
  • wandb_sweep_entity: W&B entity (team/username) for sweep
  • wandb_sweep_id: Existing sweep ID to continue (skips creating new sweep)
Usage:
autotrain llm \
  --use-sweep \
  --sweep-backend optuna \
  --wandb-sweep \
  --wandb-sweep-project my-sweep-project \
  --wandb-sweep-entity my-team
When wandb_sweep is enabled, each trial run is linked to the sweep via wandb.init(group=sweep_id), creating an aggregated view in W&B. Commit: e49abc9
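Roughly how the linking works (a sketch with the wandb client, not the project's exact code; config values are placeholders):

import wandb

# Create (or reuse) a sweep and group each trial run under its ID so W&B
# aggregates the trials in the sweep dashboard.
sweep_id = wandb.sweep(
    {"method": "random", "parameters": {"lr": {"min": 1e-5, "max": 1e-3}}},
    project="my-sweep-project",
)
trial_run = wandb.init(project="my-sweep-project", group=sweep_id)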

Fix: CLI Missing FIELD_SCOPES for W&B Sweep Parameters

Issue: Running autotrain llm --wandb-sweep via CLI failed with:
ValueError: Scope metadata is required for all fields but missing for: wandb_sweep, wandb_sweep_project, wandb_sweep_entity, wandb_sweep_id
Root Cause: The new W&B sweep parameters were added to LLMTrainingParams but not to FIELD_SCOPES in the CLI argument parser. Fix: Added the missing fields to FIELD_SCOPES and added a test to prevent this regression. Note: This only affected the CLI (autotrain llm ...). The Python API and TUI were not affected. Commit: 7994989

Fix: Sweep Parameters Accept Dict Format

Fixed sweep_params to accept both list and dict formats. Previously only the list format worked; now both are supported:
# List format (always worked)
sweep_params = json.dumps({
    "batch_size": [2, 4, 8],
})

# Dict format (now works)
sweep_params = json.dumps({
    "lr": {"type": "loguniform", "low": 1e-5, "high": 1e-3},
    "batch_size": {"type": "categorical", "values": [2, 4, 8]},
    "warmup_ratio": {"type": "uniform", "low": 0.0, "high": 0.2},
})
Supported dict types: categorical, loguniform, uniform, int. Commit: 15aa38a

Fix: Auto-detect model_max_length from Model Config

Previously model_max_length defaulted to 2048 regardless of model capability, causing block_size to be silently capped even when the model supports longer sequences. The Problem:
  • Gemma 3 supports 32K-128K context (depending on variant), but block_size was capped to 2048
  • Users had to manually set --model-max-length to use longer sequences
The Fix:
  • Auto-detect max_position_embeddings from model config
  • Handles VLMs (reads from text_config) and regular LLMs
  • Falls back to 2048 with warning if auto-detect fails
  • User can still override with --model-max-length
# Before: block_size silently capped to 2048
aitraining llm --model google/gemma-3-4b-it --block-size 4096
# block_size was capped to 2048!

# After: auto-detects model context length, allows 4096
aitraining llm --model google/gemma-3-4b-it --block-size 4096
# block_size is 4096 as expected
Commit: 85bd37c
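A sketch of the detection logic (illustrative; assumes transformers' AutoConfig):

from transformers import AutoConfig

# Read the context length from the model config, looking inside text_config
# for VLMs, and fall back to 2048 if nothing is found.
def detect_model_max_length(model_name: str, default: int = 2048) -> int:
    try:
        cfg = AutoConfig.from_pretrained(model_name)
    except Exception:
        return default
    for candidate in (getattr(cfg, "text_config", None), cfg):
        max_len = getattr(candidate, "max_position_embeddings", None)
        if max_len:
            return int(max_len)
    return default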

Dependency Update: Gemma 3n Support

Updated dependencies to support Gemma 3n and other new models:
  • transformers: 4.57.1 → 4.57.3
  • timm: 1.0.12 → 1.0.22 (adds mobilenetv5_300m_enc for Gemma 3n vision tower)
  • huggingface_hub: ==0.34.4 → >=0.34.0 (flexible constraint)
This enables support for Gemma 3n and other new models released in late 2024/2025.

2025-12-02

Bug Fix: ORPO Training Beta Parameter Not Applied

Issue: The dpo_beta parameter was not being passed to TRL’s ORPOConfig during ORPO training, causing user-specified beta values to be silently ignored. Impact: Users setting dpo_beta for ORPO training (e.g., dpo_beta=0.5) would have their setting ignored. ORPO would always use TRL’s default value of 0.1 regardless of user configuration. Root Cause: In train_clm_orpo.py, the code was missing the line to pass the beta parameter to ORPOConfig:
# Before (bug):
training_args["max_length"] = config.block_size
training_args["max_prompt_length"] = config.max_prompt_length  
training_args["max_completion_length"] = config.max_completion_length
args = ORPOConfig(**training_args)  # beta not passed!

# After (fix):
training_args["max_length"] = config.block_size
training_args["max_prompt_length"] = config.max_prompt_length
training_args["max_completion_length"] = config.max_completion_length
training_args["beta"] = config.dpo_beta  # Now correctly passed
args = ORPOConfig(**training_args)
Fix: Added training_args["beta"] = config.dpo_beta to ensure the user’s beta value is passed to ORPO training. Test Added: New test test_orpo_beta_parameter verifies that different beta values (0.01, 0.1, 0.5) are correctly applied during ORPO training. Commit: a37e288
For questions or issues, please open an issue on GitHub.