Axolotl

Fine-tuning framework for LLMs. Config-driven: every training run is defined by a single YAML file.

Tech Stack

Python, PyTorch, HuggingFace Transformers, TRL, PEFT (LoRA/QLoRA), DeepSpeed, FSDP, vLLM (for GRPO generation).

Commands

axolotl train config.yaml              # Train (single or multi-GPU, auto-detected)
axolotl preprocess config.yaml         # Tokenize dataset and validate config
axolotl preprocess config.yaml --debug # Inspect tokenized samples and label masking
axolotl inference config.yaml          # Interactive inference
axolotl merge-lora config.yaml         # Merge LoRA adapter into base model
axolotl vllm-serve config.yaml         # Start vLLM server for GRPO/EBFT training
axolotl fetch examples                 # Download example configs
axolotl agent-docs                     # Show agent-optimized docs (bundled with pip package)
axolotl agent-docs grpo                # Topic-specific agent reference
axolotl config-schema                  # Dump config JSON schema
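
The commands above chain into a typical run: preprocess and inspect, train, then merge the adapter. A minimal sketch of that sequence in Python follows. The helper name `axolotl_command` and the step ordering are assumptions made for illustration, not part of Axolotl; the helper only assembles argument lists and prints them.

```python
import shlex


def axolotl_command(subcommand: str, config: str, *flags: str) -> list[str]:
    """Build an argv list for one of the CLI subcommands listed above.

    Hypothetical helper: it only assembles arguments and does not check
    that the subcommand or flags exist.
    """
    return ["axolotl", subcommand, config, *flags]


# Assumed ordering: preprocess before training, merge the adapter afterwards.
workflow = [
    axolotl_command("preprocess", "config.yaml", "--debug"),
    axolotl_command("train", "config.yaml"),
    axolotl_command("merge-lora", "config.yaml"),
]

if __name__ == "__main__":
    for argv in workflow:
        # Print instead of executing; on a machine with axolotl installed,
        # subprocess.run(argv, check=True) would run each step.
        print(shlex.join(argv))
```

Keeping each step as a list rather than a shell string avoids quoting problems if a config path contains spaces.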

Training Methods

Method	Config Key	When to Use
SFT	(default)	Input-output pairs, instruction tuning
DPO/IPO	`rl: dpo` / `rl: dpo, dpo_loss_type: ["ipo"]`	Paired preference data (chosen vs rejected)
KTO	`rl: kto`	Unpaired binary preference labels
ORPO	`rl: orpo`	Single-stage alignment, no ref model
GRPO	`rl: grpo`	RL with verifiable reward functions (math, code)
EBFT	`rl: ebft`	Feature-matching rewards from internal representations
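
The Config Key column can be read as a small set of overrides applied on top of a base config. The sketch below encodes the table as a Python dict; the names `METHOD_OVERRIDES` and `with_method` are hypothetical, and the keys are copied from the table rather than taken from Axolotl's source.

```python
# Config keys copied from the table above; SFT is the default and needs no `rl` key.
METHOD_OVERRIDES: dict[str, dict] = {
    "sft": {},
    "dpo": {"rl": "dpo"},
    "ipo": {"rl": "dpo", "dpo_loss_type": ["ipo"]},
    "kto": {"rl": "kto"},
    "orpo": {"rl": "orpo"},
    "grpo": {"rl": "grpo"},
    "ebft": {"rl": "ebft"},
}


def with_method(base_config: dict, method: str) -> dict:
    """Return a copy of base_config with the keys for `method` applied."""
    try:
        overrides = METHOD_OVERRIDES[method.lower()]
    except KeyError:
        raise ValueError(f"unknown training method: {method!r}") from None
    # Merge into a new dict so the caller's base config is left untouched.
    return {**base_config, **overrides}
```

For example, `with_method({"base_model": "m"}, "ipo")` adds both `rl: dpo` and `dpo_loss_type: ["ipo"]`, matching the DPO/IPO row.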

Agent-specific references:

docs/agents/sft.md — supervised fine-tuning
docs/agents/preference_tuning.md — DPO, IPO, KTO, ORPO, SimPO
docs/agents/grpo.md — GRPO online RL with reward functions
docs/agents/reward_modelling.md — outcome and process reward models
docs/agents/pretraining.md — continual pretraining
docs/agents/model_architectures.md — model-specific quirks (Gemma4, Qwen3.5 MoE, etc.)
docs/agents/new_model_support.md — debugging and adding support for new model architectures

Config Pattern

All training is config-driven. A YAML file specifies model, adapter, dataset(s), and hyperparameters:

base_model: meta-llama/Llama-3.1-8B-Instruct
adapter: lora                    # or qlora, or omit for full fine-tune
datasets:
  - path: my_dataset
    type: chat_template          # prompt strategy (see docs/dataset-formats/)
output_dir: ./outputs/lora-out

Config schema: src/axolotl/utils/schemas/config.py (AxolotlInputConfig).
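
To make the adapter choice concrete, the following sketch renders the minimal config above for a LoRA, QLoRA, or full fine-tune run. The function `render_config` is hypothetical and uses plain string formatting with no quoting or escaping, so it only suits simple values like the ones shown; a YAML library is the safer choice for anything more involved.

```python
from typing import Optional


def render_config(
    base_model: str,
    dataset_path: str,
    output_dir: str,
    adapter: Optional[str] = None,
    dataset_type: str = "chat_template",
) -> str:
    """Render the minimal config from the example above as YAML text.

    Illustration only: values are inserted verbatim, without YAML escaping.
    """
    lines = [f"base_model: {base_model}"]
    if adapter is not None:
        # Omitting the adapter key entirely means a full fine-tune.
        lines.append(f"adapter: {adapter}")
    lines += [
        "datasets:",
        f"  - path: {dataset_path}",
        f"    type: {dataset_type}",
        f"output_dir: {output_dir}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_config(
        "meta-llama/Llama-3.1-8B-Instruct",
        "my_dataset",
        "./outputs/lora-out",
        adapter="lora",
    ))
```

Passing `adapter="qlora"` changes only the adapter line, and leaving `adapter` unset drops the line altogether.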

Project Structure

src/axolotl/
  cli/                           # CLI entry points (train, preprocess, inference, merge_lora, vllm_serve)
  core/
    builders/                    # TrainerBuilder classes (causal.py for SFT, rl.py for RLHF)
    trainers/                    # Trainer classes, mixins (optimizer, scheduler, packing)
      dpo/                       # DPO trainer and config
      grpo/                      # GRPO trainer and sampler
  loaders/                       # Model, tokenizer, adapter, processor loading
  prompt_strategies/             # Dataset format handlers (chat_template, alpaca, dpo/, kto/, orpo/)
  utils/schemas/                 # Pydantic config schemas (config, model, training, peft, trl, fsdp)
  integrations/                  # Plugins (liger, cut_cross_entropy, swanlab, nemo_gym)
  monkeypatch/                   # Runtime patches for HF transformers

examples/                        # Example YAML configs by model (llama-3/, qwen2/, mistral/, ebft/)
deepspeed_configs/               # DeepSpeed JSON configs (zero2, zero3)
docs/                            # Quarto documentation site

Code Conventions

Config-driven: features are toggled via YAML, not code changes
Prompt strategies: each `type:` value in a dataset entry maps to a function under src/axolotl/prompt_strategies/
Plugin system: the `plugins:` list in the config loads integration modules
Trainer mixins: core/trainers/mixins/ for composable trainer behaviors
Schemas: all config validation via Pydantic in utils/schemas/
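
The idea that a `type:` value maps to a function can be pictured as a name-to-callable registry. The sketch below is a generic illustration of that pattern, not Axolotl's actual lookup code: `STRATEGIES`, `register`, `format_row`, and the `alpaca_like` template are all hypothetical names invented for the example.

```python
from typing import Callable

PromptFn = Callable[[dict], str]

# Hypothetical registry; the real handlers live in src/axolotl/prompt_strategies/
# and may be discovered differently.
STRATEGIES: dict[str, PromptFn] = {}


def register(type_name: str) -> Callable[[PromptFn], PromptFn]:
    """Decorator that files a handler under the given `type:` name."""
    def decorator(fn: PromptFn) -> PromptFn:
        STRATEGIES[type_name] = fn
        return fn
    return decorator


@register("alpaca_like")
def alpaca_like(row: dict) -> str:
    # Illustrative template only, not Axolotl's alpaca format.
    return f"### Instruction:\n{row['instruction']}\n\n### Response:\n{row['output']}"


def format_row(dataset_cfg: dict, row: dict) -> str:
    """Look up the handler named by the dataset's `type:` value and apply it."""
    return STRATEGIES[dataset_cfg["type"]](row)
```

With this shape, supporting a new dataset format means adding one decorated function; the config then selects it by name and no call site changes.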

Comment Style

Default to no comment. Only add one when the WHY is non-obvious (hidden constraint, subtle invariant, workaround for a specific bug).
Don't explain WHAT the code does — names and types already do that.
Don't reference the current task, PR, or callers (e.g. "added for X", "used by Y", "fixes #123"). Those belong in commit messages / PR descriptions and rot fast.
Keep any comment to one short line.
Don't add planning/decision/analysis markdown files unless explicitly requested.
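
As a small illustration of these rules, the two hypothetical functions below show one comment that earns its place and one function that needs none. The first explains why `dict.fromkeys` is used where `set()` might look simpler; the second is left bare because its name and types already describe it.

```python
def dedupe_keep_order(items: list[str]) -> list[str]:
    # dict.fromkeys keeps first-seen order; set() does not guarantee it.
    return list(dict.fromkeys(items))


def token_count(input_ids: list[int]) -> int:
    return len(input_ids)
```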

Key Documentation

Getting Started — quickstart tutorial
Choosing a Method — SFT vs DPO vs GRPO decision guide
Config Reference — all config options
Dataset Formats — chat_template, alpaca, input_output, completion
RLHF — DPO, KTO, ORPO, GRPO, EBFT configs and dataset formats
GRPO Deep Dive — async training, custom rewards, scaling
vLLM Serving — vLLM setup for GRPO/EBFT
Multi-GPU — FSDP and DeepSpeed
Training Stability — debugging loss, NaN, OOM
Debugging — VSCode setup, Docker debugging

