Fine-Tune a 2B Open Model for Function Calling — Cut Tool-Call Costs 10x With TRL SFTTrainer

Chris Harper

2 min read

Jul 2, 2026 · 12:06 UTC

Workflow

Fine-Tuning

HuggingFace

TL;DR: Use TRL's SFTTrainer to fine-tune a 2B model for function calling in one Colab afternoon, then dedicate it to tool routing and cut API costs by about 10x.

When your agent pipeline handles structured tool routing (pick a function, fill in its arguments), a fine-tuned 2–8B open model can match proprietary APIs on that specific task at a fraction of the cost. TRL's SFTTrainer natively supports tool_calls in conversational datasets (as of v0.24), so data already in that shape needs no extra preprocessing.

The data format

Each training example needs messages with tool calls and the tool schema:

{
    "messages": [
        {"role": "user", "content": "What's the weather in Tokyo?"},
        {"role": "assistant", "tool_calls": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": "{\"city\": \"Tokyo\", \"unit\": \"C\"}"
            }
        }]}
    ],
    "tools": [weather_tool_json_schema]   # list of JSON schema objects
}
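
A quick sanity check on this format can catch malformed examples before they reach the trainer. The sketch below is illustrative only: `validate_example` is a hypothetical helper, not part of TRL, and it assumes each entry in `tools` is an OpenAI-style schema with the name under `function.name`.

```python
import json


def validate_example(example: dict) -> list[str]:
    """Return a list of problems found in one training example (empty if none)."""
    problems = []
    # Assumption: tools are OpenAI-style {"type": "function", "function": {"name": ...}}.
    declared = {tool["function"]["name"] for tool in example.get("tools", [])}
    for message in example.get("messages", []):
        for call in message.get("tool_calls") or []:
            fn = call["function"]
            if fn["name"] not in declared:
                problems.append(f"undeclared tool: {fn['name']}")
            args = fn["arguments"]
            if isinstance(args, str):
                # Arguments stored as a string must parse as JSON.
                try:
                    args = json.loads(args)
                except json.JSONDecodeError:
                    problems.append(f"arguments for {fn['name']} are not valid JSON")
                    continue
            if not isinstance(args, dict):
                problems.append(f"arguments for {fn['name']} are not a JSON object")
    return problems


# Hypothetical schema standing in for weather_tool_json_schema.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["C", "F"]},
            },
            "required": ["city"],
        },
    },
}

example = {
    "messages": [
        {"role": "user", "content": "What's the weather in Tokyo?"},
        {"role": "assistant", "tool_calls": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": json.dumps({"city": "Tokyo", "unit": "C"}),
            },
        }]},
    ],
    "tools": [weather_tool],
}

print(validate_example(example))
```

Running a check like this over the whole dataset is cheap compared with discovering a quoting error halfway through a training run.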

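A ready-made dataset: NousResearch/hermes-function-calling-v1 (~11k instruction-following examples in Hermes chat format). Check its column layout against the format above before training, since the records may need converting.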
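
If the dataset's records do not already match the messages/tool_calls layout, a small conversion step can bridge the gap. The sketch below assumes ShareGPT-style records (a `conversations` list with `from` and `value` keys) where the assistant wraps each call in `<tool_call>` tags; both are assumptions to verify against the dataset card. `to_messages` and `parse_payload` are hypothetical helpers.

```python
import ast
import json
import re

# Assumed ShareGPT role names mapped to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_payload(raw: str) -> dict:
    """Parse a tool-call payload, tolerating Python-literal quoting."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return ast.literal_eval(raw)


def to_messages(conversations: list[dict]) -> list[dict]:
    """Convert ShareGPT-style turns into chat messages with tool_calls."""
    messages = []
    for turn in conversations:
        role = ROLE_MAP.get(turn["from"], turn["from"])
        text = turn["value"]
        if role == "assistant":
            calls = []
            for raw in TOOL_CALL_RE.findall(text):
                payload = parse_payload(raw)
                calls.append({
                    "type": "function",
                    "function": {
                        "name": payload["name"],
                        "arguments": json.dumps(payload["arguments"]),
                    },
                })
            if calls:
                messages.append({"role": "assistant", "tool_calls": calls})
                continue
        messages.append({"role": role, "content": text})
    return messages


sample = [
    {"from": "human", "value": "What's the weather in Tokyo?"},
    {"from": "gpt", "value": '<tool_call>\n{"name": "get_weather", '
                             '"arguments": {"city": "Tokyo", "unit": "C"}}\n</tool_call>'},
]
converted = to_messages(sample)
```

With the Hugging Face datasets library, a function like this would typically be applied through `dataset.map`, writing the result into a `messages` column.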

Training (free Colab T4 GPU)

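import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "Qwen/Qwen3-2B-Instruct"  # confirm this ID exists on the Hub before running
dataset = load_dataset("NousResearch/hermes-function-calling-v1", split="train")

# 4-bit NF4 quantization (QLoRA) so the base model fits in the T4's memory.
# The T4 has no bfloat16 support, so compute in float16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # recent TRL releases no longer accept tokenizer=
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./qwen-fn-caller",
        hub_model_id="your-org/qwen-fn-caller",  # target repo for push_to_hub()
        num_train_epochs=2,
        per_device_train_batch_size=2,
        fp16=True,  # float16 mixed precision, since the T4 cannot run bf16
    ),
    # Train LoRA adapters on every linear layer; the quantized base stays frozen.
    peft_config=LoraConfig(
        r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM"
    ),
)
trainer.train()
# push_to_hub's first positional argument is the commit message, not the repo,
# so the repo name is set through hub_model_id above.
trainer.push_to_hub()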

Runs in ~1–2 hours on a free Colab T4.

Serve the fine-tuned adapter with vLLM

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-2B-Instruct \
  --enable-lora \
  --lora-modules fn-caller=your-org/qwen-fn-caller \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

Pair this with vLLM's strict mode (covered in today's teachable post) to guarantee schema-valid arguments from every call.
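
Once the server is up, any OpenAI-compatible client can send tool-calling requests to it. The sketch below uses only the standard library. It assumes the server listens on vLLM's default port 8000 and that the adapter is registered under the name `fn-caller`, as in the command above. `build_request`, `call_server`, and `extract_tool_calls` are hypothetical helper names.

```python
import json
import urllib.request


def build_request(user_text: str, tools: list[dict], model: str = "fn-caller") -> dict:
    """Build a chat-completions payload that lets the model choose a tool."""
    return {
        "model": model,  # LoRA adapter name from --lora-modules
        "messages": [{"role": "user", "content": user_text}],
        "tools": tools,
        "tool_choice": "auto",
    }


def call_server(payload: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST the payload to the OpenAI-compatible endpoint and return parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Pull (function name, parsed arguments) pairs out of a response."""
    message = response["choices"][0]["message"]
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in message.get("tool_calls") or []
    ]
```

A typical round trip would be `extract_tool_calls(call_server(build_request("What's the weather in Tokyo?", tools)))`, where `tools` is the same schema list used during training.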

When to reach for this pattern

Your pipeline routes >1M tool calls/month — cost break-even vs. Claude/GPT arrives fast
Your tool set is bounded and stable, not a dynamic schema that changes weekly
Per-call latency matters and a 2B local model runs faster than an API round-trip
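
The break-even point in the first criterion comes down to simple arithmetic. The sketch below shows the calculation; every price in it is a made-up placeholder, not a quote from any provider, so substitute your own API and hosting figures.

```python
import math


def break_even_calls(api_cost_per_1k_calls: float, hosting_cost_per_month: float) -> int:
    """Monthly call volume at which self-hosting costs the same as the API."""
    return math.ceil(hosting_cost_per_month * 1000 / api_cost_per_1k_calls)


def monthly_savings(calls_per_month: int, api_cost_per_1k_calls: float,
                    hosting_cost_per_month: float) -> float:
    """API spend minus the fixed hosting bill (negative means the API is cheaper)."""
    api_spend = calls_per_month / 1000 * api_cost_per_1k_calls
    return api_spend - hosting_cost_per_month


# Hypothetical numbers: $2 per 1,000 API calls, $400/month for a GPU instance.
API_COST_PER_1K = 2
HOSTING_PER_MONTH = 400

threshold = break_even_calls(API_COST_PER_1K, HOSTING_PER_MONTH)
savings = monthly_savings(1_000_000, API_COST_PER_1K, HOSTING_PER_MONTH)
print(f"break-even at {threshold:,} calls/month; savings at 1M calls: ${savings:,.0f}")
```

The model treats hosting as a flat monthly fee and ignores engineering time and retraining, so it is a lower bound on the real break-even volume.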

The HuggingFace agents-course bonus unit (linked below) includes the full walkthrough, including evaluation with a held-out test set to verify the fine-tune actually works.
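
For a rough idea of what a held-out evaluation can look like, the sketch below scores predictions by exact match on function name and arguments. This is one simple metric chosen for illustration, not necessarily the method used in the course; `make_call`, `normalize`, and `tool_call_accuracy` are hypothetical names.

```python
import json


def make_call(name: str, arguments: dict) -> dict:
    """Build a tool call in the same shape as the training data."""
    return {"type": "function",
            "function": {"name": name, "arguments": json.dumps(arguments)}}


def normalize(call: dict) -> tuple[str, str]:
    """Reduce a tool call to (name, canonical JSON arguments) for comparison."""
    fn = call["function"]
    args = fn["arguments"]
    if isinstance(args, str):
        args = json.loads(args)
    # sort_keys makes the comparison independent of argument order.
    return fn["name"], json.dumps(args, sort_keys=True)


def tool_call_accuracy(predicted: list[list[dict]], expected: list[list[dict]]) -> float:
    """Fraction of examples whose predicted calls exactly match the expected calls."""
    if not expected:
        return 0.0
    hits = sum(
        sorted(map(normalize, pred)) == sorted(map(normalize, gold))
        for pred, gold in zip(predicted, expected)
    )
    return hits / len(expected)


expected = [
    [make_call("get_weather", {"city": "Tokyo", "unit": "C"})],
    [make_call("get_time", {"tz": "UTC"})],
]
predicted = [
    [make_call("get_weather", {"unit": "C", "city": "Tokyo"})],  # same call, different key order
    [make_call("get_weather", {"city": "UTC"})],                 # wrong function
]
accuracy = tool_call_accuracy(predicted, expected)
```

Exact match is strict: a prediction with one wrong argument value counts as a miss, which suits routing tasks where a partially correct call is still a failed call.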

Sources: HuggingFace agents-course: Fine-Tune for Function Calling | TRL SFTTrainer docs | hermes-function-calling-v1 dataset
