
Fine-Tune a 2B Open Model for Function Calling — Cut Tool-Call Costs 10x With TRL SFTTrainer
Chris Harper
Jul 2, 2026 · 12:06 UTC
TL;DR: Use TRL's SFTTrainer to fine-tune a 2B model for function calling in one Colab afternoon — dedicate it to tool routing and cut API costs by 10x.
When your agent pipeline handles structured tool routing — pick a function, fill in args — a fine-tuned 2–8B open model matches proprietary APIs on that specific task at a fraction of the cost. TRL's SFTTrainer has native tool_calls column support (as of v0.24), so there's no preprocessing ceremony.
The data format
Each training example needs messages with tool calls and the tool schema:
{
  "messages": [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {"role": "assistant", "tool_calls": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "arguments": "{\"city\": \"Tokyo\", \"unit\": \"C\"}"
      }
    }]}
  ],
  "tools": [weather_tool_json_schema]  # list of JSON schema objects
}
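Malformed argument strings are easy to miss in a large training set, so a quick check over each record can help before training. The sketch below is a hypothetical helper, not part of TRL: `validate_example` and the `weather_tool` schema are illustrative names, and it assumes each tool entry follows the OpenAI-style `{"type": "function", "function": {...}}` layout.

```python
import json


def validate_example(example: dict) -> list:
    """Return a list of problems found in one training example (empty if clean)."""
    problems = []
    # Assumes each tool is an OpenAI-style {"type": "function", "function": {...}} entry.
    declared = {tool["function"]["name"] for tool in example.get("tools", [])}
    for message in example.get("messages", []):
        for call in message.get("tool_calls") or []:
            name = call["function"]["name"]
            args = call["function"]["arguments"]
            if name not in declared:
                problems.append(f"{name}: not declared in tools")
            if isinstance(args, str):
                # Arguments stored as a string must decode as JSON.
                try:
                    args = json.loads(args)
                except json.JSONDecodeError:
                    problems.append(f"{name}: arguments are not valid JSON")
                    continue
            if not isinstance(args, dict):
                problems.append(f"{name}: arguments must decode to an object")
    return problems


# Hypothetical schema standing in for weather_tool_json_schema.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}, "unit": {"type": "string"}},
            "required": ["city"],
        },
    },
}

example = {
    "messages": [
        {"role": "user", "content": "What's the weather in Tokyo?"},
        {"role": "assistant", "tool_calls": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                # json.dumps handles the quote escaping for us.
                "arguments": json.dumps({"city": "Tokyo", "unit": "C"}),
            },
        }]},
    ],
    "tools": [weather_tool],
}

print(validate_example(example))  # prints []
```

Building the `arguments` string with `json.dumps` rather than by hand avoids the quote-escaping mistakes that this kind of check is meant to catch.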
A ready-made dataset: NousResearch/hermes-function-calling-v1 (~11k instruction-following examples in Hermes chat format).
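It is worth inspecting one row (for example `dataset[0]`) before training, because a Hermes-format dataset may not use the `messages` layout shown above. The sketch below assumes, without having confirmed it against the dataset card, that rows carry a ShareGPT-style `conversations` list with `from` and `value` keys. `ROLE_MAP`, `to_messages`, and the sample row are hypothetical.

```python
# Assumed ShareGPT-style speaker labels mapped to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}


def to_messages(row: dict) -> dict:
    """Convert one row with a `conversations` list into a `messages` list."""
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in row["conversations"]
        ]
    }


# Made-up row in the assumed layout, for illustration only.
sample_row = {
    "conversations": [
        {"from": "system", "value": "You are a function calling AI model."},
        {"from": "human", "value": "What's the weather in Tokyo?"},
        {"from": "gpt", "value": "<tool_call>{\"name\": \"get_weather\"}</tool_call>"},
    ]
}

converted = to_messages(sample_row)
print([m["role"] for m in converted["messages"]])
```

If the assumption holds, the same function could be applied to the whole dataset with `dataset.map(to_messages, remove_columns=["conversations"])`.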
Training (free Colab T4 GPU)
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "Qwen/Qwen3-2B-Instruct"
dataset = load_dataset("NousResearch/hermes-function-calling-v1", split="train")

# 4-bit NF4 quantization; the T4 has no bfloat16 support, so compute in float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # recent TRL releases take the tokenizer here, not as tokenizer=
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./qwen-fn-caller",
        hub_model_id="your-org/qwen-fn-caller",  # target repo for push_to_hub()
        num_train_epochs=2,
        per_device_train_batch_size=2,
        fp16=True,  # float16 mixed precision, since the T4 cannot run bf16
    ),
    peft_config=LoraConfig(
        r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM"
    ),
)
trainer.train()
# push_to_hub() takes a commit message as its first argument; the repo comes from hub_model_id
trainer.push_to_hub()
Runs in ~1–2 hours on a free Colab T4.
Serve the fine-tuned adapter with vLLM
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-2B-Instruct \
  --enable-lora \
  --lora-modules fn-caller=your-org/qwen-fn-caller \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
Pair this with vLLM's strict mode (covered in today's teachable post) to guarantee schema-valid arguments from every call.
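Once the server is up, the adapter is addressed by the name given in `--lora-modules` (`fn-caller`). The sketch below shows what a client request might look like using only the standard library. The URL assumes vLLM's default port on localhost, and `build_request`, `call_server`, and the weather schema are hypothetical names for illustration.

```python
import json
import urllib.request


def build_request(user_text: str, tools: list, model: str = "fn-caller") -> dict:
    """Build an OpenAI-compatible chat completion payload that offers tools."""
    return {
        "model": model,  # the LoRA name registered via --lora-modules
        "messages": [{"role": "user", "content": user_text}],
        "tools": tools,
        "tool_choice": "auto",
    }


def call_server(payload: dict,
                url: str = "http://localhost:8000/v1/chat/completions") -> list:
    """POST the payload and return the tool calls from the first choice.

    The URL is an assumption (default local vLLM server); adjust as needed.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"].get("tool_calls") or []


# Hypothetical tool schema for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = build_request("What's the weather in Tokyo?", tools)
# tool_calls = call_server(payload)  # uncomment once the server is running
```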
When to reach for this pattern
- Your pipeline routes >1M tool calls/month — cost break-even vs. Claude/GPT arrives fast
- Your tool set is bounded and stable, not a dynamic schema that changes weekly
- Per-call latency matters and a 2B local model runs faster than an API round-trip
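The break-even point in the first bullet is simple arithmetic: fixed monthly hosting cost divided by the API cost of a single call. The sketch below works it through with made-up numbers. The token count, API price, and GPU cost are placeholders, not quotes from any provider, so substitute your own.

```python
def api_cost_per_month(calls: int, tokens_per_call: int,
                       usd_per_million_tokens: float) -> float:
    """Monthly API spend for a given call volume."""
    return calls * tokens_per_call / 1_000_000 * usd_per_million_tokens


def break_even_calls(gpu_usd_per_month: float, tokens_per_call: int,
                     usd_per_million_tokens: float) -> float:
    """Monthly call volume at which self-hosting costs the same as the API."""
    usd_per_call = tokens_per_call * usd_per_million_tokens / 1_000_000
    return gpu_usd_per_month / usd_per_call


# Hypothetical inputs: 500 tokens per call, $3 per million tokens, $400/month GPU.
api_spend = api_cost_per_month(1_000_000, 500, 3.0)
threshold = break_even_calls(400.0, 500, 3.0)

print(f"API spend at 1M calls/month: ${api_spend:,.0f}")
print(f"Break-even volume: {threshold:,.0f} calls/month")
```

Under these placeholder numbers the API bill at 1M calls is $1,500 against a $400 fixed cost, so self-hosting would break even at roughly 267k calls per month. The comparison ignores engineering and maintenance time, which can dominate at low volume.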
The HuggingFace agents-course bonus unit (linked below) includes the full walkthrough, including evaluation with a held-out test set to verify the fine-tune actually works.
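For a rough idea of what a held-out evaluation could measure, the sketch below scores exact-match accuracy on function name and arguments. It is a minimal illustration, not the course's evaluation code. `normalize_call` and `tool_call_accuracy` are hypothetical helpers, and the predictions are hand-written stand-ins for model output.

```python
import json


def normalize_call(call: dict) -> tuple:
    """Reduce a tool call to (name, canonical JSON arguments) for comparison."""
    function = call["function"]
    args = function["arguments"]
    if isinstance(args, str):
        args = json.loads(args)
    # sort_keys makes the comparison independent of argument order.
    return function["name"], json.dumps(args, sort_keys=True)


def tool_call_accuracy(predictions: list, references: list) -> float:
    """Fraction of examples whose predicted calls exactly match the reference."""
    if not references:
        return 0.0
    hits = 0
    for predicted, expected in zip(predictions, references):
        try:
            match = ([normalize_call(c) for c in predicted]
                     == [normalize_call(c) for c in expected])
        except (KeyError, TypeError, json.JSONDecodeError):
            match = False  # malformed calls count as misses
        hits += int(match)
    return hits / len(references)


def make_call(name: str, arguments: str) -> dict:
    """Small constructor for the illustrative data below."""
    return {"type": "function", "function": {"name": name, "arguments": arguments}}


references = [
    [make_call("get_weather", '{"city": "Tokyo", "unit": "C"}')],
    [make_call("get_weather", '{"city": "Paris", "unit": "C"}')],
]
predictions = [
    [make_call("get_weather", '{"unit": "C", "city": "Tokyo"}')],  # same args, new order
    [make_call("get_time", '{"city": "Paris"}')],                  # wrong function
]

print(tool_call_accuracy(predictions, references))  # prints 0.5
```

Exact match is a strict metric. It treats a correct call with a differently formatted but equivalent value (say, "celsius" versus "C") as a miss, so it works best when the tool schema constrains arguments tightly.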
Sources: HuggingFace agents-course: Fine-Tune for Function Calling | TRL SFTTrainer docs | hermes-function-calling-v1 dataset