Fine-Tuning & Serving: Fine-Tuning vs RAG

Fine-Tuning & Serving · Part 1 of 3

The Fine-Tuning & Serving course · 3 parts

Change the model itself: when to fine-tune vs. retrieve, how LoRA/QLoRA work, and how to serve the result behind an OpenAI-compatible API.

01 Fine-Tuning vs RAG (this part)
02 LoRA & QLoRA on One GPU
03 Serve a Model with vLLM

Fine-Tuning vs RAG

TL;DR: RAG changes what the model knows at query time; fine-tuning changes how the model behaves by adjusting its weights. Reach for retrieval first — it's cheaper, updatable, and citable. Fine-tune when you need a behavior, format, or style that no prompt reliably produces.

The whole RAG track built one half of applied AI engineering: getting the right knowledge in front of the model. This tutorial is the fork in the road to the other half: changing the model itself. Choosing between the two is one of the most common architecture decisions you'll make, and one that teams often get wrong: they fine-tune to "add knowledge," which is the one thing fine-tuning is bad at.

This one is conceptual — no GPU, no notebook. It's the map for the two tutorials that follow.

The one-sentence rule

Retrieve for knowledge. Fine-tune for behavior.

Knowledge = facts, documents, things that change. "Who is our Q3 enterprise contact?" "What does our refund policy say?" This belongs in retrieval: it updates the instant the source does, and the model can cite it.
Behavior = format, tone, structure, or a skill the base model performs inconsistently. "Always answer as a strict JSON object." "Write in our terse house voice." "Classify support tickets into our 9 internal categories." This is what fine-tuning bakes in.

Why not just fine-tune in the knowledge?

Because it fails at exactly the things retrieval is good at:

| | RAG (retrieval) | Fine-tuning |
|---|---|---|
| Update a fact | Edit the source; live immediately | Retrain to change anything |
| Cite a source | Yes: you have the passage | No: knowledge is diffuse in weights |
| New/changing data | Ideal | Stale the moment data moves |
| Consistent format/voice | Fragile (prompt-dependent) | Ideal: it's learned |
| Cost to set up | Low (an afternoon) | GPU time + a labeled dataset |
| Risk | Retrieve wrong passage | Hallucinate confidently; forget skills |

Fine-tuning teaches a pattern, not a fact. Train it on 1,000 Q&A pairs about your product and it learns to sound like product Q&A — it will happily invent a confident, fluent, wrong answer about a feature that shipped last week. That's not a bug you can prompt away.

NOTE

They compose — and the best systems use both. Fine-tune a model to follow your format and use retrieved context well, then feed it fresh facts via RAG at query time. The model brings the behavior; retrieval brings the knowledge. "RAG vs fine-tuning" is usually really "RAG and fine-tuning."
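How the two halves compose at query time can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the `Passage` type, the keyword-overlap `retrieve` function, the prompt wording, and the sample corpus are all hypothetical, and `generate` stands in for whatever call reaches your (possibly fine-tuned) model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    source: str  # where the text came from, so the answer can cite it
    text: str


def retrieve(query: str, corpus: List[Passage], k: int = 2) -> List[Passage]:
    """Toy retriever: rank passages by word overlap with the query.

    Assumption: a real system would use a vector index instead.
    """
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda p: len(query_words & set(p.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, passages: List[Passage]) -> str:
    """Put fresh facts in the prompt; the model supplies format and tone."""
    context = "\n".join(
        f"[{i + 1}] ({p.source}) {p.text}" for i, p in enumerate(passages)
    )
    return (
        "Answer using only the context. Cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


def answer(
    query: str,
    corpus: List[Passage],
    generate: Callable[[str], str],
    k: int = 2,
) -> str:
    """Retrieval brings the knowledge; `generate` brings the behavior."""
    return generate(build_prompt(query, retrieve(query, corpus, k)))


if __name__ == "__main__":
    # Made-up corpus for illustration only.
    corpus = [
        Passage("crm.md", "Enterprise contacts are listed in the CRM."),
        Passage("refund-policy.md", "Our refund policy allows returns within 30 days."),
    ]
    # Echo the prompt instead of calling a model.
    print(answer("What does our refund policy say?", corpus, generate=lambda p: p, k=1))
```

Updating a fact here means editing `corpus`; changing how answers are phrased or structured is the part a fine-tuned model behind `generate` would own.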

A decision checklist

Ask, in order:

1. Can a better prompt do it? Few-shot examples and a clear output spec solve a surprising amount. Always try this first: it's free and instant.
2. Is the gap knowledge? → RAG (the whole track you just finished).
3. Is the gap behavior a prompt can't reliably hit, such as a strict format, a domain tone, or a narrow classification skill? → Fine-tune.
4. Both? Fine-tune the behavior, retrieve the knowledge.
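The checklist reads naturally as a small decision function. The sketch below is only a restatement of the questions above in code; the `Approach` enum, the function name, and the three boolean inputs are hypothetical, and in practice each input is a judgment you reach by testing, not a flag you already have.

```python
from enum import Enum


class Approach(Enum):
    PROMPT = "improve the prompt"
    RAG = "retrieve the knowledge"
    FINE_TUNE = "fine-tune the behavior"
    BOTH = "fine-tune the behavior, retrieve the knowledge"


def choose_approach(
    prompt_suffices: bool, knowledge_gap: bool, behavior_gap: bool
) -> Approach:
    """Walk the checklist in order, cheapest option first."""
    if prompt_suffices:
        return Approach.PROMPT
    if knowledge_gap and behavior_gap:
        return Approach.BOTH
    if knowledge_gap:
        return Approach.RAG
    if behavior_gap:
        return Approach.FINE_TUNE
    # No gap identified yet: keep iterating on the prompt (an assumption
    # of this sketch, in keeping with "try the prompt first").
    return Approach.PROMPT


if __name__ == "__main__":
    # A ticket classifier that few-shot prompting can't make consistent.
    print(choose_approach(prompt_suffices=False, knowledge_gap=False, behavior_gap=True))
```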

TIP

Order matters. Prompt → RAG → fine-tune is also the order of increasing cost and decreasing flexibility. Don't fine-tune away a problem a system prompt and three examples would have fixed. Most "we need to fine-tune" instincts are solved one rung lower.
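One way to check whether a prompt "reliably" hits a format is to measure it before reaching for a heavier tool. The sketch below assumes the target behavior is "always answer as a JSON object" and that you have already collected a batch of raw model outputs as strings; the function names and sample outputs are hypothetical.

```python
import json
from typing import List


def is_json_object(text: str) -> bool:
    """True only if the whole output parses as a single JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def format_compliance(outputs: List[str]) -> float:
    """Fraction of outputs that satisfy the format requirement."""
    if not outputs:
        raise ValueError("need at least one output to measure")
    return sum(is_json_object(o) for o in outputs) / len(outputs)


if __name__ == "__main__":
    # Made-up outputs: two compliant, one with chatty preamble, one JSON array.
    sample_outputs = [
        '{"category": "billing"}',
        'Sure! {"category": "billing"}',
        "[1, 2]",
        '{"category": "login"}',
    ]
    print(format_compliance(sample_outputs))
```

Running the same measurement on a prompt-only baseline and on any later fine-tuned model gives a like-for-like comparison, which is more trustworthy than eyeballing a handful of responses.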

What fine-tuning actually changes

Modern fine-tuning rarely touches all the weights. Parameter-efficient fine-tuning (PEFT), usually LoRA, freezes the base model and trains a tiny set of new adapter weights instead. That's what makes it feasible on one GPU, and it's the subject of the next tutorial. The official starting point is Hugging Face's PEFT quicktour. For a sense of scale, a LoRA adapter for a 350M-parameter model is just a few megabytes, versus roughly 700 MB for the full weights at 16-bit precision.
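The "few megabytes" figure can be sanity-checked with back-of-the-envelope arithmetic. LoRA adds two low-rank matrices next to each adapted weight matrix, so the extra parameters per matrix are rank × (d_in + d_out). The dimensions below (hidden size 1024, 24 layers, two adapted projections per layer, rank 16, 32-bit adapter weights) are illustrative assumptions for a model of roughly this size, not values taken from this tutorial.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Extra parameters LoRA adds to one weight matrix.

    Two low-rank factors: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


# Assumed, illustrative configuration.
hidden_size = 1024        # width of each adapted projection
num_layers = 24           # transformer blocks
adapted_per_layer = 2     # e.g. two attention projections per block
rank = 16                 # LoRA rank
bytes_per_param = 4       # 32-bit floats

per_matrix = lora_param_count(hidden_size, hidden_size, rank)
adapter_params = per_matrix * adapted_per_layer * num_layers
adapter_mb = adapter_params * bytes_per_param / 1e6

# Full matrix the adapter sits beside, for comparison.
full_matrix = hidden_size * hidden_size

print(f"per adapted matrix: {per_matrix:,} vs {full_matrix:,} frozen")
print(f"adapter total: {adapter_params:,} params, about {adapter_mb:.2f} MB")
```

Under these assumptions the adapter comes to about 1.6 million trainable parameters, or roughly 6.3 MB, which is why an adapter can be trained and shipped separately from the frozen base model.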

Where this goes next

LoRA & QLoRA: Fine-Tune on One GPU — how parameter-efficient fine-tuning works, run in a free Colab notebook.
Serve Your Fine-Tuned Model with vLLM — take the result and put it behind an OpenAI-compatible API.

The rule to carry forward: retrieve for knowledge, fine-tune for behavior — and measure before you trust either.