
Trace Every LLM Call and Catch Regressions Early: Evaluation and Observability with Langfuse
Chris Harper
3 min read
Jun 29, 2026 · 12:07 UTC
TL;DR: Langfuse is an open-source LLM observability platform. Instrument a Python or TypeScript app in minutes to get traces, costs, quality scores, and dataset-based regression testing.
What you'll be able to do after this:
- Instrument any LLM app to capture every prompt, response, token cost, and latency in a structured trace dashboard
- Score individual outputs (manually or via LLM-as-judge) and track quality over time
- Spot regressions before users do by running your test dataset against a new model or prompt version
Why observability matters for LLM apps
With a REST API, a failed request has a status code, a latency, and a payload. With an LLM app, a "successful" request can still hallucinate, drift off-topic, or quietly regress after a prompt tweak. Langfuse makes the invisible visible: every call becomes a trace with the full prompt, the raw response, token counts, latency, and cost — all searchable and filterable in one dashboard.
Step 1: Install and connect
pip install langfuse
Set your credentials (get them from cloud.langfuse.com or your self-hosted instance):
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com"
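Before moving on, it can help to confirm the three variables are actually visible to your Python process. The following is a minimal sketch using only the standard library. The helper name `check_langfuse_env` is hypothetical (it is not part of the Langfuse SDK), and the prefix check simply assumes real keys look like the `sk-lf-...` and `pk-lf-...` placeholders above.

```python
import os

# Expected variables and the prefix each value should start with.
# The prefixes are an assumption taken from the placeholder exports above.
EXPECTED = {
    "LANGFUSE_SECRET_KEY": "sk-lf-",
    "LANGFUSE_PUBLIC_KEY": "pk-lf-",
    "LANGFUSE_HOST": "http",
}


def check_langfuse_env(env=None):
    """Return a list of human-readable problems; an empty list means all good."""
    env = os.environ if env is None else env
    problems = []
    for name, prefix in EXPECTED.items():
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif not value.startswith(prefix):
            problems.append(f"{name} does not start with {prefix!r}")
    return problems
```

Calling `check_langfuse_env()` with no arguments inspects the current process environment; passing a dict makes the check easy to unit test.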
Step 2: Instrument with one import
The drop-in replacement wraps your existing OpenAI calls without changing any logic:
from langfuse.openai import openai  # drop-in for "import openai"

response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain RAG in one sentence"}],
)
Every call now appears as a trace in your dashboard with prompt, response, tokens, cost, and latency. No decorator required.
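To build intuition for what a drop-in wrapper does, here is a simplified sketch of the general pattern: call the original function, time it, and record the input and output. This is not Langfuse's implementation. The names `traced`, `TRACE_LOG`, and `fake_completion` are hypothetical, and the fake function stands in for a real OpenAI call so the example runs offline.

```python
import functools
import time

# In-memory stand-in for a tracing backend (hypothetical; a real
# integration would ship these records to a server instead).
TRACE_LOG = []


def traced(fn):
    """Wrap fn so each call appends a record with input, output, and latency."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE_LOG.append({
            "name": fn.__name__,
            "input": {"args": args, "kwargs": kwargs},
            "output": result,
            "latency_s": time.perf_counter() - start,
        })
        return result
    return wrapper


@traced
def fake_completion(model, messages):
    """Offline stand-in for a chat completion call (hypothetical)."""
    return {"model": model, "content": "stub answer", "total_tokens": 12}


fake_completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain RAG in one sentence"}],
)
```

The calling code is unchanged apart from the wrapper, which is the property that makes the one-import approach attractive.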
Step 3: Group calls into traces with @observe
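For multi-step pipelines (retrieve → rerank → generate), decorate each function. The import below is the Python SDK v2 path; in SDK v3 the decorator is imported from the top-level package (from langfuse import observe), so match the import to your installed version: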
from langfuse.decorators import observe, langfuse_context
from langfuse.openai import openai  # traced OpenAI client from Step 2

@observe()
def retrieve(query: str) -> list[str]:
    # ... vector search ...
    chunks: list[str] = []  # placeholder: replace with your vector search results
    return chunks

@observe()
def generate(query: str, context: list[str]) -> str:
    # The retrieved chunks are passed to the model as the system message
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "\n".join(context)},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

@observe()
def rag_pipeline(query: str) -> str:
    chunks = retrieve(query)
    return generate(query, chunks)

rag_pipeline("What is PagedAttention?")
The dashboard shows a tree with rag_pipeline at the root and retrieve and generate as its child steps (the OpenAI call nests under generate), with timing and cost at each level.
Step 4: Score outputs for quality tracking
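After a user rates an answer (or you run an automated eval), attach a score to the trace. The call below uses the v2 SDK's langfuse.score; in SDK v3 the equivalent method is named create_score: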
from langfuse import Langfuse

langfuse = Langfuse()

langfuse.score(
    trace_id="<trace-id>",  # from langfuse_context.get_current_trace_id()
    name="answer_quality",
    value=0.9,  # 0.0 – 1.0
    comment="accurate but slightly verbose",
)
Quality scores show up alongside traces and can be filtered/aggregated — so you can see if a prompt change moved average quality up or down.
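The same comparison can be done offline once scores are exported. The sketch below assumes a hypothetical list of (prompt version, score) pairs and a hypothetical tolerance of 0.05; the data is made up for illustration and none of the names come from the Langfuse SDK. It averages scores per version and flags a drop larger than the tolerance.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exported scores: (prompt_version, answer_quality in 0.0-1.0).
scores = [
    ("v1", 0.9), ("v1", 0.8), ("v1", 0.7),
    ("v2", 0.6), ("v2", 0.7), ("v2", 0.5),
]


def average_by_version(rows):
    """Group scores by prompt version and return the mean of each group."""
    grouped = defaultdict(list)
    for version, value in rows:
        grouped[version].append(value)
    return {version: mean(values) for version, values in grouped.items()}


def is_regression(baseline, candidate, tolerance=0.05):
    """True if the candidate's average dropped by more than the tolerance."""
    return (baseline - candidate) > tolerance


averages = average_by_version(scores)
print(averages)
print("regression:", is_regression(averages["v1"], averages["v2"]))
```

A check like this can run in CI against a fixed test dataset, so a prompt or model change that lowers average quality fails the build rather than reaching users.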
Step 5: Self-host for free (optional)
git clone https://github.com/langfuse/langfuse
cd langfuse
docker compose up
Langfuse v3 runs six containers (web, worker, Postgres, ClickHouse, Redis, MinIO) and is free to self-host forever. Point LANGFUSE_HOST at http://localhost:3000.
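Once the containers are up, a quick reachability check saves some guesswork before you send traces. The sketch below uses only the standard library. The `/api/public/health` path is an assumption about the self-hosted web container, so confirm it against the Langfuse docs for your version; `health_url` and `is_up` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def health_url(host):
    """Build the health-check URL for a Langfuse host (path is an assumption)."""
    return host.rstrip("/") + "/api/public/health"


def is_up(host, timeout=3.0):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(host), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False
```

For a local deployment, `is_up("http://localhost:3000")` should return True once the web container has finished starting.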
Watch the 10-minute walkthrough
The "10 min Walkthrough of Langfuse" video on YouTube covers the full UI: trace view, session grouping, scores, prompt management, and dataset experiments. It is the best starting point before reading the docs.
Sources: Langfuse get-started docs | Langfuse GitHub | 10-min YouTube walkthrough | Towards Data Science: Hands-on with Langfuse