Trace Every LLM Call and Catch Regressions Early: Evaluation and Observability with Langfuse

Chris Harper

Jun 29, 2026 · 12:07 UTC

TL;DR: Langfuse is an open-source LLM observability platform. Instrument a Python or TypeScript app in minutes to get traces, costs, quality scores, and dataset-based regression testing.

What you'll be able to do after this:

Instrument any LLM app to capture every prompt, response, token cost, and latency in a structured trace dashboard
Score individual outputs (manually or via LLM-as-judge) and track quality over time
Spot regressions before users do by running your test dataset against a new model or prompt version

Why observability matters for LLM apps

With a REST API, a failed request has a status code, a latency, and a payload. With an LLM app, a "successful" request can still hallucinate, drift off-topic, or quietly regress after a prompt tweak. Langfuse makes the invisible visible: every call becomes a trace with the full prompt, the raw response, token counts, latency, and cost — all searchable and filterable in one dashboard.

Step 1: Install and connect

pip install langfuse openai   # the openai package is needed for the drop-in wrapper in Step 2

Set your credentials (get them from cloud.langfuse.com or your self-hosted instance):

export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com"
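A missing or empty variable is an easy thing to overlook, so it can help to check the environment before the app starts. The following is a minimal sketch, not part of the Langfuse SDK; the helper name missing_langfuse_vars is hypothetical and it only inspects the three variables listed above.

```python
import os

# The three settings exported in Step 1.
REQUIRED_VARS = ("LANGFUSE_SECRET_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_HOST")


def missing_langfuse_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_langfuse_vars() at startup and stopping when the list is non-empty gives a clearer error than a failed upload later on.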

Step 2: Instrument with one import

The drop-in replacement wraps your existing OpenAI calls without changing any logic:

from langfuse.openai import openai   # drop-in for "import openai"

response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain RAG in one sentence"}]
)

Every call now appears as a trace in your dashboard with prompt, response, tokens, cost, and latency. No decorator required.
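The cost figure on a trace is derived from token counts and per-token prices. As a rough illustration of that arithmetic (a sketch, not Langfuse's own code), the function below takes prices per million tokens as arguments. The function name is hypothetical, and any prices you pass in are your own assumption rather than values taken from this article.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Estimate the cost of one call from token counts and per-million-token prices."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000
```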

Step 3: Group calls into traces with @observe

For multi-step pipelines (retrieve → rerank → generate), decorate each function:

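# Import path for Python SDK v2; SDK v3 and later expose this as "from langfuse import observe".
from langfuse.decorators import observe, langfuse_context
from langfuse.openai import openai   # traced client from Step 2, used in generate()

@observe()
def retrieve(query: str) -> list[str]:
    # Placeholder for your vector search; replace with a real retriever.
    chunks = [f"(retrieved passage for: {query})"]
    return chunks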

@observe()
def generate(query: str, context: list[str]) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "\n".join(context)},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content

@observe()
def rag_pipeline(query: str) -> str:
    chunks = retrieve(query)
    return generate(query, chunks)

rag_pipeline("What is PagedAttention?")

The dashboard shows a tree: rag_pipeline → retrieve → generate, with timing and cost at each level.
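To see why nested decorated calls end up as a tree, it can help to look at a toy version of the idea. The sketch below is not how Langfuse implements @observe. It is a simplified stand-in (toy_observe and render are hypothetical names) that tracks the currently running span in a context variable and attaches each new span to its caller.

```python
import contextvars
import functools
import time

# Holds the span that is currently running, so nested calls can find their parent.
_current_span = contextvars.ContextVar("current_span", default=None)


def toy_observe(func):
    """Record each call as a span nested under whichever span called it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        parent = _current_span.get()
        span = {"name": func.__name__, "children": [], "seconds": None}
        if parent is not None:
            parent["children"].append(span)
        token = _current_span.set(span)
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            span["seconds"] = time.perf_counter() - start
            _current_span.reset(token)
            if parent is None:
                wrapper.last_trace = span  # root span: keep it for inspection
    wrapper.last_trace = None
    return wrapper


def render(span, depth=0):
    """Return the span tree as a list of indented lines."""
    lines = ["  " * depth + span["name"]]
    for child in span["children"]:
        lines.extend(render(child, depth + 1))
    return lines


@toy_observe
def retrieve(query):
    return ["chunk about " + query]


@toy_observe
def generate(query, context):
    return f"answer to {query!r} using {len(context)} chunk(s)"


@toy_observe
def rag_pipeline(query):
    return generate(query, retrieve(query))


rag_pipeline("What is PagedAttention?")
print("\n".join(render(rag_pipeline.last_trace)))
```

The outermost call becomes the root, and every decorated function it calls is recorded as a child, which mirrors the rag_pipeline → retrieve → generate tree described above.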

Step 4: Score outputs for quality tracking

After a user rates an answer (or you run an automated eval), attach a score to the trace:

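from langfuse import Langfuse
langfuse = Langfuse()   # reads the LANGFUSE_* variables from Step 1

# score() is the Python SDK v2 method; SDK v3 and later name it create_score().
langfuse.score(
    trace_id="<trace-id>",   # from langfuse_context.get_current_trace_id()
    name="answer_quality",
    value=0.9,               # 0.0 – 1.0
    comment="accurate but slightly verbose"
)
langfuse.flush()   # send queued events before a short-lived script exits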
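User feedback rarely arrives as a number between 0.0 and 1.0, so it usually needs converting before it is attached as a score. The helpers below are one possible mapping, shown as a sketch. The function names and the 5-star default are assumptions, not part of Langfuse.

```python
def stars_to_score(stars: int, max_stars: int = 5) -> float:
    """Map a 1..max_stars rating onto 0.0..1.0 (1 star -> 0.0, max_stars -> 1.0)."""
    if max_stars < 2:
        raise ValueError("max_stars must be at least 2")
    if not 1 <= stars <= max_stars:
        raise ValueError(f"stars must be between 1 and {max_stars}")
    return (stars - 1) / (max_stars - 1)


def thumbs_to_score(thumbs_up: bool) -> float:
    """Map a thumbs-up/down click onto 1.0 or 0.0."""
    return 1.0 if thumbs_up else 0.0
```

The result can be passed as the value of a score, which keeps every feedback source on the same scale when scores are averaged.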

Quality scores show up alongside traces and can be filtered/aggregated — so you can see if a prompt change moved average quality up or down.
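The same comparison can be made in code once scores are exported or collected. The sketch below assumes the scores are already available as (version, value) pairs. The function names, the version labels, the sample values, and the 0.05 tolerance are all illustrative assumptions.

```python
from statistics import mean


def mean_by_version(scores):
    """Group (version, value) pairs and return {version: mean value}."""
    grouped = {}
    for version, value in scores:
        grouped.setdefault(version, []).append(value)
    return {version: mean(values) for version, values in grouped.items()}


def is_regression(baseline_mean, candidate_mean, tolerance=0.05):
    """True when the candidate's mean falls more than `tolerance` below the baseline."""
    return candidate_mean < baseline_mean - tolerance


# Made-up sample data for illustration only.
scores = [("prompt-v1", 0.9), ("prompt-v1", 0.8),
          ("prompt-v2", 0.6), ("prompt-v2", 0.7)]
means = mean_by_version(scores)
print(means, is_regression(means["prompt-v1"], means["prompt-v2"]))
```

A tolerance keeps small, noisy differences from being flagged. How large it should be depends on how many scored examples each version has.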

Step 5: Self-host for free (optional)

git clone https://github.com/langfuse/langfuse
cd langfuse
docker compose up

Langfuse v3 runs six containers (web, worker, Postgres, ClickHouse, Redis, MinIO) and is free to self-host forever. Point LANGFUSE_HOST at http://localhost:3000.

Watch the 10-minute walkthrough

The 10-minute Langfuse walkthrough on YouTube covers the full UI: trace view, session grouping, scores, prompt management, and dataset experiments. It is the best starting point before reading the docs.

Sources: Langfuse get-started docs | Langfuse GitHub | 10-min YouTube walkthrough | Towards Data Science: Hands-on with Langfuse
