Make evaluation a repeatable loop, not a vibe check

Chris Harper

1 min read

Jun 5, 2026

Best Practices

LLM

An evaluation-driven workflow (Define, Test, Diagnose, Fix) turns stochastic LLM output into an engineering loop, anchored by a "Minimum Viable Evaluation Suite" tiered for plain apps, RAG, and agentic tool use. The counterintuitive finding: a "better" prompt can make results worse, and without an eval set nothing catches the regression.

Sources: arXiv: When "Better" Prompts Hurt · Empirical study of prompting techniques for SE tasks
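
To make the Define, Test, Diagnose, Fix loop concrete, here is a minimal sketch in Python. It is not taken from the cited papers: the names `EvalCase`, `run_suite`, and `regressions` are hypothetical, and the two "model" functions are hard-coded stand-ins for real LLM calls so the example stays deterministic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    """Define: one named expectation about the model's output."""
    name: str
    user_input: str
    check: Callable[[str], bool]  # True when the output is acceptable


def run_suite(generate: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Test: run every case and record pass/fail by case name."""
    return {case.name: case.check(generate(case.user_input)) for case in cases}


def regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Diagnose: cases that passed under the baseline but fail under the candidate."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )


# Hypothetical stand-ins for an LLM called with two different prompts.
def old_prompt_model(user_input: str) -> str:
    return "Answer: 4"


def new_prompt_model(user_input: str) -> str:
    # The "better" prompt asks for reasoning, which changes format and length.
    return "Let me think step by step. The result is 4."


cases = [
    EvalCase("contains_answer", "What is 2 + 2?", lambda out: "4" in out),
    EvalCase("format", "What is 2 + 2?", lambda out: out.startswith("Answer:")),
    EvalCase("concise", "What is 2 + 2?", lambda out: len(out) <= 20),
    EvalCase("shows_reasoning", "What is 2 + 2?", lambda out: "step" in out.lower()),
]

baseline = run_suite(old_prompt_model, cases)
candidate = run_suite(new_prompt_model, cases)
broken = regressions(baseline, candidate)

# Fix: ship the new prompt only once this list is empty.
print("Regressions:", broken)
```

In this toy run the new prompt gains the `shows_reasoning` case but loses `format` and `concise`, which is the kind of trade-off that is easy to miss when a prompt is judged by eye. A real suite would swap the stand-ins for actual model calls and, because output is stochastic, would likely sample each case several times.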

CloudCodeTree
