Stable v1.0 Last Updated: 1/5/2026
Evaluation-Driven Development (EDD)
Moving from “Vibe Checks” to metrics. A comprehensive framework for testing stochastic AI systems.
Summary
- What it is: A methodology where the evaluation pipeline (Evals) drives the development loop, replacing manual testing.
- Why now: Stochastic systems (LLMs) cannot be verified with traditional binary unit tests.
- Who it’s for: AI Engineers and senior technical leaders establishing reliability standards.
The “Vibe Check” Trap
In traditional software, `assert result == expected` is binary: it passes or it fails. In GenAI, the output is probabilistic: the same prompt can yield different, equally valid phrasings.
Most teams rely on manual “Vibe Checks” — chatting with the bot to see if it “feels right.” This doesn’t scale.
EDD Mandate: We cannot ship software we cannot measure.
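To make the contrast concrete, here is a toy sketch. The `word_overlap` helper is an illustrative stand-in; real evals use embeddings or an LLM judge, not lexical overlap.

```python
# Two semantically equivalent answers that a binary assert treats as different.
expected = "The capital of France is Paris."
actual = "Paris is the capital of France."

# Traditional unit testing: exact match fails even though the answer is correct.
exact_match = actual == expected  # False

def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased word sets (a crude semantic proxy)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# A graded score at least captures that these answers are closely related.
score = word_overlap(actual, expected)
```

The point is not that word overlap is a good metric (it isn’t), but that stochastic outputs need graded scoring rather than a single equality check.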
The Hierarchy of Evals
Level 1: Deterministic (Unit Tests)
- Check: JSON schema validity, forbidden keywords, tool call structure.
- Cost: Cheap ($0).
- Run: On every commit (CI).
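A minimal sketch of Level 1 checks, assuming the model returns a raw JSON string; the function names and forbidden-phrase list are illustrative, not from any particular library.

```python
import json

# Illustrative blocklist; a real one comes from product/safety requirements.
FORBIDDEN = {"as an ai language model", "i cannot help"}

def check_json_schema(raw: str, required_keys: set) -> bool:
    """Level 1 check: output parses as JSON and contains the required keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys <= obj.keys()

def check_forbidden_keywords(raw: str) -> bool:
    """Level 1 check: no banned phrases appear in the output."""
    lowered = raw.lower()
    return not any(phrase in lowered for phrase in FORBIDDEN)

# These run as plain asserts on every commit (CI): deterministic, no API calls.
output = '{"answer": "Paris", "sources": ["doc_1"]}'
assert check_json_schema(output, {"answer", "sources"})
assert check_forbidden_keywords(output)
```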
Level 2: Model-Graded (LLM-as-a-Judge)
- Check: “Is the answer faithful to the context?”, “Is the tone helpful?”
- Cost: Moderate (LLM API calls).
- Run: Nightly or Pre-Release.
- Tools: RAGAS, DeepEval.
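The core of a Level 2 eval is a judge prompt plus a parser for the judge’s verdict. The sketch below stubs the model call with a lambda; in practice `call_model` would wrap an LLM API (the prompt wording and 1–5 scale here are assumptions, not a RAGAS/DeepEval interface).

```python
JUDGE_PROMPT = """You are an impartial judge. Given a CONTEXT and an ANSWER,
reply with a single line: SCORE: <1-5>, where 5 means the answer is fully
supported by the context and 1 means it contradicts or ignores it.

CONTEXT: {context}
ANSWER: {answer}"""

def parse_judge_score(reply: str) -> int:
    """Extract the 1-5 score from the judge's reply; raise if malformed."""
    for line in reply.splitlines():
        if line.strip().upper().startswith("SCORE:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError(f"Unparseable judge reply: {reply!r}")

def judge_faithfulness(context: str, answer: str, call_model) -> int:
    """Run the judge prompt through any callable mapping prompt -> text."""
    prompt = JUDGE_PROMPT.format(context=context, answer=answer)
    return parse_judge_score(call_model(prompt))

# Stub model for demonstration; in production this is a real LLM API call,
# which is why Level 2 runs nightly rather than on every commit.
score = judge_faithfulness("Paris is in France.", "Paris is in France.",
                           lambda prompt: "SCORE: 5")
```

Parsing defensively matters: judges occasionally return malformed replies, and a silent default score corrupts the eval report.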
Level 3: Human (Golden Datasets)
- Check: Ground truth correctness for complex reasoning.
- Cost: High (Expert time).
- Run: Weekly/Monthly to calibrate Level 2 judges.
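Calibration of Level 2 against Level 3 can be as simple as an agreement rate over a shared label set; the threshold and labels below are illustrative.

```python
def judge_agreement(judge_labels, human_labels) -> float:
    """Fraction of examples where the Level 2 judge matches the human label."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Weekly calibration: if agreement drops below an agreed threshold,
# retune the judge prompt before trusting its nightly scores.
humans = ["pass", "fail", "pass", "pass", "fail"]
judge  = ["pass", "fail", "pass", "fail", "fail"]
rate = judge_agreement(judge, humans)  # 0.8
```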
The RAG Triad Metrics
For Retrieval systems, we measure three relationships between the query, the retrieved context, and the generated answer:
- Context Relevance: Did we retrieve the right docs?
- Faithfulness: Is the answer supported by the docs? (Hallucination check)
- Answer Relevance: Did we answer the user’s question?
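Tools like RAGAS compute these with LLM judges; the sketch below uses a crude lexical-overlap proxy purely to show how the three metrics pair up the question, context, and answer. All names here are illustrative.

```python
def _overlap(a: str, b: str) -> float:
    """Fraction of a's words that also appear in b (directional, lexical)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa) if wa else 0.0

def rag_triad(question: str, context: str, answer: str) -> dict:
    return {
        # Context Relevance: did the retrieved docs cover the question's terms?
        "context_relevance": _overlap(question, context),
        # Faithfulness: is the answer grounded in the context? (hallucination check)
        "faithfulness": _overlap(answer, context),
        # Answer Relevance: does the answer address the question?
        "answer_relevance": _overlap(question, answer),
    }

metrics = rag_triad(
    "where is paris",
    "paris is the capital of france",
    "paris is in france",
)
```

Note the asymmetry: faithfulness scores the answer against the context, while the two relevance metrics score against the question. A system can retrieve perfect docs and still hallucinate, which is why all three are tracked separately.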
The Optimization Flywheel
```mermaid
graph TD
    Prod[Production Logs] -->|Sample Failures| Dataset[Golden Dataset]
    Dataset -->|Run Evals| Metrics[Eval Report]
    Metrics -->|Analyze| Tuning[Prompt/Code Changes]
    Tuning -->|Verify| Regression[Regression Test]
    Regression -->|Pass| Prod
```
Strategic Insight: The team with the fastest loop from “Production Failure” to “New Test Case” wins.
Implementation Strategy
- Start Small: Create a `golden_dataset.jsonl` with 20 diverse examples.
- Automate: Add a GitHub Action that runs `pytest` with a simple LLM judge.
- Monitor: Use a supervisor model (Sentinel) to sample 1% of production traffic for drift.
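The golden dataset is just JSONL: one test case per line, read straight into parametrized tests. A minimal round-trip sketch (field names `input`/`expected` are an assumed convention, not a standard):

```python
import json
import pathlib
import tempfile

# One eval case per line; 20 diverse examples is enough to start.
EXAMPLES = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "Who wrote Hamlet?", "expected": "Shakespeare"},
]

def write_golden(path: pathlib.Path) -> None:
    """Serialize cases as JSONL, one JSON object per line."""
    path.write_text("\n".join(json.dumps(e) for e in EXAMPLES))

def load_golden(path: pathlib.Path) -> list:
    """Load cases back; in CI these feed pytest.mark.parametrize."""
    return [json.loads(line) for line in path.read_text().splitlines() if line]

path = pathlib.Path(tempfile.mkdtemp()) / "golden_dataset.jsonl"
write_golden(path)
cases = load_golden(path)
```

JSONL over JSON is deliberate: appending a new production failure is a one-line diff, which keeps the “failure to test case” loop fast.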
Related Projects
- Sentinel: Can be used as an online evaluator/supervisor.
- MonitorX: For capturing the production traces to feed the dataset.