Stable v1.0 Last Updated: 1/5/2026

Evaluation-Driven Development (EDD)

Moving from 'Vibe Checks' to metrics. A comprehensive framework for testing stochastic AI systems.

Summary

  • What it is: A methodology where the evaluation pipeline (Evals) drives the development loop, replacing manual testing.
  • Why now: Stochastic systems (LLMs) cannot be verified with traditional binary unit tests.
  • Who it’s for: AI engineers and senior technical leaders establishing reliability standards.

The “Vibe Check” Trap

In traditional software, assert(result == expected) is binary: it passes or it fails. In GenAI, the output is probabilistic, so most teams fall back on manual “Vibe Checks”: chatting with the bot to see if it “feels right.” That doesn’t scale, isn’t reproducible, and can’t catch regressions.

EDD Mandate: We cannot ship software we cannot measure.


The Hierarchy of Evals

Level 1: Deterministic (Unit Tests)

  • Check: JSON schema validity, forbidden keywords, tool call structure.
  • Cost: Cheap (no LLM calls).
  • Run: On every commit (CI).
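Level 1 checks translate directly into ordinary unit tests. A minimal sketch in Python; the function names and the forbidden phrases are illustrative, not from any specific library:

```python
import json

# Phrases that should never appear in production output (example list).
FORBIDDEN_PHRASES = {"as an ai language model", "i cannot help with that"}

def check_json_schema(raw: str, required_keys: set[str]) -> bool:
    """Deterministic check: output parses as JSON and has the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def check_forbidden_keywords(raw: str) -> bool:
    """Deterministic check: output contains none of the forbidden phrases."""
    lowered = raw.lower()
    return not any(phrase in lowered for phrase in FORBIDDEN_PHRASES)
```

Because these are pure functions with binary outcomes, they belong in the same CI job as the rest of the test suite.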

Level 2: Model-Graded (LLM-as-a-Judge)

  • Check: “Is the answer faithful to the context?”, “Is the tone helpful?”
  • Cost: Moderate (LLM API calls).
  • Run: Nightly or Pre-Release.
  • Tools: RAGAS, DeepEval.
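A judge can be as simple as a grading prompt plus a verdict parser. A sketch, where call_llm is a placeholder for whichever model client you use (the prompt wording and PASS/FAIL protocol are assumptions, not a standard):

```python
JUDGE_PROMPT = """You are a strict grader.
Context: {context}
Answer: {answer}
Is the answer fully supported by the context?
Reply with PASS or FAIL on the first line, then a one-line reason."""

def parse_verdict(judge_output: str) -> bool:
    """Extract a binary verdict from the judge's free-text reply."""
    parts = judge_output.split()
    return bool(parts) and parts[0].upper().startswith("PASS")

def judge_faithfulness(context: str, answer: str, call_llm) -> bool:
    """call_llm: placeholder callable taking a prompt string, returning text."""
    reply = call_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    return parse_verdict(reply)
```

Keeping the parser separate from the model call makes the deterministic half of the judge unit-testable at Level 1.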

Level 3: Human (Golden Datasets)

  • Check: Ground truth correctness for complex reasoning.
  • Cost: High (Expert time).
  • Run: Weekly/Monthly to calibrate Level 2 judges.
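One way to calibrate a Level 2 judge is to score its verdicts against human labels on the same golden examples. A simple sketch using plain agreement rate; a fuller calibration would also examine per-class errors (e.g., does the judge over-pass or over-fail?):

```python
def judge_agreement(human_labels: list[bool], judge_labels: list[bool]) -> float:
    """Fraction of golden examples where the LLM judge matches the human verdict."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)
```

If agreement drops below a threshold you trust, the judge prompt (not the product) is what needs tuning.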

The RAG Triad Metrics

For retrieval systems, we measure three relationships between the query, the retrieved context, and the answer:

  1. Context Relevance: Did we retrieve the right docs?
  2. Faithfulness: Is the answer supported by the docs? (Hallucination check)
  3. Answer Relevance: Did we answer the user’s question?
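As a toy illustration of what Faithfulness measures, here is a token-overlap heuristic. This is not how RAGAS or DeepEval compute it (they decompose the answer into claims and verify each claim against the context), but it shows the shape of the metric: a score in [0, 1] grounding the answer in the retrieved docs:

```python
def toy_faithfulness(answer: str, context: str) -> float:
    """Toy heuristic: fraction of answer tokens that also appear in the context.

    Real implementations verify extracted claims, not raw token overlap.
    """
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```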

The Optimization Flywheel

```mermaid
graph TD
    Prod[Production Logs] -->|Sample Failures| Dataset[Golden Dataset]
    Dataset -->|Run Evals| Metrics[Eval Report]
    Metrics -->|Analyze| Tuning[Prompt/Code Changes]
    Tuning -->|Verify| Regression[Regression Test]
    Regression -->|Pass| Prod
```

Strategic Insight: The team with the fastest loop from “Production Failure” to “New Test Case” wins.


Implementation Strategy

  1. Start Small: Create a golden_dataset.jsonl with 20 diverse examples.
  2. Automate: Add a GitHub Action that runs pytest with a simple LLM judge.
  3. Monitor: Use a supervisor model (Sentinel) to sample 1% of production traffic for drift.

Supporting tools:

  • Sentinel: an online evaluator/supervisor for the sampled live traffic.
  • MonitorX: captures the production traces that feed the golden dataset.