Stable v1.0 Last Updated: 1/5/2026

Evaluation-Driven Development (EDD)

Moving from 'Vibe Checks' to metrics. A comprehensive framework for testing stochastic AI systems.

Summary

  • What it is: A methodology where the evaluation pipeline (Evals) drives the development loop, replacing manual testing.
  • Why now: Stochastic systems (LLMs) cannot be verified with traditional binary unit tests.
  • Who it’s for: AI engineers and senior technical leaders establishing reliability standards.

The “Vibe Check” Trap

In traditional software, assert(result == expected) is binary: it passes or it fails. In GenAI, the output is probabilistic, so most teams fall back on manual “Vibe Checks”: chatting with the bot to see if it “feels right.” That doesn’t scale, isn’t reproducible, and can’t catch regressions.

EDD Mandate: We cannot ship software we cannot measure.


The Hierarchy of Evals

Level 1: Deterministic (Unit Tests)

  • Check: JSON schema validity, forbidden keywords, tool call structure.
  • Cost: Cheap (no LLM calls).
  • Run: On every commit (CI).
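Level 1 checks translate directly into ordinary unit tests. A minimal sketch in Python; the function names and the forbidden phrases are illustrative, not from any specific library:

```python
import json

# Phrases that should never appear in production output (example list).
FORBIDDEN_PHRASES = {"as an ai language model", "i cannot help with that"}

def check_json_schema(raw: str, required_keys: set[str]) -> bool:
    """Deterministic check: output parses as JSON and has the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def check_forbidden_keywords(raw: str) -> bool:
    """Deterministic check: output contains none of the forbidden phrases."""
    lowered = raw.lower()
    return not any(phrase in lowered for phrase in FORBIDDEN_PHRASES)
```

Because these are pure functions with binary outcomes, they belong in the same CI job as the rest of the test suite.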

Level 2: Model-Graded (LLM-as-a-Judge)

  • Check: “Is the answer faithful to the context?”, “Is the tone helpful?”
  • Cost: Moderate (LLM API calls).
  • Run: Nightly or Pre-Release.
  • Tools: RAGAS, DeepEval.
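A judge can be as simple as a grading prompt plus a verdict parser. A sketch, where call_llm is a placeholder for whichever model client you use (the prompt wording and PASS/FAIL protocol are assumptions, not a standard):

```python
JUDGE_PROMPT = """You are a strict grader.
Context: {context}
Answer: {answer}
Is the answer fully supported by the context?
Reply with PASS or FAIL on the first line, then a one-line reason."""

def parse_verdict(judge_output: str) -> bool:
    """Extract a binary verdict from the judge's free-text reply."""
    parts = judge_output.split()
    return bool(parts) and parts[0].upper().startswith("PASS")

def judge_faithfulness(context: str, answer: str, call_llm) -> bool:
    """call_llm: placeholder callable taking a prompt string, returning text."""
    reply = call_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    return parse_verdict(reply)
```

Keeping the parser separate from the model call makes the deterministic half of the judge unit-testable at Level 1.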

Level 3: Human (Golden Datasets)

  • Check: Ground truth correctness for complex reasoning.
  • Cost: High (Expert time).
  • Run: Weekly/Monthly to calibrate Level 2 judges.
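One way to calibrate a Level 2 judge is to score its verdicts against human labels on the same golden examples. A simple sketch using plain agreement rate; a fuller calibration would also examine per-class errors (e.g., does the judge over-pass or over-fail?):

```python
def judge_agreement(human_labels: list[bool], judge_labels: list[bool]) -> float:
    """Fraction of golden examples where the LLM judge matches the human verdict."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)
```

If agreement drops below a threshold you trust, the judge prompt (not the product) is what needs tuning.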

The RAG Triad Metrics

For retrieval systems, we measure three relationships between the query, the retrieved context, and the answer:

  1. Context Relevance: Did we retrieve the right docs?
  2. Faithfulness: Is the answer supported by the docs? (Hallucination check)
  3. Answer Relevance: Did we answer the user’s question?
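As a toy illustration of what Faithfulness measures, here is a token-overlap heuristic. This is not how RAGAS or DeepEval compute it (they decompose the answer into claims and verify each claim against the context), but it shows the shape of the metric: a score in [0, 1] grounding the answer in the retrieved docs:

```python
def toy_faithfulness(answer: str, context: str) -> float:
    """Toy heuristic: fraction of answer tokens that also appear in the context.

    Real implementations verify extracted claims, not raw token overlap.
    """
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```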

The Optimization Flywheel

```mermaid
graph TD
    Prod[Production Logs] -->|Sample Failures| Dataset[Golden Dataset]
    Dataset -->|Run Evals| Metrics[Eval Report]
    Metrics -->|Analyze| Tuning[Prompt/Code Changes]
    Tuning -->|Verify| Regression[Regression Test]
    Regression -->|Pass| Prod
```

Strategic Insight: The team with the fastest loop from “Production Failure” to “New Test Case” wins.


Implementation Strategy

  1. Start Small: Create a golden_dataset.jsonl with 20 diverse examples.
  2. Automate: Add a GitHub Action that runs pytest with a simple LLM judge.
  3. Monitor: Use a supervisor model (Sentinel) to sample 1% of production traffic for drift.

Supporting tools:

  • Sentinel: an online evaluator/supervisor for the sampled live traffic.
  • MonitorX: captures the production traces that feed the golden dataset.