    February 9, 2026

    Evals-Driven Development: How It Actually Works in Practice

    Everyone says 'use evals' to ship better AI. Almost nobody explains what that means. Here's the practical workflow for evals-driven development in production.

    Sebastian Mondragon
    10 min read
    TL;DR

    Evals-driven development means making eval results the gate for every AI system change—prompt edits, model swaps, retrieval tweaks, nothing ships without data. Start with 30-50 test cases built from real production queries. Define scoring criteria (exact match for factual tasks, LLM-as-judge for subjective quality). Build a simple runner that compares before/after results. Follow the loop: write the eval first, make the change, run the suite, analyze regressions, then ship or iterate. Use tiered evaluation—fast smoke tests on every commit, comprehensive evals before merge, full regression weekly. The hard part isn't tooling. It's building the discipline to never skip the eval step, even when the change 'obviously' works.

    "Just use evals." If you've spent any time in AI engineering circles over the past year, you've heard this advice roughly a thousand times. Conference talks, blog posts, podcasts—the consensus is clear: evals-driven development is how serious teams build AI products.

    There's just one problem. Ask most teams what evals-driven development actually looks like on a Tuesday afternoon, and you get vague gestures toward "testing your prompts" or "measuring accuracy." The phrase has become one of those ideas everyone agrees with and almost nobody implements with real rigor.

    A client came to us after a prompt change knocked out accurate responses for 30% of their edge cases in production. They had no eval suite. The change "looked good" in manual testing—five queries, all reasonable outputs. It shipped. Three days later, support tickets revealed the damage. That's what development without evals actually costs.

    Here's what evals-driven development looks like when you do it properly.

    What Evals-Driven Development Actually Means

    Evals-driven development is a workflow where evaluation results are the primary gate for every change to your AI system. Not vibes. Not "it seems to work." Not a demo to the product manager. Data.

    The concept borrows from test-driven development, but the analogy only goes so far. In TDD, you write a failing test, write code to pass it, and refactor. The test is deterministic—it passes or fails. AI evals are fundamentally different because your system is non-deterministic. The same prompt with the same input can produce different outputs across runs. You're not checking for exact correctness. You're measuring quality distributions.

    In practice, evals-driven development means this: before you change a prompt, swap a model, adjust retrieval parameters, or modify any component of your AI pipeline, you define what "better" looks like in measurable terms. Then you run your change against that measurement. Then you look at the data before deciding to ship.

    This applies to everything—not just major model upgrades. A one-word prompt edit can cascade into unexpected regressions. A temperature adjustment can shift response quality in ways that feel subtle during manual testing but show clear degradation across 200 eval cases. The discipline is treating every change as potentially impactful and requiring evidence before deployment.

    The difference between teams that say they do evals-driven development and teams that actually do it comes down to one thing: whether the eval suite runs before every merge, or whether it's something the team means to run when they remember.

    Why Traditional Testing Breaks Down for AI

    Software engineers instinctively reach for unit tests when they hear "testing." Write an assertion, check the output, move on. This approach fundamentally doesn't work for AI systems, and understanding why is essential to building effective evals.

    Non-determinism is the default

    Traditional tests assume deterministic behavior: given input X, the function returns Y. Always. AI models produce different outputs on every call, even with identical inputs. A language model might answer a question correctly 85% of the time and produce a subtly wrong answer 15% of the time. A single test run tells you nothing about that distribution. You need multiple runs, statistical aggregation, and tolerance thresholds—not binary assertions.

    Quality is multi-dimensional

    A software function either returns the right value or it doesn't. An AI response can be accurate but unhelpful, helpful but inaccurate, well-formatted but factually wrong, correct but missing critical context. You can't collapse quality into a binary pass/fail. Effective evals score across multiple dimensions—accuracy, relevance, completeness, tone, safety—because a change that improves one dimension frequently degrades another.

    Failure modes are emergent

    Traditional software bugs are reproducible: the same input triggers the same failure. AI failures emerge from combinations of inputs, context, and model behavior that you can't predict from individual test cases. A prompt change might work perfectly for short queries but break on long ones. A model swap might improve formal language but degrade casual conversation. These interaction effects only surface when you evaluate across diverse, representative test sets. Five spot-checked examples won't reveal them. This is why teams that try to "unit test" their AI systems end up with either fragile tests that break constantly or meaningless tests that pass while quality degrades. Evals-driven development requires a fundamentally different evaluation approach.

    Building Your First Eval Suite From Scratch

    You don't need a framework, a platform, or a month of setup. You need test cases, scoring criteria, and a script that runs them.

    Collect real examples

    Your eval cases should come from production, not imagination. Export actual user queries from logs, support tickets, or usage analytics. If you're pre-launch, collect questions from internal stakeholders, beta users, or subject matter experts who represent your target users. The goal is test cases that reflect how people actually interact with your system—including the messy, ambiguous, poorly-phrased inputs that real users submit. Start with 30-50 cases. Prioritize coverage over volume. Ensure you have examples from each major use case, including 5-10 edge cases that have caused problems or that you suspect could cause problems.

    Define scoring criteria

    For each eval case, define what a good response looks like. This varies by system type:

    • Factual Q&A: Does the response contain the correct answer? Score with exact match or semantic similarity against a reference answer.
    • Summarization: Does the summary capture key points without hallucination? Use an LLM-as-judge with a scoring rubric.
    • Classification: Does the output match the expected category? Binary scoring works here.
    • Conversational agents: Is the response helpful, on-topic, and appropriately toned? Multi-dimensional LLM-as-judge scoring across separate criteria.

    Don't aim for perfection in scoring criteria upfront. Start with what's practical and refine as you learn which dimensions matter most. A simple three-point scale—good, acceptable, bad—catches more regressions than a sophisticated rubric you never implement.
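
    As a concrete illustration, here is a minimal sketch of those scoring approaches in Python: an exact-match check, a looser containment check, and a three-point LLM-as-judge rubric. The EvalCase structure, the judge model name, and the OpenAI-style client are illustrative assumptions, not a prescribed framework.

```python
# Sketch: three scoring approaches for different task types.
# EvalCase, the judge model name, and the client are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class EvalCase:
    query: str
    expected: str   # reference answer or label
    task_type: str  # "factual", "classification", "subjective"

def score_exact_match(output: str, case: EvalCase) -> float:
    """Binary scoring for classification or strict factual tasks."""
    return 1.0 if output.strip().lower() == case.expected.strip().lower() else 0.0

def score_contains_answer(output: str, case: EvalCase) -> float:
    """Looser factual check: does the response contain the reference answer?"""
    return 1.0 if case.expected.lower() in output.lower() else 0.0

JUDGE_RUBRIC = """Rate the response on a three-point scale:
2 = good (accurate, relevant, complete)
1 = acceptable (minor issues)
0 = bad (inaccurate, off-topic, or incomplete)
Reply with only the number."""

def score_llm_judge(output: str, case: EvalCase, client, judge_model: str = "gpt-4o-mini") -> float:
    """Subjective quality scored by an LLM judge against the simple rubric above."""
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"Question: {case.query}\nExpected: {case.expected}\nResponse: {output}"},
        ],
    )
    return float(resp.choices[0].message.content.strip()) / 2  # normalize to 0-1
```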

    Build a simple runner

    A Python script that iterates through test cases, calls your AI system, applies scoring, and outputs a results table is enough to start. Store results in a structured format so you can compare runs over time. The infrastructure can be minimal as long as it runs consistently and produces comparable results. For detailed guidance on building representative test datasets, see our article on building evaluation datasets for business AI.
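
    A minimal runner along those lines might look like the sketch below. The call_system and score_case callables are placeholders for your own pipeline and scorer, and the JSON results layout is just one reasonable choice, not a required format.

```python
# Sketch: a minimal eval runner. call_system() stands in for your AI pipeline,
# score_case(output, case) for whichever scorer fits the task; both are assumptions.
import json
import statistics
from datetime import datetime, timezone
from pathlib import Path

def run_suite(cases, call_system, score_case, results_dir="eval_results"):
    rows = []
    for case in cases:
        output = call_system(case["query"])
        rows.append({
            "case_id": case["id"],
            "query": case["query"],
            "output": output,
            "score": score_case(output, case),
        })
    run = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "mean_score": statistics.mean(r["score"] for r in rows),
        "results": rows,
    }
    Path(results_dir).mkdir(exist_ok=True)
    out_path = Path(results_dir) / f"run_{run['timestamp'].replace(':', '-')}.json"
    out_path.write_text(json.dumps(run, indent=2))
    print(f"{len(rows)} cases, mean score {run['mean_score']:.2f} -> {out_path}")
    return run
```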

    The Evals-Driven Development Loop

    Once your eval suite exists, the workflow follows a consistent pattern. This is where the "driven" part actually happens.

    Write the eval first

    Before making any change, define what improvement you expect. Adding a new system prompt instruction? Write eval cases that test the specific behavior you're introducing. Swapping to a cheaper model? Define the quality floor you won't accept dropping below. Adjusting retrieval to return more chunks? Add cases that test whether additional context helps or hurts answer quality. This step is where most teams skip ahead. They make the change, then figure out how to evaluate it. That approach introduces confirmation bias—you're looking for evidence the change works rather than evidence it doesn't. Writing the eval first forces you to commit to success criteria before you see the results.
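
    For example, a team about to add a hypothetical "always cite the source document" instruction to its system prompt might commit to targeted cases and success criteria like these before touching the prompt. The queries, checks, and threshold below are invented for illustration.

```python
# Sketch: eval cases written *before* shipping a new system-prompt instruction
# (a hypothetical rule that answers must cite a source document).
new_behavior_cases = [
    {
        "id": "cite-001",
        "query": "What is our refund window for annual plans?",
        "expected_behavior": "States the correct window and names the refund policy document.",
        "check": lambda output: "refund policy" in output.lower(),
    },
    {
        "id": "cite-002",
        "query": "Can I transfer my license to a coworker?",
        "expected_behavior": "Cites the licensing terms rather than guessing.",
        "check": lambda output: "licensing" in output.lower(),
    },
]

# Success criteria committed up front: every new case passes AND the existing
# suite's mean score drops by no more than an agreed threshold (say, 0.02).
```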

    Establish the baseline

    Run your current system against the full eval suite. Record scores across all dimensions. This baseline is your comparison point. Without it, you can't distinguish improvement from regression. Store baseline results persistently—you'll reference them repeatedly as you iterate.

    Make one change at a time

    Modify your system: edit the prompt, swap the model, adjust retrieval parameters. Resist the urge to bundle multiple changes. When you change three things simultaneously, you can't attribute improvements or regressions to any specific modification. Isolating changes makes debugging straightforward when something breaks.

    Run evals and analyze results

    Run the modified system against the same eval suite. Compare against the baseline. Look for three things:

    1. Overall score changes: Did aggregate quality go up or down?
    2. Dimension-specific shifts: Did accuracy improve while completeness dropped?
    3. Case-specific regressions: Which individual cases got worse?

    That third point matters most. A 2% average improvement that masks a 15% degradation on your most important use case is not a win. Always inspect individual case results, not just aggregates.
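
    A small comparison script makes this analysis routine. The sketch below assumes the stored-run format from the runner example earlier and flags per-case regressions alongside the aggregate delta; the 0.2 threshold is an arbitrary placeholder.

```python
# Sketch: compare a candidate run against the stored baseline and surface
# per-case regressions, not just the aggregate delta.
import json
from pathlib import Path

def compare_runs(baseline_path, candidate_path, regression_threshold=0.2):
    baseline = json.loads(Path(baseline_path).read_text())
    candidate = json.loads(Path(candidate_path).read_text())
    base_scores = {r["case_id"]: r["score"] for r in baseline["results"]}

    regressions = []
    for row in candidate["results"]:
        before = base_scores.get(row["case_id"])
        if before is not None and before - row["score"] >= regression_threshold:
            regressions.append((row["case_id"], before, row["score"]))

    delta = candidate["mean_score"] - baseline["mean_score"]
    print(f"Aggregate: {baseline['mean_score']:.2f} -> {candidate['mean_score']:.2f} ({delta:+.2f})")
    for case_id, before, after in regressions:
        print(f"REGRESSION {case_id}: {before:.2f} -> {after:.2f}")
    return delta, regressions
```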

    Ship or iterate

    If evals show improvement without meaningful regressions, ship with confidence. If they reveal regressions, investigate. Sometimes the regression is acceptable—a model swap that saves 60% on inference costs with a 3% quality drop might be the right business decision. But you're making that decision with data, not hope. That's the entire point.

    What Good Evals Look Like Across Different Systems

    Different AI systems need different eval strategies. One-size-fits-all evaluation misses what matters for your specific application.

    RAG systems

    RAG evals need to separate retrieval quality from generation quality. Measure retrieval with precision@k and recall@k against labeled ground-truth documents. Measure generation with faithfulness checks—does the response stick to retrieved context?—and relevance scoring—does it actually answer the question? A change that improves generation but degrades retrieval needs a different fix than one that does the opposite. Collapsing both into a single score hides the root cause. For detailed RAG evaluation approaches, see our guide on telling whether your RAG system actually works.
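
    As a sketch, the retrieval side is easy to measure directly once each query has labeled ground-truth document IDs; the IDs below are invented for illustration.

```python
# Sketch: retrieval metrics scored separately from generation quality.
# retrieved is the ranked list of document IDs your retriever returned;
# relevant is the labeled ground-truth set for the query.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)

# Ground truth says documents d1 and d4 answer the query.
print(precision_at_k(["d1", "d7", "d4", "d9"], {"d1", "d4"}, k=3))  # 0.67
print(recall_at_k(["d1", "d7", "d4", "d9"], {"d1", "d4"}, k=3))     # 1.0
```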

    Classification pipelines

    Classification evals are the most straightforward: compare predicted labels against ground truth. But go beyond overall accuracy. Measure per-class performance, especially for rare but important classes. A model that confuses "urgent" with "normal" priority is a bigger production problem than one that confuses "low" with "normal," even if both errors contribute equally to the overall error rate.
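
    A per-class report takes only a few lines of plain Python; the sketch below assumes string labels and shows how a decent aggregate accuracy can hide weak recall on a critical class.

```python
# Sketch: per-class recall alongside overall accuracy, so a drop on a rare
# but critical class ("urgent") stays visible even when the aggregate looks fine.
from collections import Counter, defaultdict

def per_class_report(expected: list[str], predicted: list[str]) -> dict:
    total = Counter(expected)
    correct = defaultdict(int)
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            correct[exp] += 1
    report = {label: correct[label] / count for label, count in total.items()}
    report["overall_accuracy"] = sum(correct.values()) / len(expected)
    return report

# Overall accuracy is 0.8, but recall on "urgent" is only 0.5.
print(per_class_report(
    expected=["normal", "normal", "urgent", "urgent", "low"],
    predicted=["normal", "normal", "urgent", "normal", "low"],
))
```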

    Conversational agents

    Conversational evals require multi-turn evaluation, not just single-response scoring. Does the agent maintain context across turns? Does it handle clarification requests? Does it recover gracefully from misunderstandings? Score individual turns and conversation-level coherence separately. LLM-as-judge works well here with rubrics that assess helpfulness, consistency, and appropriate escalation to human support.
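
    One way to structure this, sketched below, is to judge each assistant turn against the context that preceded it and then judge the whole conversation with a separate rubric. The judge callable is a placeholder for an LLM-as-judge helper, not a specific API.

```python
# Sketch: separate turn-level and conversation-level judging.
# judge(rubric, context, reply) is an assumed LLM-as-judge helper returning a 0-2 score.
TURN_RUBRIC = ("Score this single assistant reply 0-2 for helpfulness and staying "
               "on topic, given the conversation so far. Reply with only the number.")

CONVERSATION_RUBRIC = ("Score the whole conversation 0-2 for coherence: does the "
                       "assistant keep context across turns, handle clarification "
                       "requests, and escalate when it should? Reply with only the number.")

def score_conversation(turns: list[dict], judge) -> dict:
    """turns: [{"role": "user" | "assistant", "content": "..."}, ...]"""
    transcript = ""
    turn_scores = []
    for turn in turns:
        if turn["role"] == "assistant":
            # Judge each assistant reply with the context that preceded it.
            turn_scores.append(judge(TURN_RUBRIC, transcript, turn["content"]))
        transcript += f"{turn['role']}: {turn['content']}\n"
    return {
        "turn_scores": turn_scores,
        "conversation_score": judge(CONVERSATION_RUBRIC, transcript, ""),
    }
```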

    Code generation tools

    Code evals have a unique advantage: you can execute the output. Run generated code against test suites and measure pass rates. But also evaluate code quality—readability, efficiency, security practices. A function that passes all tests but introduces a SQL injection vulnerability isn't a success. Combine execution-based scoring with quality assessments.
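
    A minimal execution-based scorer might look like the sketch below. It assumes the generated snippet defines a function named solution, and it uses exec purely for illustration; in production you would sandbox execution with timeouts and resource limits.

```python
# Sketch: execution-based scoring for generated code. exec() is for illustration
# only; real eval infrastructure should sandbox untrusted code.
def pass_rate(generated_code: str, test_cases: list[tuple]) -> float:
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # define the generated function
    except Exception:
        return 0.0                       # code that doesn't even load scores 0
    fn = namespace.get("solution")       # assumed entry-point name
    if fn is None:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass                         # runtime errors count as failures
    return passed / len(test_cases)

# Example: a generated slugify function scored against two test cases.
code = "def solution(s):\n    return s.strip().lower().replace(' ', '-')"
print(pass_rate(code, [(("Hello World",), "hello-world"), (("  AI Evals ",), "ai-evals")]))  # 1.0
```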

    Scaling Evals Without Drowning in Maintenance

    Eval suites that grow without discipline become burdens that teams eventually abandon. Sustainability matters as much as coverage.

    Use a tiered evaluation strategy

    Not every change needs every eval. Structure your suite into tiers:

    • Smoke tests (10-20 cases): Your most critical scenarios. Run on every commit. Complete in under 2 minutes.
    • Standard suite (50-100 cases): Comprehensive coverage of all major use cases. Run before merging to main. Target under 15 minutes.
    • Full regression (200+ cases): Complete coverage with human review of flagged results. Run weekly or before releases.

    This tiered approach prevents evals from becoming a bottleneck while maintaining rigor where it counts.
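
    One lightweight way to implement the tiers, sketched below, is to tag each case with a tier field and filter at run time, letting CI pick the tier by context. The field name and tier labels are assumptions.

```python
# Sketch: one suite, three nested tiers selected by a "tier" tag on each case.
# Smoke cases run in every tier; "standard" adds more; "full" runs everything.
TIER_ORDER = {"smoke": 0, "standard": 1, "full": 2}

def select_cases(all_cases: list[dict], tier: str) -> list[dict]:
    max_level = TIER_ORDER[tier]
    return [c for c in all_cases if TIER_ORDER[c.get("tier", "full")] <= max_level]

# On every commit:   run_suite(select_cases(cases, "smoke"), call_system, score_case)
# Before merge:      run_suite(select_cases(cases, "standard"), call_system, score_case)
# Weekly regression: run_suite(select_cases(cases, "full"), call_system, score_case)
```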

    Automate dataset expansion

    Add eval cases from production failures automatically. When a user reports a bad response, that query becomes an eval case with the corrected expected output. Over time, your eval suite organically covers the scenarios that actually matter—not just the ones you imagined during initial setup.
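
    As a sketch, turning a reported failure into an eval case can be a single append to your case store; the field names and JSONL file below are assumptions, not a required schema.

```python
# Sketch: convert a production failure report into a new eval case.
import json
from datetime import datetime, timezone

def add_case_from_failure(query: str, bad_output: str, corrected_output: str,
                          suite_path: str = "eval_cases.jsonl") -> dict:
    case = {
        "id": f"prod-{datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S')}",
        "query": query,                # the real user query that failed
        "expected": corrected_output,  # the corrected expected output
        "notes": f"From production failure; bad output was: {bad_output[:200]}",
        "tier": "standard",
        "source": "production",
    }
    with open(suite_path, "a") as f:
        f.write(json.dumps(case) + "\n")
    return case
```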

    Manage evaluation costs

    LLM-as-judge evaluations cost money. Use cheaper models for routine eval runs and reserve expensive models for periodic calibration against human judgment. Cache judge responses for unchanged outputs. Most teams spend under $50 per month on eval infrastructure with smart model selection and caching strategies. For a deeper look at balancing automated and human evaluation, see our analysis of human evaluation versus automated metrics.
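
    A simple cache keyed on the output and rubric goes a long way here; the sketch below uses a JSON file as a stand-in for whatever store you prefer, and judge_fn is a placeholder for the expensive LLM-as-judge call.

```python
# Sketch: cache judge scores keyed on (rubric, output) so unchanged outputs are
# never re-judged across eval runs.
import hashlib
import json
from pathlib import Path

CACHE_PATH = Path("judge_cache.json")

def cached_judge(output: str, rubric: str, judge_fn) -> float:
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    key = hashlib.sha256((rubric + "\x00" + output).encode()).hexdigest()
    if key not in cache:
        cache[key] = judge_fn(output, rubric)  # only pay for the judge on cache misses
        CACHE_PATH.write_text(json.dumps(cache))
    return cache[key]
```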

    Version your eval suite

    Tag eval suite versions alongside code releases. When you add or modify eval cases, document the changes. This lets you compare model performance across consistent eval versions and prevents confusion when scores shift because the measurement changed rather than the system.

    Mistakes That Sabotage Evals-Driven Workflows

    Even committed teams make these errors regularly. Recognizing them early saves months of misdirected effort.

    Evaluating on unrealistic data

    An eval suite built from clean, well-formed, ideal inputs will show high scores while production quality suffers on the messy queries real users submit. Continuously refresh eval cases from production logs. If your eval cases don't include typos, ambiguous phrasing, and incomplete context, they don't represent reality.

    Optimizing for the eval instead of the user

    When the eval suite becomes the target, teams start gaming scores rather than improving user experience. This is Goodhart's Law applied to AI development. A model that scores 95% on your eval but frustrates actual users has a flawed eval, not a great model. Periodically validate that eval scores correlate with real user satisfaction metrics—support ticket volume, user retention, task completion rates.

    Skipping evals for "small" changes

    The most dangerous changes are the ones that seem too small to evaluate. A minor prompt tweak. A temperature adjustment. A system message clarification. These "trivial" changes regularly cause outsized regressions because they affect every response the system generates. The discipline of evals-driven development means running evals especially when the change feels insignificant.

    Never updating the eval suite

    An eval suite created six months ago probably doesn't represent your system's current usage patterns. User behavior evolves, product features change, and new failure modes emerge. Stale eval suites provide false confidence. Budget time for monthly eval suite reviews and treat your evaluation dataset as living infrastructure, not a one-time artifact. For guidance on comprehensive AI system auditing, see our article on AI audits for bugs, bias, and performance.

    Making Evals-Driven Development Your Default

    The gap between "we should use evals" and "we actually use evals" is smaller than most teams think. You don't need perfect tooling, thousands of test cases, or a dedicated evaluation team. You need 30-50 representative cases, a scoring approach that fits your use case, and the discipline to run the suite before every change ships.

    Evals-driven development isn't a testing strategy. It's a decision-making framework. Every change to your AI system generates data, and that data either supports shipping the change or it doesn't. Teams that build this habit ship faster, break less, and build AI products their users actually trust.

    Start today. Pick your most important use case, collect 30 real queries, define what good responses look like, and build a script that measures quality. Run it before your next prompt change. That's evals-driven development. Everything else is conversation.

    Frequently Asked Questions

    How does evals-driven development differ from traditional software testing?

    Traditional testing uses deterministic assertions—input X should produce output Y. Evals-driven development uses probabilistic evaluation because AI outputs vary between runs. Instead of checking for exact matches, you measure quality across dimensions like accuracy, relevance, and faithfulness using scoring rubrics, LLM-as-judge approaches, or statistical thresholds. The eval suite becomes the decision-making mechanism for every change, not just a post-deployment check.
