How to Test an AI Feature in 2026 - A Practical Guide to Evals

A team showed me their new AI feature four months ago. It was a support assistant built into their product. The demo was clean. They typed five questions, the assistant answered all five well, the room nodded, and the CEO asked when it could go live.

The feature still has not shipped.

Nothing dramatic went wrong. The team just kept finding answers that were slightly off. A refund question handled correctly on Monday was handled wrong on Thursday. A fix for one bad answer quietly broke three good ones. Every week the assistant felt almost done, and every week a new bad answer showed up. Nobody could say whether it was getting better or worse, because nobody was measuring it.

This is the most common way an AI project dies in 2026. Not a crash. Not a cancelled budget. Just a slow loss of confidence, because the team built the feature but never built a way to test it.

The habit that fixes this has a boring name. Evals, short for evaluations. It is the single clearest line I see between teams that ship AI features and teams that stay stuck in demo mode. Here is what an eval is, why it works, and how to start without buying anything.

Why You Cannot Test AI the Way You Test Normal Software

Normal software is predictable. The same input gives the same output, every single time. That is what lets you write a test that says "when the user adds two and two, the result must equal four." The test passes or it fails. There is no middle.

An AI feature does not behave that way. The same question can produce two different answers on two different runs. The meaning may be the same, but the wording changes, the order changes, sometimes a detail appears or disappears. A test that checks for an exact string will fail constantly, even when the answer is good.

There are two more differences that matter. First, most AI questions have no single correct answer. "Summarize this ticket" can be done ten good ways and twenty bad ways. There is nothing to compare against, character for character. Second, the model itself changes under you. You swap to a newer model to save money or gain speed, and behavior you never tested suddenly shifts. Normal software does not rewrite itself when you upgrade a library.

So the old testing toolkit does not fit. Checking that output equals an expected value works for a checkout calculation. It falls apart for a feature whose whole job is to produce language. You need a way to measure quality that accepts variation, scores behavior instead of exact words, and gives you a number you can trust over time.
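To make the difference concrete, here is a minimal sketch. The two outputs are invented for illustration, and `mentions_refund_window` is a hypothetical helper, not part of any library. It shows an exact string check rejecting a good answer that a behavior check accepts.

```python
# Two invented outputs for the same question. The wording differs, the meaning does not.
run_monday = "You can request a refund within 30 days of purchase."
run_thursday = "Refunds are available for 30 days after you buy."

expected = "You can request a refund within 30 days of purchase."


def exact_match(output: str) -> bool:
    """Classic unit test style check: passes only on identical text."""
    return output == expected


def mentions_refund_window(output: str) -> bool:
    """Behavior check (hypothetical): the answer must mention refunds and the 30 day window."""
    text = output.lower()
    return "refund" in text and "30 day" in text


print("exact match:   ", exact_match(run_monday), exact_match(run_thursday))
print("behavior check:", mentions_refund_window(run_monday), mentions_refund_window(run_thursday))
```

A substring check like this is still crude. It only illustrates the move from comparing characters to checking behavior.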

Figure: a side by side comparison of testing normal software (predictable, pass or fail) and testing an AI feature (variable, scored by quality). Normal software gives the same answer every time. AI features need a different kind of test.

What an Eval Actually Is

Strip away the jargon and an eval is three simple things.

A set of test cases. A way to score the output of each case. A number at the end that you can compare over time.

A test case is an input plus a description of what a good response looks like. Notice the word description. You are not writing down the exact answer you expect. You are writing down the behavior you expect. For a support assistant, a case might be the question "How do I get a refund?" paired with the note "must explain the 30 day window, must link the refund page, must stay polite, must not promise a refund it cannot guarantee."
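One way to write that case down, as a rough sketch. The `EvalCase` class and its field names are my own assumptions, not a standard format. Any structure that keeps the input next to the expected behavior would do.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One test case: an input plus a description of good behavior (hypothetical structure)."""
    input_text: str
    must: list[str] = field(default_factory=list)       # behaviors that must be present
    must_not: list[str] = field(default_factory=list)   # behaviors that must be absent


# The refund example from the paragraph above, written as data.
refund_case = EvalCase(
    input_text="How do I get a refund?",
    must=[
        "explain the 30 day window",
        "link the refund page",
        "stay polite",
    ],
    must_not=["promise a refund it cannot guarantee"],
)

print(refund_case)
```

Note that nothing in the case is an expected answer. Every entry describes behavior a scorer can check later.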

A scorer reads the model's output and decides how well it matched that note. Sometimes the scorer is a tiny piece of code, for example a check that the output contains the right link. Sometimes the scorer is a person reading the answer. And increasingly the scorer is another model, given the same note and asked to grade the answer against it.

The score is the part that changes everything. Once every case produces a number, the whole set produces a number. Now you can say "the assistant scored 82 out of 100 yesterday and 79 today." You stop arguing about whether the feature feels better. You look at the trend. That shift, from opinion to measurement, is the entire point of an eval.
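The arithmetic behind that sentence is small enough to sketch. The pass and fail lists below are made up to reproduce the 82 and 79 in the example, and `set_score` is a hypothetical helper that assumes each case is simply pass or fail.

```python
def set_score(case_results: list[bool]) -> float:
    """Percentage of cases that passed, from 0 to 100."""
    if not case_results:
        raise ValueError("eval set is empty")
    return 100 * sum(case_results) / len(case_results)


# Invented results for a 100 case set on two consecutive days.
yesterday = [True] * 82 + [False] * 18
today = [True] * 79 + [False] * 21

delta = set_score(today) - set_score(yesterday)
print(f"yesterday={set_score(yesterday)} today={set_score(today)} delta={delta:+}")
```

If your cases produce partial scores rather than pass or fail, the same idea holds with an average instead of a count.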

Figure: the four parts of an eval. An eval is an input, a description of good behavior, a scorer, and a number you can track.

The Three Kinds of Evals You Need

Not every test case can be scored the same way. After watching a fair number of teams do this, I keep seeing the same three kinds of evals. A healthy AI feature uses all three.

**Ground truth evals.** These are cases that do have a known correct answer. "What is the price of the Pro plan?" has exactly one right answer. "Which category does this ticket belong to?" has a fixed list of options. You score these with simple code. Did the output contain the right number, pick the right category, return the right shape. Ground truth evals are cheap to run, fast, and you can have hundreds of them. They catch the dumbest and most embarrassing failures.
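A sketch of what "simple code" can mean here. The price and the category list are placeholders I made up, and both function names are hypothetical.

```python
PRO_PLAN_PRICE = "$49"  # placeholder value, not a real price
ALLOWED_CATEGORIES = {"billing", "bug", "account", "other"}  # placeholder list


def contains_expected(output: str, expected: str) -> bool:
    """Pass if the expected value appears anywhere in the output (case insensitive)."""
    return expected.lower() in output.lower()


def picks_category(output: str, expected: str) -> bool:
    """Pass if the output is exactly one allowed category and it is the expected one."""
    label = output.strip().lower()
    return label in ALLOWED_CATEGORIES and label == expected.lower()


print(contains_expected("The Pro plan costs $49 per month.", PRO_PLAN_PRICE))
print(picks_category(" Billing\n", "billing"))
print(picks_category("refunds", "billing"))
```

A plain substring check would also accept "$490", so a real scorer for prices probably wants a stricter match. The point is only that these checks need no model and run in milliseconds.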

**Criteria evals.** These are for the cases with no single right answer, which is most of them. You cannot check a summary against one correct string. So instead you write a checklist. Did the summary stay under five sentences. Did it keep the customer name. Did it avoid inventing facts. Did it match the requested tone. Each item on the checklist is a yes or no. The score is how many items passed. Criteria evals are usually graded by a model acting as a judge, because a human cannot grade a thousand of them every day.
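Here is the checklist idea as a minimal sketch. The judge is deliberately a stub that returns canned verdicts, because the call to a real model depends on your provider and I do not want to invent one. Every name here is hypothetical.

```python
from typing import Callable

# A judge takes (criterion, output) and answers yes or no.
Judge = Callable[[str, str], bool]

CHECKLIST = [
    "stays under five sentences",
    "keeps the customer name",
    "does not invent facts",
    "matches the requested tone",
]


def score_with_checklist(output: str, checklist: list[str], judge: Judge) -> float:
    """Fraction of checklist items the judge marked as passed."""
    passed = sum(1 for criterion in checklist if judge(criterion, output))
    return passed / len(checklist)


# Stub judge with canned verdicts. In practice this is where a model call would go.
CANNED_VERDICTS = {
    "stays under five sentences": True,
    "keeps the customer name": True,
    "does not invent facts": False,
    "matches the requested tone": True,
}


def stub_judge(criterion: str, output: str) -> bool:
    return CANNED_VERDICTS[criterion]


summary = "Dana reported a double charge. Support refunded it the same day."
print(score_with_checklist(summary, CHECKLIST, stub_judge))
```

Keeping the judge behind a plain function signature means you can swap the stub for a model, or for a human, without touching the scoring code.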

**Human review evals.** A person reads a sample of real outputs and rates them. This is slow and expensive, so you use it sparingly. But it is the reference standard for everything else. Humans catch problems your checklist never imagined. You also use human review to check that your model judge is grading fairly. If the model judge and your reviewers agree most of the time, you can trust the judge for the daily runs. If they disagree, your checklist needs work.
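Checking the judge against reviewers can start as a simple agreement count. This is a sketch with invented labels. The function name and the split into "generous" and "harsh" are my own framing, and what counts as "most of the time" is a threshold your team has to pick.

```python
def compare_judge_to_humans(judge: list[bool], human: list[bool]) -> dict:
    """Compare pass/fail labels from a model judge and human reviewers on the same outputs."""
    if len(judge) != len(human):
        raise ValueError("label lists must cover the same outputs")
    if not judge:
        raise ValueError("no labels to compare")
    pairs = list(zip(judge, human))
    return {
        "agreement": sum(j == h for j, h in pairs) / len(pairs),
        "judge_too_generous": sum(j and not h for j, h in pairs),  # judge passed, human failed
        "judge_too_harsh": sum(h and not j for j, h in pairs),     # judge failed, human passed
    }


# Invented labels for ten sampled outputs.
judge_labels = [True, True, False, True, True, False, True, True, True, False]
human_labels = [True, True, False, False, True, False, True, True, True, True]

print(compare_judge_to_humans(judge_labels, human_labels))
```

The two disagreement counts tell you which direction the checklist is off, which is more useful than the agreement rate alone.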

The mix matters. Lean only on ground truth evals and you will miss every quality and tone problem. Lean only on human review and you will run it once a quarter and learn nothing useful in between. The teams that ship use cheap evals constantly and expensive evals occasionally.

Figure: a pyramid of the three kinds of evals, with many cheap ground truth evals at the base, criteria evals in the middle, and a small layer of human review at the top. Run the cheap evals constantly. Run the expensive human review occasionally.

How to Build Your First Eval Set in an Afternoon

You do not need a platform or a budget to start. You need one feature, a few hours, and a willingness to write things down.

Start by picking the AI feature in your product that worries you the most. The one with the demo that impresses people and the production behavior that nobody trusts. That is your candidate.

Next, collect real inputs. Open your logs, your support history, your past chats, and pull fifteen to twenty real questions or tasks that users actually sent. Real inputs beat invented ones every time, because real users are messier and more creative than you are. Make sure a few of them are the awkward cases. The rude question. The question with a typo. The question that should be refused.

For each input, write the expected behavior in one or two plain sentences. Not the answer. The behavior. What must be present, what must be avoided, what tone is right. This is the slowest step and the most valuable one, because it forces your team to actually agree on what good looks like.

Then write the simplest scorer that works. For the cases with a clear right answer, a few lines of code. For the rest, a short checklist handed to a model judge. Do not over engineer this. A rough scorer today beats a perfect scorer next month.

Run the whole set. Record the number. That number is almost always lower than the team expected, and that is the point. You now have a baseline. From here, every change to the prompt, the data, the model, or the tools gets run against the same set before it ships. The number goes up or it goes down, and you finally know which.
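Put together, a first run can be this small. Treat it as a sketch under assumptions: `fake_assistant` stands in for your real feature, the cases and canned answers are invented, and the scorer only checks required substrings.

```python
from typing import Callable

# Invented cases. "must_contain" lists substrings a good answer needs.
CASES = [
    {"id": "refund-basic", "input": "How do I get a refund?", "must_contain": ["30 day", "/refunds"]},
    {"id": "pro-price", "input": "What is the price of the Pro plan?", "must_contain": ["$49"]},
    {"id": "cancel-typo", "input": "how do i cancle my plan", "must_contain": ["cancel"]},
]

CANNED_ANSWERS = {
    "How do I get a refund?": "Refunds are available within the 30 day window. See /refunds.",
    "What is the price of the Pro plan?": "The Pro plan is $59 per month.",
}


def fake_assistant(question: str) -> str:
    """Stand-in for the real feature. Replace with a call to your assistant."""
    return CANNED_ANSWERS.get(question, "Sorry, I did not understand that.")


def substring_scorer(output: str, case: dict) -> bool:
    """Pass only if every required substring appears in the output."""
    text = output.lower()
    return all(required.lower() in text for required in case["must_contain"])


def run_eval(cases: list[dict], model: Callable[[str], str],
             scorer: Callable[[str, dict], bool]) -> dict:
    """Run every case once and return the per case results plus a score from 0 to 100."""
    results = {case["id"]: scorer(model(case["input"]), case) for case in cases}
    return {"score": 100 * sum(results.values()) / len(results), "results": results}


baseline = run_eval(CASES, fake_assistant, substring_scorer)
print(round(baseline["score"], 1), baseline["results"])
```

The per case results matter as much as the total. They are what lets you see which cases moved when the number changes.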

Figure: a loop of change, eval run, score, and a decision to ship or fix, repeating on every change. Every change runs through the same eval set before it ships. The number tells you which way it moved.

Where Teams Get Evals Wrong

Evals are simple to describe and easy to get wrong. The same mistakes show up again and again.

**Waiting until something breaks.** A bug in production is the worst possible time to start writing tests. By then the pressure is high and the team is guessing. Build a small eval set on day one, even with only ten cases. It will feel like overhead. It is not.

**Testing only the happy path.** It is tempting to fill the eval set with polite, well formed questions, because those score well and feel good. Real users are not polite or well formed. Half of your cases should be the hard ones. The vague request, the trick question, the thing the feature should refuse.

**Chasing a perfect score.** If your eval set scores 100 out of 100, the likelier explanation is that the set is too easy, not that the feature is perfect. A useful eval set always has a few cases the feature still fails. Those failures are your roadmap. When you fix them, add harder ones.

**Building it too large too soon.** A team decides to do evals properly, designs a five hundred case suite, and never finishes it. Twenty good cases you actually run beat five hundred you never wrote. Start tiny. Grow the set every time you find a real failure in production.

**Letting the set go stale.** Your product changes, your users change, and an eval set written six months ago slowly stops reflecting reality. Add new cases from real failures every week. Retire cases that no longer apply. The set is a living thing.

**Trusting the model judge blindly.** A model grading other model output is powerful and not perfect. It has its own biases. Check it against human review now and then. If the judge is too generous or too harsh, fix its checklist before you rely on its scores.

Reading the Score: What Good Looks Like

Once you have a number, the next question is what to do with it. The honest answer is that the absolute number matters less than the direction.

A feature scoring 84 is not automatically ready and a feature scoring 72 is not automatically broken. What you care about is the trend. Is the score climbing as the team works. Did a change that felt like an improvement actually move the number up. Did a model swap that was supposed to be invisible quietly cost you five points.

The most useful thing an eval gives you is a regression alarm. You change a prompt to fix one customer complaint, you run the set, and you see that three other cases now fail. Without the eval that breakage ships and a different customer complains next week. With the eval you catch it in two minutes, before anyone outside the team ever sees it.
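The alarm itself is a diff between two runs. A minimal sketch, assuming each run is stored as a mapping from case id to pass or fail. The case ids and results are invented to mirror the scenario above.

```python
def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Cases that passed before the change and do not pass after it."""
    return sorted(cid for cid, passed in before.items() if passed and not after.get(cid, False))


def find_fixes(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Cases that failed before the change and pass after it."""
    return sorted(cid for cid, passed in after.items() if passed and not before.get(cid, False))


# Invented runs: the prompt change fixes one complaint and breaks three other cases.
before_change = {"late-delivery-complaint": False, "refund-basic": True, "pro-price": True, "cancel-plan": True}
after_change = {"late-delivery-complaint": True, "refund-basic": False, "pro-price": False, "cancel-plan": False}

print("fixed:  ", find_fixes(before_change, after_change))
print("broken: ", find_regressions(before_change, after_change))
```

A case missing from the later run is counted as a regression here. That is a deliberately strict choice, so a silently dropped case cannot hide a failure.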

Set a simple rule the whole team agrees on. The score may not drop below its current baseline without a clear reason. That one rule turns a vague feeling of quality into a line nobody is allowed to cross. It is the same discipline that a passing test suite gives normal software, finally available for AI features.
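That rule fits in a few lines, which is part of why it sticks. This is a sketch, not a prescribed process. `may_ship` is a hypothetical helper, and what counts as a clear reason is for the team to define.

```python
from typing import Optional


def may_ship(new_score: float, baseline: float, reason: Optional[str] = None) -> bool:
    """Allow a change if the score held, or if a drop comes with a written reason."""
    if new_score >= baseline:
        return True
    # A drop is only acceptable when someone wrote down why.
    return bool(reason and reason.strip())


print(may_ship(84, 82))
print(may_ship(79, 82))
print(may_ship(79, 82, reason="added ten harder refusal cases to the set"))
```

Wired into a pre-merge check, a `False` here would block the change until the score recovers or the reason is recorded.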

Figure: a six step checklist for launching an eval habit on one team with one feature in a single week. A one week experiment any team can run without buying a tool.

A Pragmatic Way to Start

You will not get this perfect on the first try, and you do not need to. The goal this week is not a polished evaluation platform. It is one habit, on one feature, that the team actually keeps.

Pick the feature. Pull twenty real inputs. Write down what good looks like for each. Score them with the simplest method that works. Get your baseline number. Then make the rule: every change runs the set first.

After you have done that for one feature, the second one takes half the time, because you already have the scorer, the judge checklist, and the muscle memory. After three or four features you will have a small shared library of eval helpers and a team that argues about numbers instead of opinions.

If your team already works in a [spec driven](/blog/spec-driven-development-replacing-tickets-2026) way, the spec is the natural home for the eval. The spec for an AI feature should say what the feature must do and how it will be measured, in the same document. And if your evals keep failing in ways that trace back to bad or missing information rather than a bad prompt, the real problem may be upstream, in whether [your data is ready for AI](/blog/why-your-data-isnt-ready-for-ai-and-what-to-fix-first) at all.

The Bigger Pattern

Step back and evals are not really about testing. They are about confidence.

The team I described at the start did not lack skill. They had a working feature and a good demo. What they lacked was a way to know, on any given day, whether the thing was good enough to put in front of customers. Without that, every launch decision became a gut call, and gut calls under uncertainty almost always end in another week of delay.

An eval converts that uncertainty into a number. A number can be tracked, defended, and trusted. It lets a team say "we are at 88, we agreed 85 is the bar, we ship." That sentence is the difference between an AI feature that lives in a demo forever and one that reaches real users.

The teams that win the AI race in 2026 will not be the ones with the cleverest prompts or the newest model. Those advantages last weeks. The winners will be the teams that built the habit of measuring their AI, so that every change makes the product provably better instead of merely different.

A demo shows that your AI feature can work. An eval shows that it does work. Only one of those is allowed to ship.
