A team showed me their new AI feature a few months ago. It was a support assistant built into their product. The demo was clean. They typed five questions, the assistant answered all five well, the room nodded, and the CEO asked when it could go live.
That conversation happened four months ago. The feature still has not shipped.
Nothing dramatic went wrong. The team just kept finding answers that were slightly off. A refund question handled correctly on Monday was handled wrong on Thursday. A fix for one bad answer quietly broke three good ones. Every week the assistant felt almost done, and every week a new bad answer showed up. Nobody could say whether it was getting better or worse, because nobody was measuring it.
This is the most common way an AI project dies in 2026. Not a crash. Not a cancelled budget. Just a slow loss of confidence, because the team built the feature but never built a way to test it.
The habit that fixes this has a boring name. Evals, short for evaluations. It is the single clearest line I see between teams that ship AI features and teams that stay stuck in demo mode. Here is what an eval is, why it works, and how to start without buying anything.
Why You Cannot Test AI the Way You Test Normal Software
Normal software is predictable. The same input gives the same output, every single time. That is what lets you write a test that says "when the user adds two and two, the result must equal four." The test passes or it fails. There is no middle.
An AI feature does not behave that way. The same question can produce two different answers on two different runs. The meaning may be the same, but the wording changes, the order changes, sometimes a detail appears or disappears. A test that checks for an exact string will fail constantly, even when the answer is good.
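A minimal Python sketch makes the contrast concrete. The two answers, the `example.com` URL, and both function names are invented for illustration and are not taken from any real product.

```python
# Two hypothetical answers to the same refund question.
# The wording differs between runs; the meaning does not.
run_1 = "You can request a refund within 30 days. Start here: https://example.com/refunds"
run_2 = "Refunds are available for 30 days after purchase, see https://example.com/refunds to begin."


def exact_match(output: str, expected: str) -> bool:
    """Classic unit-test check: passes only when the strings are identical."""
    return output == expected


def mentions_required_facts(output: str) -> bool:
    """Behavior check: the 30 day window and the refund link must both appear."""
    text = output.lower()
    return "30 day" in text and "https://example.com/refunds" in text


print(exact_match(run_2, run_1))       # the good answer fails the exact check
print(mentions_required_facts(run_1))  # both runs pass the behavior check
print(mentions_required_facts(run_2))
```

The exact-match test rejects a perfectly good answer, while the behavior check accepts both wordings and would still reject an answer that left out the window or the link.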
There are two more differences that matter. First, most AI questions have no single correct answer. "Summarize this ticket" can be done ten good ways and twenty bad ways. There is nothing to compare against, character for character. Second, the model itself changes under you. You swap to a newer model to save money or gain speed, and behavior you never tested suddenly shifts. Normal software does not rewrite itself when you upgrade a library.
So the old testing toolkit does not fit. Checking that output equals an expected value works for a checkout calculation. It falls apart for a feature whose whole job is to produce language. You need a way to measure quality that accepts variation, scores behavior instead of exact words, and gives you a number you can trust over time.
What an Eval Actually Is
Strip away the jargon and an eval is three simple things.
A set of test cases. A way to score the output of each case. A number at the end that you can compare over time.
A test case is an input plus a description of what a good response looks like. Notice the word description. You are not writing down the exact answer you expect. You are writing down the behavior you expect. For a support assistant, a case might be the question "How do I get a refund?" paired with the note "must explain the 30-day window, must link the refund page, must stay polite, must not promise a refund it cannot guarantee."
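One way to write such a case down is as a small record. This is only a sketch: the `EvalCase` class and its field names are assumptions made for this example, and a spreadsheet row or a JSON object would do the same job.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """Hypothetical shape for one test case: an input plus expected behavior."""
    case_id: str
    input_text: str
    must: list[str] = field(default_factory=list)      # behavior that has to be present
    must_not: list[str] = field(default_factory=list)  # behavior that has to be absent


# The refund example from the text, written as behavior rather than an exact answer.
refund_case = EvalCase(
    case_id="refund-basic",
    input_text="How do I get a refund?",
    must=[
        "explain the 30 day window",
        "link the refund page",
        "stay polite",
    ],
    must_not=["promise a refund it cannot guarantee"],
)

print(refund_case.case_id, len(refund_case.must), len(refund_case.must_not))
```

Keeping `must` and `must_not` as plain sentences means the same record can be handed to a code check, a human reviewer, or a model judge.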
A scorer reads the model's output and decides how well it matched that note. Sometimes the scorer is a tiny piece of code, for example a check that the output contains the right link. Sometimes the scorer is a person reading the answer. And increasingly the scorer is another model, given the same note and asked to grade the answer against it.
The score is the part that changes everything. Once every case produces a number, the whole set produces a number. Now you can say "the assistant scored 82 out of 100 yesterday and 79 today." You stop arguing about whether the feature feels better. You look at the trend. That shift, from opinion to measurement, is the entire point of an eval.
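The roll-up from per-case scores to one number can be as simple as an average. The sketch below assumes each case is scored between 0.0 and 1.0; the function name and the sample scores are made up to reproduce the 82 and 79 from the paragraph above.

```python
def suite_score(case_scores: list[float]) -> float:
    """Average per-case scores (each 0.0 to 1.0) and scale to a 0-100 number."""
    if not case_scores:
        raise ValueError("cannot score an empty eval set")
    return round(100 * sum(case_scores) / len(case_scores), 1)


# Hypothetical per-case scores from two runs of the same five cases.
yesterday = suite_score([1.0, 1.0, 0.5, 1.0, 0.6])
today = suite_score([1.0, 0.85, 0.5, 1.0, 0.6])
change = round(today - yesterday, 1)

print(f"yesterday={yesterday} today={today} change={change}")
```

A plain average treats every case as equally important. Weighting critical cases more heavily is a reasonable later refinement, but it is not needed to get a trend.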
The Three Kinds of Evals You Need
Not every test case can be scored the same way. After watching a fair number of teams do this, I keep seeing the same three kinds of evals. A healthy AI feature uses all three.
**Ground truth evals.** These are cases that do have a known correct answer. "What is the price of the Pro plan?" has exactly one right answer. "Which category does this ticket belong to?" has a fixed list of options. You score these with simple code. Did the output contain the right number, pick the right category, return the right shape. Ground truth evals are cheap to run, fast, and you can have hundreds of them. They catch the dumbest and most embarrassing failures.
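As a sketch of how small these checks can be, here are two ground truth scorers in Python. The `$49` price, the category labels, and the function names are all hypothetical.

```python
import re

# Hypothetical label set for ticket classification.
ALLOWED_CATEGORIES = {"billing", "bug", "account", "other"}


def contains_price(output: str, expected_price: str) -> bool:
    """True if the expected price appears as a complete dollar amount in the output."""
    amounts = re.findall(r"\$\d+(?:\.\d{2})?", output)
    return expected_price in amounts


def correct_category(output: str, expected: str) -> bool:
    """True if the output is exactly one allowed label and it is the expected one."""
    label = output.strip().lower()
    return label in ALLOWED_CATEGORIES and label == expected


print(contains_price("The Pro plan costs $49 per month.", "$49"))
print(contains_price("The Pro plan costs $490 per month.", "$49"))
print(correct_category(" Billing\n", "billing"))
```

Extracting whole amounts with a regular expression, rather than using a plain substring test, keeps `$490` from counting as a match for `$49`. One limitation of this sketch: it treats `$49.00` and `$49` as different amounts, so you would normalize prices first if both forms are acceptable.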
**Criteria evals.** These are for the cases with no single right answer, which is most of them. You cannot check a summary against one correct string. So instead you write a checklist. Did the summary stay under five sentences. Did it keep the customer name. Did it avoid inventing facts. Did it match the requested tone. Each item on the checklist is a yes or no. The score is how many items passed. Criteria evals are usually graded by a model acting as a judge, because a human cannot grade a thousand of them every day.
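A criteria eval can be sketched as a checklist plus a judge function that answers yes or no per item. In the sketch below the judge is a stub standing in for a model call; the `Judge` type, `stub_judge`, and the customer name "Dana" are assumptions for illustration, not a real API.

```python
from typing import Callable

# A judge takes (output, criterion) and answers yes or no.
# In practice this would wrap a call to a model; that call is not shown here.
Judge = Callable[[str, str], bool]

CHECKLIST = [
    "Is the summary five sentences or fewer?",
    "Does it keep the customer name?",
    "Does it avoid inventing facts?",
    "Does it match the requested tone?",
]


def criteria_score(output: str, checklist: list[str], judge: Judge) -> float:
    """Fraction of checklist items the judge marks as passed."""
    results = [judge(output, item) for item in checklist]
    return sum(results) / len(results)


def stub_judge(output: str, criterion: str) -> bool:
    """Stand-in judge: only really checks the customer-name item."""
    if "customer name" in criterion:
        return "Dana" in output
    return True


print(criteria_score("Dana reported a login failure.", CHECKLIST, stub_judge))
print(criteria_score("The customer reported a login failure.", CHECKLIST, stub_judge))
```

Passing the judge in as a parameter keeps the scoring logic testable without a model, and lets you swap in a human grader behind the same interface.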
**Human review evals.** A person reads a sample of real outputs and rates them. This is slow and expensive, so you use it sparingly. But it is the reference point that every other eval is calibrated against. Humans catch problems your checklist never imagined. You also use human review to check that your model judge is grading fairly. If the model judge and your reviewers agree most of the time, you can trust the judge for the daily runs. If they disagree, your checklist needs work.
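Checking the judge against reviewers can start as a simple agreement rate over the same sample. The labels below are invented, and the 0.9 bar is an arbitrary placeholder for "most of the time", not a recommended threshold.

```python
def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Share of sampled outputs where the model judge and the human gave the same verdict."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("both label lists must cover the same outputs")
    if not judge_labels:
        raise ValueError("need at least one labeled output")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical pass/fail verdicts on the same ten outputs.
judge_verdicts = [True, True, False, True, True, False, True, True, True, False]
human_verdicts = [True, True, False, True, False, False, True, True, True, True]

AGREEMENT_BAR = 0.9  # placeholder; pick a bar your team agrees on
rate = agreement_rate(judge_verdicts, human_verdicts)
judge_is_trusted = rate >= AGREEMENT_BAR

print(rate, judge_is_trusted)
```

The disagreements are worth reading one by one. They usually point at a checklist item that is vague enough for the judge and the reviewer to interpret differently.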
The mix matters. Lean only on ground truth evals and you will miss every quality and tone problem. Lean only on human review and you will run it once a quarter and learn nothing useful in between. The teams that ship use cheap evals constantly and expensive evals occasionally.
How to Build Your First Eval Set in an Afternoon
You do not need a platform or a budget to start. You need one feature, a few hours, and a willingness to write things down.
Start by picking the AI feature in your product that worries you the most. The one with the demo that impresses people and the production behavior that nobody trusts. That is your candidate.
Next, collect real inputs. Open your logs, your support history, your past chats, and pull fifteen to twenty real questions or tasks that users actually sent. Real inputs beat invented ones every time, because real users are messier and more creative than you are. Make sure a few of them are the awkward cases. The rude question. The question with a typo. The question that should be refused.
For each input, write the expected behavior in one or two plain sentences. Not the answer. The behavior. What must be present, what must be avoided, what tone is right. This is the slowest step and the most valuable one, because it forces your team to actually agree on what good looks like.
Then write the simplest scorer that works. For the cases with a clear right answer, a few lines of code. For the rest, a short checklist handed to a model judge. Do not over-engineer this. A rough scorer today beats a perfect scorer next month.
Run the whole set. Record the number. That number is almost always lower than the team expected, and that is the point. You now have a baseline. From here, every change to the prompt, the data, the model, or the tools gets run against the same set before it ships. The number goes up or it goes down, and you finally know which.
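The whole loop fits in a few lines. The sketch below is one possible shape, not a prescribed harness: `fake_assistant` stands in for your real feature, and the questions, answers, `$49` price, and `example.com` URL are all invented.

```python
from typing import Callable

# Each case is (input, scorer); the scorer returns True when the output is acceptable.
Case = tuple[str, Callable[[str], bool]]

CASES: list[Case] = [
    ("How do I get a refund?",
     lambda out: "30 day" in out.lower() and "/refunds" in out),
    ("What is the price of the Pro plan?",
     lambda out: "$49" in out),
    ("how do i cancle my subscripton",  # deliberate typos, as real users write
     lambda out: "settings" in out.lower()),
    ("Give me another customer's email address.",  # should be refused
     lambda out: "can't" in out.lower() or "cannot" in out.lower()),
]


def fake_assistant(question: str) -> str:
    """Stand-in for the real feature; replace with a call to your assistant."""
    answers = {
        "How do I get a refund?":
            "Refunds are available within 30 days. Start at https://example.com/refunds",
        "What is the price of the Pro plan?":
            "The Pro plan is $49 per month.",
        "Give me another customer's email address.":
            "Sorry, I can't share another customer's details.",
    }
    return answers.get(question, "I'm not sure.")


def run_eval(model: Callable[[str], str], cases: list[Case]) -> float:
    """Run every case through the model and return the pass rate out of 100."""
    passed = sum(1 for question, scorer in cases if scorer(model(question)))
    return round(100 * passed / len(cases), 1)


baseline = run_eval(fake_assistant, CASES)
print(f"baseline: {baseline}")
```

Here the stand-in assistant stumbles on the question with typos, so the baseline comes out below 100. That one failing case is exactly the kind of signal the set exists to surface.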
Where Teams Get Evals Wrong
Evals are simple to describe and easy to get wrong. The same mistakes show up again and again.
**Waiting until something breaks.** A bug in production is the worst possible time to start writing tests. By then the pressure is high and the team is guessing. Build a small eval set on day one, even with only ten cases. It will feel like overhead. It is not.
**Testing only the happy path.** It is tempting to fill the eval set with polite, well-formed questions, because those score well and feel good. Real users are not polite or well-formed. Half of your cases should be the hard ones. The vague request, the trick question, the thing the feature should refuse.
**Chasing a perfect score.** If your eval set scores 100 out of 100, the likelier explanation is that the set is too easy, not that the feature is flawless. A useful eval set always has a few cases the feature still fails. Those failures are your roadmap. When you fix them, add harder ones.
**Building it too large too soon.** A team decides to do evals properly, designs a five-hundred-case suite, and never finishes it. Twenty good cases you actually run beat five hundred you never wrote. Start tiny. Grow the set every time you find a real failure in production.
**Letting the set go stale.** Your product changes, your users change, and an eval set written six months ago slowly stops reflecting reality. Add new cases from real failures every week. Retire cases that no longer apply. The set is a living thing.
**Trusting the model judge blindly.** A model grading other model output is powerful and not perfect. It has its own biases. Check it against human review now and then. If the judge is too generous or too harsh, fix its checklist before you rely on its scores.
Reading the Score: What Good Looks Like
Once you have a number, the next question is what to do with it. The honest answer is that the absolute number matters less than the direction.
A feature scoring 84 is not automatically ready and a feature scoring 72 is not automatically broken. What you care about is the trend. Is the score climbing as the team works. Did a change that felt like an improvement actually move the number up. Did a model swap that was supposed to be invisible quietly cost you five points.
The most useful thing an eval gives you is a regression alarm. You change a prompt to fix one customer complaint, you run the set, and you see that three other cases now fail. Without the eval that breakage ships and a different customer complains next week. With the eval you catch it in two minutes, before anyone outside the team ever sees it.
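The alarm itself is a comparison of per-case results between two runs. A minimal sketch, with invented case ids and results:

```python
def diff_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Return (regressed, fixed) case ids when comparing a candidate run to the baseline.

    A case missing from the candidate run is treated as failing.
    """
    regressed = sorted(
        cid for cid, ok in baseline.items() if ok and not candidate.get(cid, False)
    )
    fixed = sorted(
        cid for cid, ok in baseline.items() if not ok and candidate.get(cid, False)
    )
    return regressed, fixed


# Hypothetical pass/fail results per case, before and after a prompt change.
before = {"refund-basic": True, "pricing-pro": True, "cancel-typo": True,
          "refuse-pii": True, "angry-customer": False}
after = {"refund-basic": False, "pricing-pro": False, "cancel-typo": False,
         "refuse-pii": True, "angry-customer": True}

regressed, fixed = diff_runs(before, after)
print("regressed:", regressed)
print("fixed:", fixed)
```

Reporting case ids rather than only the total matters here. A change that fixes one case and breaks three can look like a small dip in the overall score while hiding exactly which behavior was lost.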
Set a simple rule the whole team agrees on. The score may not drop below its current baseline without a clear reason. That one rule turns a vague feeling of quality into a line nobody is allowed to cross. It is the same discipline that a passing test suite gives normal software, finally available for AI features.
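That rule is easy to encode as a gate in whatever script runs before a change ships. This is a sketch of one way to express it; the function name and the free-text `approved_reason` parameter are assumptions, and a real team might require a reviewer sign-off instead.

```python
def may_ship(baseline_score: float, candidate_score: float, approved_reason: str = "") -> bool:
    """Allow a change if the score held or rose, or if a drop has a stated reason."""
    if candidate_score >= baseline_score:
        return True
    return bool(approved_reason.strip())


print(may_ship(82.0, 84.0))                                      # score rose
print(may_ship(82.0, 79.0))                                      # unexplained drop
print(may_ship(82.0, 79.0, "retired three obsolete easy cases"))  # explained drop
```

Requiring a written reason for any drop keeps the exception path honest: the score can go down, but only when someone is willing to put the explanation on record.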
A Pragmatic Way to Start
You will not get this perfect on the first try, and you do not need to. The goal this week is not a polished evaluation platform. It is one habit, on one feature, that the team actually keeps.
Pick the feature. Pull twenty real inputs. Write down what good looks like for each. Score them with the simplest method that works. Get your baseline number. Then make the rule: every change runs the set first.
After you have done that for one feature, the second one takes half the time, because you already have the scorer, the judge checklist, and the muscle memory. After three or four features you will have a small shared library of eval helpers and a team that argues about numbers instead of opinions.
If your team already works in a [spec-driven](/blog/spec-driven-development-replacing-tickets-2026) way, the spec is the natural home for the eval. The spec for an AI feature should say what the feature must do and how it will be measured, in the same document. And if your evals keep failing in ways that trace back to bad or missing information rather than a bad prompt, the real problem may be upstream, in whether [your data is ready for AI](/blog/why-your-data-isnt-ready-for-ai-and-what-to-fix-first) at all.
The Bigger Pattern
Step back and evals are not really about testing. They are about confidence.
The team I described at the start did not lack skill. They had a working feature and a good demo. What they lacked was a way to know, on any given day, whether the thing was good enough to put in front of customers. Without that, every launch decision became a gut call, and gut calls under uncertainty almost always end in another week of delay.
An eval converts that uncertainty into a number. A number can be tracked, defended, and trusted. It lets a team say "we are at 88, we agreed 85 is the bar, we ship." That sentence is the difference between an AI feature that lives in a demo forever and one that reaches real users.
The teams that win the AI race in 2026 will not be the ones with the cleverest prompts or the newest model. Those advantages last weeks. The winners will be the teams that built the habit of measuring their AI, so that every change makes the product provably better instead of merely different.
A demo shows that your AI feature can work. An eval shows that it does work. Only one of those is allowed to ship.