A founder I spoke with last week sent a message at midnight. Their model bill for April had landed at thirty seven thousand dollars. The same bill in January was four thousand. Traffic had grown, but not by ten times. Nothing about the product looked broken. Users were happy. The dashboards were green. Yet the meter had been spinning faster every week, and nobody on the team had noticed until accounting asked a polite but pointed question.
This is the new shape of a familiar problem. A few years ago, every growing software company learned the hard way that cloud bills can quietly run away from you. In 2026, the same lesson is being relearned, this time with tokens. Every AI feature in your product has a meter. Every chat reply, every summary, every smart suggestion, every search rerank costs real money. If nobody is watching the meter, the meter wins.
This post explains why token costs grow the way they do, where the money quietly leaks, and how to keep your AI feature profitable without making it feel cheaper to your users. It is written for founders, product leaders, and engineering managers, not for researchers. No deep machine learning required.
Why Tokens Are the New Cloud Bill
Cloud costs blindsided a generation of software teams because the unit was small and the multiplier was huge. You pay per gigabyte, per request, per second of compute. Each line item looks tiny. The total at the end of the month is anything but.
Token costs work the same way, only faster. A token is a small unit of text the model reads or writes. A short word might be one token. A longer word might be three. Every prompt, every reply, every piece of context you feed the model is measured in tokens, and you pay for both the reading and the writing.
The numbers feel harmless when you see them on a price page. A few cents per thousand tokens for a fast model. A bit more for a smarter one. The trouble is that real products do not send a thousand tokens here and there. They send hundreds of thousands, several times per user, all day, every day. A single chat message in a serious AI feature can move ten thousand tokens by the time the system prompt, the chat history, the retrieved documents, and the long reply are all counted.
Multiply that by a few thousand active users, then again by a few weeks of growth, and you have your founder calling at midnight.
Where the Money Quietly Goes
When teams finally open up their AI bill and look inside, the same five sources of waste show up again and again. None of them feel reckless on their own. Together, they add up.
Oversized System Prompts
The system prompt is the long block of instructions you send with every request to keep the model on rails. Tone of voice, rules, formatting, examples. It is easy to keep adding to it. Each new edge case adds another paragraph. Six months later the system prompt is two thousand tokens long, and you pay for all of it on every single request, even when the user just typed "hi."
Long Conversation History
Most chat features replay the whole conversation back to the model on every turn so the model can stay in context. If a user has been chatting for thirty messages, you are sending all thirty back every time. The cost of message thirty one is not the cost of one message. It is the cost of all thirty one.
Unbounded Retrieval
Retrieval augmented generation, or RAG, is the standard way to feed your own data into a model. The pattern is to find the most relevant chunks of text and stuff them into the prompt. Done well, you send a few hundred tokens of high quality context. Done lazily, you send fifty chunks of one thousand tokens each just to be safe. That single design choice can multiply your bill by ten.
Calling the Big Model for Every Task
It is tempting to point every feature at the smartest, most expensive model. The reasoning is that quality matters and the price difference looks small. The price difference is not small at scale. Many of the tasks in your product, like classification, extraction, or short rewrites, work just as well on a model that costs one tenth as much. Sending all of them to the flagship model is paying first class fare for a bus ride.
Retries and Fallbacks Nobody Counts
When a request fails or gets a bad answer, your system retries. When it times out, it retries again. When you ask the model to format something and it formats it wrong, you ask again. Each retry is a new bill. Healthy systems retry sometimes. Unhealthy ones retry on every other request, and the retries are completely invisible on the user side.
The simplest test is this. If your monthly model bill grew faster than your user base, the difference is waste, not growth.
The Hidden Multipliers People Miss
Beyond those five basics, three patterns push token costs up in ways founders rarely see coming.
**Agents that talk to themselves.** An AI agent often plans a task, checks its work, calls a tool, looks at the result, and tries again. One user request becomes ten model calls behind the scenes. The user pays a button press. You pay for ten round trips with full context each time.
**Streaming features that feel free.** Live suggestions while typing. Auto summaries as the user scrolls. Background classification of every uploaded file. Each one feels small because it is invisible. None of them ask for permission. They just run, and they bill the same as anything a user typed by hand.
**Context that grows with the customer.** A new customer signs up with twenty documents in their account. A year later they have twenty thousand. Your retrieval step still runs the same way, only now it has more to read, more to rank, and more chances to send extra context to the model. The bill grows even when the customer count stays flat.
A Simple Way to Budget AI Into a Product
You do not need a finance team or a fancy tool to get started. You need three numbers and a habit.
**Cost per user, per month.** Take last month's model bill and divide it by your active users. That is your blended unit cost. Track it every month. The day it starts climbing faster than your prices is the day to act.
**Cost per feature.** Tag every model call with the feature it belongs to. The chat box. The summary panel. The smart search. After a week you will see which features carry the bill. In almost every product I have looked at, eighty percent of the cost comes from one or two features the team did not expect.
**Cost per plan.** Compare your token spend per user to the price of the plan they are on. If a free user costs you four dollars in tokens and brings in zero, that is a product decision, not an accident. Many companies are still finding out that their free tier loses money on day one.
Once you have these three numbers, every product change has a price tag. Adding a new AI feature is no longer a free choice. Removing one is no longer a guess.
Practical Patterns That Cut the Bill
Real teams cut token spending by twenty to seventy percent without making the product worse. The wins come from a small set of patterns repeated everywhere.
Use a Smaller Model First
Pick the cheapest model that gives an acceptable answer for the task. For most classification, extraction, and short rewrite work, that is a small fast model, not the flagship. Send the hard cases to the bigger model only when the small one fails or is unsure. This single change often saves the most money for the least effort.
Cache Anything That Repeats
Most model providers now offer prompt caching for the parts of your input that stay the same across requests, like system prompts and reference documents. Turning it on takes hours and pays back in days. Some teams cut input costs in half this way without changing a single user feature.
Outside the model, cache your own answers too. If a hundred users ask the same question about your help docs, the model should answer it once.
Trim the System Prompt Hard
Read your system prompt as if you were paying for it word by word, because you are. Remove old examples that are no longer needed. Move rules into code where they belong. Replace long policy text with a link to a tool the model can call only when needed.
Summarise and Compress History
For long chats, replace the older messages with a shorter summary instead of replaying every word. Most users only need the last few turns in full detail and a sense of what came before.
Set Hard Limits
Cap the length of replies. Cap the number of retrieved chunks. Cap the number of tool calls an agent can make on one task. Hard limits feel rude in design but they save real money and they catch the worst runaway behaviour before it shows up on a bill.
Watch the Meter in Real Time
Add token spend to the same dashboard as your other operational metrics. If a deploy doubles your average tokens per request, you should know that day, not at the end of the month. A simple alert on cost per user is worth more than a long report.
When to Stop Optimising and Rebuild
Sometimes the bill is not the symptom of a few bad habits. It is the symptom of a design that was right for a prototype and wrong for a real product. The signals are clear once you know to look.
You are paying more than half of a customer's revenue back to model providers. The unit economics will never work, no matter how clever the next prompt is.
You depend on a single provider's flagship model and a price change of twenty percent would put you out of business. That is a contract you do not actually have, dressed up as a feature.
Your team is afraid to ship anything new because every change risks a cost spike nobody can predict. Velocity is dropping. The AI feature that once felt like an advantage is now a tax on everything else.
When you reach any of these points, the right move is not another round of tweaks. It is a small rebuild. Move some tasks off the model entirely. Add a caching layer. Mix in a cheaper provider for the easy work. Sometimes even rethink whether a feature needs an LLM at all, or whether a much smaller piece of logic would do.
The good news is that the work you have already done is not wasted. The user research, the prompts you have tuned, the data you have gathered. All of it carries forward into a leaner version of the same product.
A Sensible Way to Think About All of This
AI features are not free, and they were never going to be. The early days of every new technology look like magic because someone else is paying for the magic to be cheap. That window is closing for AI. Providers are shifting from growth pricing to real pricing. Investors are shifting from "build something cool" to "show me the margin."
The teams that come out of this period strongest are not the ones using the smallest model or the largest. They are the ones who treat tokens the way good engineers eventually learned to treat cloud resources. Measured. Budgeted. Tied to a feature, a user, and a price. Not feared, just understood.
If your AI bill is starting to look more like a problem than a strategy, the kindest thing you can do for your future self is to look inside it now, while the changes are still small. The fix is rarely a single clever trick. It is a habit of paying attention, and a few good patterns repeated everywhere. The product you save will be your own.