The Hidden Cost of GenAI Instability: Why “Sometimes Right” Is Still a Failure

A GenAI feature that’s right “most of the time” feels fine in a demo. Then it hits a real workflow.
A support agent asks the same refund question twice – two different answers. A compliance analyst reruns a risk summary – same facts, new tone, new conclusion. An engineer asks for a fix plan – step 3 changes on the second run. And suddenly the team isn’t moving faster. They’re doing the same work twice, plus the cleanup.
Here’s the thing: enterprise work runs on repeatability. Not vibes. If the output can’t be trusted to stay steady, “sometimes right” becomes a failure mode, not a success story.
So what do we mean by “instability,” really?
It’s not just “a few mistakes.” It’s variance that shows up in places where variance breaks the job.
- same intent → different answer
- tiny wording change → big shift in recommendation
- correct facts mixed with invented ones
- confident tone with low truth
- outputs that can’t be reproduced later (the worst kind, because you can’t debug it)
And yes, hallucination is part of it. But instability is broader. It’s the system behaving like a slot machine inside a process that expects a calculator.
The numbers back this up. Across enterprise-style tasks, measured hallucination rates can be uncomfortably high – one study reported rates in the ~58–88% range on legal and business factual queries, and finance and compliance-style tasks show material error rates as well. Healthcare citation and clinical guidance tasks also show wide spreads by model and setup.
Even when the model isn’t hallucinating, it can still wobble. Self-consistency measures (run the same prompt many times and see how often it repeats the same answer) often land around 60–85% on straightforward prompts. That means 15–40% of runs differ. And in repeated-run reasoning tests, 15–35% of questions can flip answers across runs.
If you’re building a workflow, that’s not a rounding error. That’s operational noise.
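That repeated-run check is easy to operationalize. A minimal sketch of the measurement itself (the model call is left out; you'd feed in the answers from N runs of the same prompt):

```python
from collections import Counter

def self_consistency(answers):
    """Fraction of runs that agree with the most common answer.
    1.0 means every run matched; 0.7 means 30% of runs flipped."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

# Example: 10 repeated runs of the same refund-policy question
runs = ["30 days"] * 7 + ["14 days"] * 2 + ["contact support"]
print(f"self-consistency: {self_consistency(runs):.0%}")  # self-consistency: 70%
```

Exact-string matching is the crudest version; in practice you'd normalize answers or compare the decision they imply, but the shape of the metric is the same.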
The hidden cost map (aka: where the budget quietly bleeds)
Instability doesn’t show up as a clean line item called “LLM variance.” It hides in normal work:
1) Rework loops
Someone has to verify. Then rewrite. Then rerun. Then compare.
And the worst part? People stop trusting the first output, even when it’s correct. So every task becomes a “trust but verify” ritual.
Surveys and syntheses rarely isolate “review time” as its own line item, but productivity research that adjusts for quality makes the point: gains aren’t only about speed; they depend on cutting correction cycles too.
2) Escalations
Instability pushes work up the chain.
Support is a clean example because escalation has a real cost curve. Average ticket costs vary widely by channel and tier, but ranges like $5–$60 per ticket show up across industry summaries, and moving from L1 to L2 is often 2–3×, with L3 or engineering escalations commonly 3–5×.
So a “small” instability rate that triggers even a modest bump in escalations can erase the savings you thought you were getting from automation.
3) Process breakdowns
Some workflows simply don’t tolerate wobble:
- refunds and policy decisions
- KYC/AML cues
- incident response steps
- contract clause extraction
- safety or compliance routing
If the system can’t repeat itself, you can’t build dependable automation around it.
4) Trust tax (the slowest killer)
When teams feel burned, they shrink usage:
“use it only for drafts,” “only for internal,” “only when we have time to check.”
This is how GenAI tools end up parked inside tasks instead of core workflows.
And yes, there are real-world examples of customer-facing bots inventing policy and causing direct legal and reputational fallout (Air Canada is the classic headline case). There are also SaaS support bot incidents where a hallucinated policy went viral and triggered cancellations. Klarna’s public back-and-forth on automation in support is another cautionary tale.
5) Risk exposure
In regulated spaces, instability isn’t “oops.” It’s a liability. Legal hallucinations (fake cases, fake citations) have already led to sanctions and disciplinary action.
Why it happens (and why “just prompt it better” doesn’t hold)
You can’t fix instability with clever wording alone because a lot of it comes from system behavior:
- randomness in decoding (even when you think you “locked it down”)
- model updates and routing changes behind APIs
- retrieval drift (different documents retrieved on different runs)
- messy context: long threads, half policies, outdated docs
- unclear acceptance criteria (nobody defined what “correct” means)
Also: humans add fuel. Automation bias is real. In studies across domains, overreliance on automated advice can raise error rates by ~15–30%, mainly because people skip checks when the system looks confident.
That’s not a “user training issue.” It’s a design and governance issue.
What “stable performance” actually looks like
Stability doesn’t mean the same wording every time. It means the same decision class every time.
A stable GenAI system usually has:
- consistent decisions (what action it recommends doesn’t swing)
- traceability (you can explain which sources or rules drove the output)
- controlled variance (tone can vary, facts can’t)
- honest uncertainty (it asks a question, or routes to a human, instead of guessing)
The easiest mental model is: deterministic core, generative edge.
Rules, numbers, eligibility, actions, and compliance stay structured. Language sits around that as a wrapper.
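One way to picture that split, as a sketch (the refund rule and function names are invented for illustration, not any specific product's logic): the decision comes from a deterministic function, and the language layer can only phrase a decision that has already been made.

```python
from dataclasses import dataclass

@dataclass
class RefundDecision:
    eligible: bool
    reason: str

def decide_refund(days_since_purchase: int, item_opened: bool) -> RefundDecision:
    """Deterministic core: same inputs always produce the same decision.
    (Illustrative policy: 30-day window, unopened items only.)"""
    if days_since_purchase > 30:
        return RefundDecision(False, "outside 30-day window")
    if item_opened:
        return RefundDecision(False, "item already opened")
    return RefundDecision(True, "within policy")

def render_reply(decision: RefundDecision) -> str:
    """Generative edge: in production this would call an LLM to phrase the
    message -- but it can only word the decision, never change it."""
    verdict = "approved" if decision.eligible else "declined"
    return f"Your refund request is {verdict} ({decision.reason})."

print(render_reply(decide_refund(days_since_purchase=10, item_opened=False)))
```

Tone can wobble in `render_reply` all it wants; the eligibility outcome is pinned down by `decide_refund`.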
And you measure it like you measure any serious service: with reliability signals, not gut feel. That includes accuracy, groundedness/faithfulness, refusal correctness, calibration, and consistency over repeated runs—plus classic SRE signals like latency and error rates. Teams are already starting to express these as SLO-style targets (example patterns include “hallucinated policy facts < X% weekly” and “grounded answers > Y%”).
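Those SLO-style targets can be checked like any other service metric. A hedged sketch (the thresholds and field names here are made up for illustration; real eval batches would carry richer labels):

```python
def check_slos(eval_runs, max_hallucination_rate=0.02, min_grounded_rate=0.95):
    """eval_runs: list of dicts like {"hallucinated": bool, "grounded": bool},
    one per scored answer from a periodic eval batch (illustrative schema)."""
    n = len(eval_runs)
    halluc_rate = sum(r["hallucinated"] for r in eval_runs) / n
    grounded_rate = sum(r["grounded"] for r in eval_runs) / n
    return {
        "hallucination_rate": halluc_rate,
        "grounded_rate": grounded_rate,
        "slo_met": halluc_rate <= max_hallucination_rate
                   and grounded_rate >= min_grounded_rate,
    }

# 3 bad answers out of 100 blows a 2% hallucination budget
batch = ([{"hallucinated": False, "grounded": True}] * 97
         + [{"hallucinated": True, "grounded": False}] * 3)
print(check_slos(batch))
```

The point isn't the specific numbers; it's that “stable enough” becomes a pass/fail signal a dashboard can alert on, instead of a vibe.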
How to get there without slowing everything down
You don’t need a massive research lab. You need engineering discipline in the right spots.
- Write acceptance criteria like it’s a feature spec
What counts as correct? What’s a critical error vs a harmless phrasing change?
- Test for repeatability, not just “one good run”
Single-run scoring lies. Repeat the same prompts multiple times and measure variance.
- Ground facts, then generate around them
Retrieval-based setups often improve factual quality and reduce unsupported claims. Some comparisons show meaningful jumps in accuracy and faithfulness, plus large reductions in fake citations when grounded answers are required.
- Use structured outputs where it matters
If the output feeds a workflow, don’t accept free-form text. JSON schemas, constrained decoding, and tool/function calling reduce failure rates sharply. Studies and tool evaluations report malformed/invalid structured outputs dropping from double digits (like ~12–18% in some settings) down to low single digits (~0–2%) under constraints, with reliability climbing toward 98–99%.
- Treat prompts like code
Version them in Git. Review changes. Run regression tests. Roll back fast. Tools like LaunchDarkly-style flags and LLM eval platforms exist for this, but the core idea is simple: if you can’t reproduce a change, you can’t control it.
- Monitor the right incidents
If a high-impact outage can cost anywhere from hundreds of thousands to millions per incident (and some reports cite multi-million per hour ranges in high-impact cases), you don’t want AI instability adding minutes to response time because the “assistant” keeps changing its story.
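The structured-output point above deserves one concrete sketch. Using only the standard library (the schema fields and allowed actions are invented for illustration; in practice you'd likely reach for `jsonschema` or Pydantic):

```python
import json

# Illustrative contract for a refund-workflow output
REQUIRED = {"action": str, "refund_amount": (int, float), "reason": str}
ALLOWED_ACTIONS = {"approve", "deny", "escalate"}

def parse_model_output(raw: str):
    """Reject anything that isn't valid, schema-conforming JSON.
    Returns the parsed dict, or None to route to retry / human review."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    for field, ftype in REQUIRED.items():
        if not isinstance(data.get(field), ftype):
            return None
    if data["action"] not in ALLOWED_ACTIONS:
        return None
    return data

good = '{"action": "approve", "refund_amount": 42.5, "reason": "within policy"}'
bad = "Sure! I think a refund of $42.50 makes sense here."
print(parse_model_output(good) is not None, parse_model_output(bad) is None)  # True True
```

Free-form text either conforms to the contract or gets bounced before it touches the workflow; that single gate is where most of the reported reliability gains come from.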
The takeaway
GenAI that’s “sometimes right” is like a flaky test suite. It doesn’t matter that it passes sometimes. You still can’t ship with confidence.
Speed matters, sure. But stable output is what turns speed into a real business win. Less rework. Fewer escalations. Fewer ugly surprises in production.
If you’re building GenAI into core workflows, ask one blunt question early: can this system repeat itself when it counts?