The Hidden Cost of GenAI Instability: Why “Sometimes Right” Is Still a Failure

A GenAI feature that’s right “most of the time” feels fine in a demo. Then it hits a real workflow.
A support agent asks the same refund question twice – two different answers. A compliance analyst reruns a risk summary – same facts, new tone, new conclusion. An engineer asks for a fix plan – step 3 changes on the second run. And suddenly the team isn’t moving faster. They’re doing the same work twice, plus the cleanup.
Here’s the thing: enterprise work runs on repeatability. Not vibes. If the output can’t be trusted to stay steady, “sometimes right” becomes a failure mode, not a success story.
So what do we mean by “instability,” really?
It’s not just “a few mistakes.” It’s variance that shows up in places where variance breaks the job.
- same intent → different answer
- tiny wording change → big shift in recommendation
- correct facts mixed with invented ones
- confident tone with low truth
- outputs that can’t be reproduced later (the worst kind, because you can’t debug it)
And yes, hallucination is part of it. But instability is broader. It’s the system behaving like a slot machine inside a process that expects a calculator.
The numbers back this up. Across enterprise-style tasks, measured hallucination rates can be uncomfortably high – one study reported rates in the ~58–88% range on legal and business factual queries, and finance and compliance-style tasks show material error rates as well. Healthcare citation and clinical guidance tasks also show wide spreads by model and setup.
Even when the model isn’t hallucinating, it can still wobble. Self-consistency measures (run the same prompt many times and see how often it repeats the same answer) often land around 60–85% on straightforward prompts. That means 15–40% of runs differ. And in repeated-run reasoning tests, 15–35% of questions can flip answers across runs.
If you’re building a workflow, that’s not a rounding error. That’s operational noise.
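That repeated-run check is easy to operationalize. A minimal sketch of the measurement itself (the model call is left out; you'd feed in the answers from N runs of the same prompt):

```python
from collections import Counter

def self_consistency(answers):
    """Fraction of runs that agree with the most common answer.
    1.0 means every run matched; 0.7 means 30% of runs flipped."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

# Example: 10 repeated runs of the same refund-policy question
runs = ["30 days"] * 7 + ["14 days"] * 2 + ["contact support"]
print(f"self-consistency: {self_consistency(runs):.0%}")  # self-consistency: 70%
```

Exact-string matching is the crudest version; in practice you'd normalize answers or compare the decision they imply, but the shape of the metric is the same.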
The hidden cost map (aka: where the budget quietly bleeds)
Instability doesn’t show up as a clean line item called “LLM variance.” It hides in normal work:
1) Rework loops
Someone has to verify. Then rewrite. Then rerun. Then compare.
And the worst part? People stop trusting the first output, even when it’s correct. So every task becomes a “trust but verify” ritual.
Surveys and syntheses rarely isolate “review time” as its own line item, but productivity research that adjusts for quality makes the point: gains aren’t only about speed; they depend on cutting correction cycles too.
2) Escalations
Instability pushes work up the chain.
Support is a clean example because escalation has a real cost curve. Average ticket costs vary widely by channel and tier, but ranges like $5–$60 per ticket show up across industry summaries, and moving from L1 to L2 is often 2–3×, with L3 or engineering escalations commonly 3–5×.
So a “small” instability rate that triggers even a modest bump in escalations can erase the savings you thought you were getting from automation.
3) Process breakdowns
Some workflows simply don’t tolerate wobble:
- refunds and policy decisions
- KYC/AML cues
- incident response steps
- contract clause extraction
- safety or compliance routing
If the system can’t repeat itself, you can’t build dependable automation around it.
4) Trust tax (the slowest killer)
When teams feel burned, they shrink usage:
“use it only for drafts,” “only for internal,” “only when we have time to check.”
This is how GenAI tools end up parked inside tasks instead of core workflows.
And yes, there are real-world examples of customer-facing bots inventing policy and causing direct legal and reputational fallout (Air Canada is the classic headline case). There are also SaaS support bot incidents where a hallucinated policy went viral and triggered cancellations. Klarna’s public back-and-forth on automation in support is another cautionary tale.
5) Risk exposure
In regulated spaces, instability isn’t “oops.” It’s a liability. Legal hallucinations (fake cases, fake citations) have already led to sanctions and disciplinary action.
Why it happens (and why “just prompt it better” doesn’t hold)
You can’t fix instability with clever wording alone because a lot of it comes from system behavior:
- randomness in decoding (even when you think you “locked it down”)
- model updates and routing changes behind APIs
- retrieval drift (different documents retrieved on different runs)
- messy context: long threads, half policies, outdated docs
- unclear acceptance criteria (nobody defined what “correct” means)
Also: humans add fuel. Automation bias is real. In studies across domains, overreliance on automated advice can raise error rates by ~15–30%, mainly because people skip checks when the system looks confident.
That’s not a “user training issue.” It’s a design and governance issue.
What “stable performance” actually looks like
Stability doesn’t mean the same wording every time. It means the same decision class every time.
A stable GenAI system usually has:
- consistent decisions (what action it recommends doesn’t swing)
- traceability (you can explain which sources or rules drove the output)
- controlled variance (tone can vary, facts can’t)
- honest uncertainty (it asks a question, or routes to a human, instead of guessing)
The easiest mental model is: deterministic core, generative edge.
Rules, numbers, eligibility, actions, and compliance stay structured. Language sits around that as a wrapper.
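One way to picture that split, as a sketch (the refund rule and function names are invented for illustration, not any specific product's logic): the decision comes from a deterministic function, and the language layer can only phrase a decision that has already been made.

```python
from dataclasses import dataclass

@dataclass
class RefundDecision:
    eligible: bool
    reason: str

def decide_refund(days_since_purchase: int, item_opened: bool) -> RefundDecision:
    """Deterministic core: same inputs always produce the same decision.
    (Illustrative policy: 30-day window, unopened items only.)"""
    if days_since_purchase > 30:
        return RefundDecision(False, "outside 30-day window")
    if item_opened:
        return RefundDecision(False, "item already opened")
    return RefundDecision(True, "within policy")

def render_reply(decision: RefundDecision) -> str:
    """Generative edge: in production this would call an LLM to phrase the
    message -- but it can only word the decision, never change it."""
    verdict = "approved" if decision.eligible else "declined"
    return f"Your refund request is {verdict} ({decision.reason})."

print(render_reply(decide_refund(days_since_purchase=10, item_opened=False)))
```

Tone can wobble in `render_reply` all it wants; the eligibility outcome is pinned down by `decide_refund`.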
And you measure it like you measure any serious service: with reliability signals, not gut feel. That includes accuracy, groundedness/faithfulness, refusal correctness, calibration, and consistency over repeated runs—plus classic SRE signals like latency and error rates. Teams are already starting to express these as SLO-style targets (example patterns include “hallucinated policy facts < X% weekly” and “grounded answers > Y%”).
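Those SLO-style targets can be checked like any other service metric. A hedged sketch (the thresholds and field names here are made up for illustration; real eval batches would carry richer labels):

```python
def check_slos(eval_runs, max_hallucination_rate=0.02, min_grounded_rate=0.95):
    """eval_runs: list of dicts like {"hallucinated": bool, "grounded": bool},
    one per scored answer from a periodic eval batch (illustrative schema)."""
    n = len(eval_runs)
    halluc_rate = sum(r["hallucinated"] for r in eval_runs) / n
    grounded_rate = sum(r["grounded"] for r in eval_runs) / n
    return {
        "hallucination_rate": halluc_rate,
        "grounded_rate": grounded_rate,
        "slo_met": halluc_rate <= max_hallucination_rate
                   and grounded_rate >= min_grounded_rate,
    }

# 3 bad answers out of 100 blows a 2% hallucination budget
batch = ([{"hallucinated": False, "grounded": True}] * 97
         + [{"hallucinated": True, "grounded": False}] * 3)
print(check_slos(batch))
```

The point isn't the specific numbers; it's that “stable enough” becomes a pass/fail signal a dashboard can alert on, instead of a vibe.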
How to get there without slowing everything down
You don’t need a massive research lab. You need engineering discipline in the right spots.
- Write acceptance criteria like it’s a feature spec
What counts as correct? What’s a critical error vs a harmless phrasing change?
- Test for repeatability, not just “one good run”
Single-run scoring lies. Repeat the same prompts multiple times and measure variance.
- Ground facts, then generate around them
Retrieval-based setups often improve factual quality and reduce unsupported claims. Some comparisons show meaningful jumps in accuracy and faithfulness, plus large reductions in fake citations when grounded answers are required.
- Use structured outputs where it matters
If the output feeds a workflow, don’t accept free-form text. JSON schemas, constrained decoding, and tool/function calling reduce failure rates sharply. Studies and tool evaluations report malformed/invalid structured outputs dropping from double digits (like ~12–18% in some settings) down to low single digits (~0–2%) under constraints, with reliability climbing toward 98–99%.
- Treat prompts like code
Version them in Git. Review changes. Run regression tests. Roll back fast. Tools like LaunchDarkly-style flags and LLM eval platforms exist for this, but the core idea is simple: if you can’t reproduce a change, you can’t control it.
- Monitor the right incidents
If a high-impact outage can cost anywhere from hundreds of thousands to millions per incident (and some reports cite multi-million per hour ranges in high-impact cases), you don’t want AI instability adding minutes to response time because the “assistant” keeps changing its story.
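The structured-output point above deserves one concrete sketch. Using only the standard library (the schema fields and allowed actions are invented for illustration; in practice you'd likely reach for `jsonschema` or Pydantic):

```python
import json

# Illustrative contract for a refund-workflow output
REQUIRED = {"action": str, "refund_amount": (int, float), "reason": str}
ALLOWED_ACTIONS = {"approve", "deny", "escalate"}

def parse_model_output(raw: str):
    """Reject anything that isn't valid, schema-conforming JSON.
    Returns the parsed dict, or None to route to retry / human review."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    for field, ftype in REQUIRED.items():
        if not isinstance(data.get(field), ftype):
            return None
    if data["action"] not in ALLOWED_ACTIONS:
        return None
    return data

good = '{"action": "approve", "refund_amount": 42.5, "reason": "within policy"}'
bad = "Sure! I think a refund of $42.50 makes sense here."
print(parse_model_output(good) is not None, parse_model_output(bad) is None)  # True True
```

Free-form text either conforms to the contract or gets bounced before it touches the workflow; that single gate is where most of the reported reliability gains come from.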
The takeaway
GenAI that’s “sometimes right” is like a flaky test suite. It doesn’t matter that it passes sometimes. You still can’t ship with confidence.
Speed matters, sure. But stable output is what turns speed into a real business win. Less rework. Fewer escalations. Fewer ugly surprises in production.
If you’re building GenAI into core workflows, ask one blunt question early: can this system repeat itself when it counts?