Why AI Responses Fail in Production: 6 Risks You Must Design For
It always starts the same way.
Someone shares a screenshot in the incident channel.
A sentence that went to a customer. A number that didn’t exist. A citation to a paper that was never written.
And then the question:
“How did this get out?”
In 2025, Deloitte submitted a government report containing fabricated academic citations, nonexistent research papers, and an invented quote from a federal court judge—all generated by AI. A researcher spotted it almost immediately.
It wasn’t an isolated mistake. It happened again, two months later, in a different country.
But the hallucination isn’t the point.
Hallucination is a known property of large language models. It has been formally proven to be an inherent limitation—not a temporary flaw that scale will fix.
The point is that nobody designed for it.
The model did exactly what models do. The system around it had no structure to catch it, verify it, or stop it.
That’s the pattern this post is about.
The risks that exist before your first prompt
I call them Tier-0 Risks.
Not because they’re theoretical.
Because they are so fundamental that if you don’t address them before launch, nothing else you do matters.
Not your model selection. Not your prompt engineering. Not your RAG pipeline.
Tier-0 Risks are not bugs. They’re not edge cases.
They are what happens when you put a probabilistic system into a deterministic business process—and forget that the two operate under completely different rules.
Traditional software either returns the right answer or throws an error.
An LLM does something far more dangerous.
It returns a wrong answer that looks right.
And it does so with absolute confidence.
That changes everything about how you design.
Risk #1 — The Vanishing Request
A user sends a request. The AI system silently drops it.
No response. No error. No trace.
From the user’s perspective, their request simply vanished.
This happens because most teams treat the LLM call like any other HTTP request. But LLM calls are different. They’re slow—seconds, not milliseconds. They’re expensive. They can time out mid-stream.
And if you don’t explicitly track what went in, you have no way to prove anything came out.
The damage isn’t a wrong answer.
The damage is silence.
In a support system, a customer’s complaint disappears. In an approval workflow, a request sits in limbo. The user doesn’t know if the system failed or if they were ignored.
Loss of evidence is worse than loss of accuracy.
What you need is an envelope—a structured record of what was asked, who asked it, when, and with what context. If the processing fails, the envelope still exists. The request is traceable. The failure is auditable.
Without an envelope, you don’t have a failed request.
You have a request that never existed.
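A minimal sketch of what that envelope could look like in Java. The record and its field names are illustrative, not a prescribed schema:

```java
import java.time.Instant;
import java.util.Map;
import java.util.UUID;

// Hypothetical input envelope: persisted BEFORE the LLM call, so the
// request is traceable even if processing fails mid-stream.
public record RequestEnvelope(
        UUID requestId,              // stable identifier for tracing and audit
        String callerId,             // who asked (resolved by your auth layer)
        Instant receivedAt,          // when the request entered the system
        String userMessage,          // what was asked, verbatim
        Map<String, String> context  // channel, locale, retrieval hints, ...
) {
    public static RequestEnvelope of(String callerId, String userMessage,
                                     Map<String, String> context) {
        return new RequestEnvelope(UUID.randomUUID(), callerId,
                Instant.now(), userMessage, context);
    }
}
```

Persist the envelope before the model is ever called. If the call times out or dies mid-stream, the requestId is still there to trace, retry, or escalate.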
Risk #2 — The Confident Wrong Answer
The AI doesn’t say “I’m not sure.”
It generates fake academic papers. It invents court quotes. It attributes research to real professors who never wrote those papers.
And it formats the output perfectly. Proper citations. Clean footnotes. The tone of authority.
Research shows that LLMs produce hallucinated responses on legal queries at rates between 69% and 88%.
Not 5%. Not 10%.
The damage isn’t that the model was wrong.
The damage is that nothing in the system could tell you it was wrong.
An output without evidence binding is not an answer. It is an opinion wearing a suit.
What you need is structure in the output—not just the answer, but what evidence supports it. If the system cites a policy, the specific policy document must be attached. If it makes a recommendation, the reasoning chain must be traceable.
Without that structure, you cannot distinguish a correct response from a hallucination.
And neither can your user.
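Here is one way to make that binding structural in Java. GroundedAnswer and its fields are illustrative; the point is that an answer without evidence never leaves the system:

```java
import java.util.List;

// Illustrative evidence-bound answer: the model's claim is only accepted
// if it arrives with the sources that back it.
public record GroundedAnswer(
        String answer,
        List<Citation> evidence   // must be non-empty to be accepted
) {
    public record Citation(String documentId, String excerpt) {}

    // Structural check: an answer without evidence is rejected, not returned.
    public boolean isGrounded() {
        return evidence != null && !evidence.isEmpty()
                && evidence.stream().noneMatch(c -> c.documentId().isBlank());
    }
}
```

Map the model’s output into a type like this (Spring AI’s structured-output support is one option) and escalate anything that fails isGrounded() instead of returning it.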
Risk #3 — The Inconsistent Experience
Two users ask the same question under the same conditions.
They get different answers.
Not slightly different. Meaningfully different.
This happens because LLMs are non-deterministic. Temperature, context window content, token generation order—all of these produce variation. Most teams set temperature: 0 and assume that solves it.
It doesn’t.
Model updates change behavior. Prompt modifications shift tone. Context window differences alter reasoning. Even at temperature 0, identical inputs do not guarantee identical outputs.
Consistency is the foundation of trust. Break it, and no amount of accuracy recovers it.
A user who gets answer A today and answer B tomorrow stops trusting the system entirely—even if both answers were correct.
In regulated environments, inconsistency isn’t annoying. It’s a compliance violation.
What you need is a golden set—a curated collection of critical question-answer pairs that the system must always get right. Before any deployment. Before any model change. Before any prompt update. The golden set acts as your regression test.
If the system can’t pass it, it doesn’t ship.
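A sketch of what that gate can look like as a JUnit 5 test. The questions, expected substrings, and the askSystem call are placeholders for your own golden set and client:

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.stream.Stream;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

// Golden-set gate: every deployment, model change, or prompt update must pass.
class GoldenSetRegressionTest {

    static Stream<Arguments> goldenSet() {
        return Stream.of(
                Arguments.of("What is the refund window?", "30 days"),
                Arguments.of("Which plan includes SSO?", "Enterprise")
        );
    }

    @ParameterizedTest
    @MethodSource("goldenSet")
    void criticalAnswersStayCorrect(String question, String mustContain) {
        String answer = askSystem(question);  // placeholder call into your system
        assertTrue(answer.contains(mustContain),
                "Golden-set regression: '" + question + "' lost '" + mustContain + "'");
    }

    private String askSystem(String question) {
        // Placeholder: wire this to your AI endpoint.
        throw new UnsupportedOperationException("wire this to your system");
    }
}
```

Assert on key substrings rather than exact matches; wording varies even at temperature 0, but the facts that matter must not.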
Risk #4 — The Misidentified User
In most LLM integrations, the AI receives a string.
Just a string.
No identity. No permission context. No scope boundary.
The prompt doesn’t carry who is asking. It doesn’t carry what they’re allowed to ask about. It doesn’t carry what information they can receive.
This means the AI has no way to distinguish between an admin and a regular user. Between an authorized request and an unauthorized one.
A customer support bot that can access any customer’s data regardless of who’s asking. An internal tool that lets any employee query sensitive HR information.
These aren’t hypothetical scenarios.
They’re the default behavior of most LLM integrations.
The AI doesn’t violate permissions. The system never gave it permissions to violate.
What you need is identity in the input—not authentication (that’s handled before the request ever reaches the AI), but authorization context. The AI must receive the caller’s permission boundary as a structural part of its input.
If it doesn’t know who’s asking, it cannot know what to withhold.
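One way to make that boundary structural, sketched in Java. The field names and scope strings are examples, not a prescribed model:

```java
import java.util.Set;

// Illustrative caller context: carried as a structural part of every AI input,
// so the system (not the model) decides what this caller may receive.
public record CallerContext(
        String userId,
        Set<String> roles,          // e.g. "SUPPORT_AGENT", "HR_ADMIN"
        Set<String> allowedScopes   // e.g. "orders:read:own-customers"
) {
    public boolean canAccess(String requiredScope) {
        return allowedScopes.contains(requiredScope);
    }
}
```

Retrieval and tool calls consult canAccess(...) before anything enters the context window, so data the caller may not see never reaches the model in the first place.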
Risk #5 — The Unauthorized Action
There’s a difference between an AI that gives a wrong answer and an AI that takes a wrong action.
A wrong answer is recoverable. A wrong action may not be.
An AI that approves a $10,000 refund without human sign-off.
An AI that sends a legal notice drafted from hallucinated case law.
An AI that modifies a production database because the tool-calling function had write access.
Each of these is not a product failure.
It’s a governance failure.
As AI systems evolve from “answer questions” to “take actions,” the blast radius of every mistake grows. Yet most teams apply the same control mechanisms to action-taking AI as they do to information-retrieval AI.
An action without an approval boundary is not automation. It is abdication.
What you need is an explicit threshold—a defined line above which the AI cannot act without human confirmation. Below the line, it acts. Above the line, it proposes and waits.
That line isn’t a feature flag.
It’s a contract between the system and the organization.
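A minimal sketch of such a gate in Java. RefundActionGate and its threshold are illustrative; the real line comes from your organization’s policy, not from the prompt:

```java
import java.math.BigDecimal;

// Sketch of an action gate: the threshold is an explicit, configured contract,
// not something the model decides.
public class RefundActionGate {

    private final BigDecimal autoApproveLimit;   // e.g. loaded from configuration

    public RefundActionGate(BigDecimal autoApproveLimit) {
        this.autoApproveLimit = autoApproveLimit;
    }

    public ActionDecision decide(BigDecimal proposedRefund) {
        if (proposedRefund.compareTo(autoApproveLimit) <= 0) {
            return ActionDecision.EXECUTE;          // below the line: act
        }
        return ActionDecision.AWAIT_HUMAN_APPROVAL; // above the line: propose and wait
    }

    public enum ActionDecision { EXECUTE, AWAIT_HUMAN_APPROVAL }
}
```

The limit lives in configuration, is versioned, and is reviewed like any other contract between the system and the organization.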
Risk #6 — The Data Leak
LLMs don’t have a concept of “confidential.”
If sensitive information is in the context window, the model may surface it in any response.
PII from customer records. Internal documents passed as context. System prompts containing business logic. Pricing strategies. Salary data.
All of it is fair game to the model. The model doesn’t leak data maliciously.
It leaks data because nobody told it—structurally—what must never come out.
A chatbot that quotes a customer’s order history to a different customer.
An internal assistant that reveals salary information.
A support bot that outputs its own system prompt because the user asked “what are your instructions?”
Prompt engineering is a request. A constraint contract is enforcement.
“Don’t reveal sensitive information” in a prompt is a suggestion.
A post-generation filter that structurally blocks prohibited content before it reaches the user—that is a constraint.
Without that structural enforcement, your system’s confidentiality depends on the model’s mood.
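A sketch of what structural enforcement can look like. The patterns below are illustrative, not a complete PII taxonomy:

```java
import java.util.List;
import java.util.regex.Pattern;

// Post-generation constraint: prohibited content is blocked in code,
// after the model responds and before the user ever sees it.
public class OutputConstraintFilter {

    private static final List<Pattern> PROHIBITED = List.of(
            Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b"),          // SSN-like numbers
            Pattern.compile("(?i)system prompt|my instructions"),   // prompt disclosure
            Pattern.compile("(?i)salary|compensation band")         // internal HR data
    );

    public String enforce(String modelOutput) {
        for (Pattern p : PROHIBITED) {
            if (p.matcher(modelOutput).find()) {
                // Block instead of forwarding the response.
                return "This response was withheld by policy. A human will follow up.";
            }
        }
        return modelOutput;
    }
}
```

In practice you would log and escalate blocked responses rather than silently replacing them. The point is that the block happens in code, after generation, no matter what the prompt said.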
These risks are not independent
They interact. They amplify.
A vanishing request (#1) combined with a misidentified user (#4)—you lost a request and you don’t even know whose it was.
A confident wrong answer (#2) combined with an unauthorized action (#5)—the AI didn’t just give bad advice. It acted on bad advice. Without permission.
An inconsistent experience (#3) combined with a data leak (#6)—different users see different versions of sensitive information.
This is why prompt engineering cannot solve Tier-0 Risks.
Prompt engineering operates inside the model.
Tier-0 Risks exist outside the model—in the space between your application and the AI component.
You need a structural layer that enforces identity, validates output, gates actions, and ensures consistency—regardless of what the model does.
That layer is called a control plane.
The same concept Kubernetes uses to manage containers, applied to managing AI behavior.
The model is powerful but unreliable.
The control plane makes it safe.
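To make the idea concrete, here is a conceptual sketch of that layer in Java. It reuses the illustrative types from the sketches above; the nested interfaces stand in for your own components. Risks #3 and #5 sit outside this request path: the golden set gates deployments, the action gate sits on tool calls.

```java
import java.util.List;

// Conceptual control plane: every Tier-0 control is a structural step
// around the model call, not a prompt instruction inside it.
public class AiControlPlane {

    interface AuditLog { void record(RequestEnvelope envelope); }
    interface ScopedRetriever { List<String> retrieve(CallerContext caller, String query); }
    interface ModelGateway { GroundedAnswer ask(RequestEnvelope envelope, List<String> context); }
    interface Escalation { String escalate(RequestEnvelope envelope); }

    private final AuditLog auditLog;
    private final ScopedRetriever retriever;
    private final ModelGateway model;
    private final OutputConstraintFilter outputFilter;
    private final Escalation escalation;

    public AiControlPlane(AuditLog auditLog, ScopedRetriever retriever, ModelGateway model,
                          OutputConstraintFilter outputFilter, Escalation escalation) {
        this.auditLog = auditLog;
        this.retriever = retriever;
        this.model = model;
        this.outputFilter = outputFilter;
        this.escalation = escalation;
    }

    public String handle(RequestEnvelope envelope, CallerContext caller) {
        auditLog.record(envelope);                                        // Risk #1: the request exists before the model sees it
        var context = retriever.retrieve(caller, envelope.userMessage()); // Risk #4: scope-filtered context only
        GroundedAnswer answer = model.ask(envelope, context);             // model call (e.g. via Spring AI)
        if (!answer.isGrounded()) {                                       // Risk #2: no evidence, no answer
            return escalation.escalate(envelope);
        }
        return outputFilter.enforce(answer.answer());                     // Risk #6: constraint contract on the way out
    }
}
```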
Where to start
Audit your current integration against these 6 risks.
For each one, ask two questions:
“If this happens in production tomorrow—how do I detect it?”
“How do I recover?”
If the answer to either is “I don’t know,” you have unmitigated Tier-0 exposure.
Prioritize by blast radius. Risk #5 and Risk #2 typically cause the most damage. If your AI can do things, #5 is first. If your AI provides information users act on, #2 is first.
And most importantly:
Design the control layer before optimizing the model.
Most teams spend 90% of their time on prompts and 10% on safety architecture.
Invert that ratio.
A mediocre model inside a well-designed control plane is safer and more reliable than a state-of-the-art model with no guardrails.
Going deeper
This post covers the what and the why.
Upcoming posts cover the how—concrete implementation patterns using Spring AI that make each of these risks structurally impossible.
Next: how to structure AI requests with the Input Envelope pattern, so that Risk #1 and Risk #4 stop being risks and start being solved problems.
That’s what I call AI Architecture.
That’s where structure becomes enforceable.

Enjoyed this article? Take the next step.
Future-Proof Your Java Career With Spring AI
The age of AI is here, but your Java & Spring experience isn’t obsolete—it’s your greatest asset.
This is the definitive guide for enterprise developers to stop being just coders and become the AI Orchestrators of the future.