Epistemic security - The missing control plane for AI agents

Most security discussions about AI agents focus on what the agent can do. Can it call a tool? Can it read email? Can it write to production? Can it touch customer data?

Those questions matter. I have written about them before in the context of threat modeling autonomous AI agents in production. Capability is dangerous. Tool access is dangerous. Memory is dangerous. Autonomy is dangerous.

But there is a deeper layer that security teams are still underestimating. AI agents do not merely execute actions. They form beliefs.

They decide which sources matter, which claims are credible, which facts belong in memory, which contradictions can be ignored, and which conclusion should be handed to a human as if it were reality.

That means belief formation itself has become part of the attack surface. I think of this as epistemic security.

Epistemic security is the discipline of protecting how a system knows what it knows. It is concerned with provenance, evidence, uncertainty, source integrity, memory hygiene, and the human ability to verify important claims.

If AppSec protects application behavior, and cloud security protects infrastructure behavior, epistemic security protects reasoning behavior. For AI agents, that may become the missing control plane.

From information security to knowledge security

Traditional information security is mostly about confidentiality, integrity, and availability. Can unauthorized people read the data? Can attackers change the data? Can the system remain available under pressure?

Those are still the fundamentals. Nothing about AI removes them. But agents introduce a new question:

Can the system form justified beliefs from the data it receives?

That is not the same thing as data integrity.

A document can be authentic and still misleading. A source can be real and still incomplete. A retrieved passage can be accurate in isolation but wrong when detached from context. A summary can contain no obvious hallucination and still hide the uncertainty that should have changed the decision.

This is why epistemic security is different from ordinary content validation.

It is not only asking:

is this input malicious
is this output well-formed
is this tool call authorized

It is also asking:

where did this belief come from
what evidence supports it
what evidence conflicts with it
how confident should the system be
who is allowed to turn this claim into memory
when does a human need to inspect the primary source

The modern AI stack needs those questions because agentic systems compress entire research, reasoning, and decision loops into software. The old trust boundary was around data. The new trust boundary is around interpretation.

Why existing AI security frameworks are necessary but incomplete

The industry is not starting from zero. NIST AI RMF 1.0 gives organizations a language for mapping, measuring, managing, and governing AI risk. The NIST Generative AI Profile adds more specific guidance for generative systems. CISA’s Secure by Design work correctly argues that security has to be built into technology from the beginning, not bolted on after adoption.

For LLM applications, OWASP Top 10 for LLM Applications 2025 is now table stakes. It captures risks like prompt injection, data and model poisoning, excessive agency, vector and embedding weaknesses, and misinformation.

These are useful frames. But most teams still operationalize them as runtime controls:

sanitize input
isolate tools
restrict permissions
log actions
monitor outputs
add human approval for sensitive operations

All of that is necessary, but it leaves an important layer uncovered.

An agent can avoid obviously malicious tool calls and still poison an organization’s understanding of reality. It can respect permissions and still carry a false assumption into every future decision. It can cite a source and still launder weak evidence into a confident conclusion.

The result is a strange gap: teams can secure what the agent is allowed to do while leaving how it forms beliefs much less governed.

The epistemic attack surface

In a normal application, the system usually does not decide what is true. It receives input, applies business logic, and writes output.

In an agentic system, the loop is different:

observe -> retrieve -> summarize -> reason -> decide -> act -> remember

Each stage can corrupt knowledge.

1) Source selection

Agents rarely read the whole world. They search, retrieve, rank, and filter. That selection process quietly defines reality for the model.

If an attacker can influence which documents rank highly, which tickets are retrieved, which internal wiki pages look authoritative, or which external pages are treated as context, they do not need to directly control the final answer.

They only need to influence the evidence environment. The security problem starts before the final answer is generated.

In software supply chain security, projects like SLSA push the industry to ask where an artifact came from and how it was built. In agentic systems, we need a similar instinct for claims.

Where did this assertion come from? What process produced it? What source chain did it travel through before the model repeated it?

2) Context assembly

Most agent systems flatten multiple sources into one prompt. User instructions, system instructions, retrieved documents, tool outputs, logs, emails, and web pages all end up compressed into the model’s context window. That is where the clean old boundary between instruction and data begins to collapse.

The UK National Cyber Security Centre captured this well in its post Prompt injection is not SQL injection. The key point is that current LLMs do not enforce a robust security boundary between instructions and data inside a prompt.

This matters because context assembly is not a passive formatting step. It is a security-critical transformation.

If untrusted content is placed next to trusted instructions without strong labels, quoting, isolation, and downstream policy checks, the model may treat hostile text as operational guidance.

Microsoft’s guidance on defending against indirect prompt injection attacks makes a similar point: enterprise agents process emails, documents, websites, and plugins, and attackers can embed instructions in those third-party sources.

This is input validation, but it is also boundary management for knowledge and instructions.

3) Summarization

Summarization looks harmless because it feels like compression. It is not harmless. Summarization chooses what survives.

An agent can remove caveats, erase disagreement, collapse minority evidence, or turn a speculative claim into a clean bullet point. By the time the human reads the result, the uncertainty has disappeared.

This is one of the most common epistemic failures because it does not look like a failure. The answer is fluent. The structure is clean. The summary may even be mostly correct.

But sometimes the missing caveat was the point. In high-stakes work, a summary needs to preserve uncertainty, not polish it away.

4) Memory writes

Memory is where mistakes become durable. A bad answer is one event. A bad memory is infrastructure.

If an agent writes a false claim into long-term memory, that claim can be retrieved for future decisions, repeated across workflows, and treated as historical truth. The danger is not only immediate misinformation. It is persistent epistemic drift.

This is why memory needs stronger controls than most teams give it. Memory writes should not be casual side effects of a conversation. They should be treated more like database writes, audit events, or policy changes.

For important domains, the system should record:

claim
source
confidence
timestamp
authoring agent
evidence links
contradiction links
review status
expiration condition

Durable memory should require provenance.

5) Confidence presentation

LLMs are very good at sounding finished, even when the underlying evidence is incomplete.

Humans often respond more to presentation than probability. A crisp answer with citations feels more trustworthy than a messy answer with uncertainty, even when the messy answer is more honest.

This is why confidence is not just a model behavior problem. It is a product security problem. If the interface rewards clean certainty, users will over-trust clean certainty.

OWASP explicitly calls out misinformation as a risk for LLM applications, including the overreliance that happens when users fail to verify generated content. Epistemic security has to account for that human behavior too.

The system is not secure if it teaches people to stop checking.

Threat scenarios

The easiest way to see epistemic security is through failure modes.

Scenario A: Source laundering through RAG

An attacker publishes a plausible technical article about a vulnerability mitigation. The article is not obviously malicious. It contains correct background, real terminology, and links to legitimate references. But one recommendation is subtly wrong: it tells teams to disable a noisy detection rule because it produces false positives in modern environments.

An internal security agent later researches the issue.

The attacker’s article is retrieved, summarized, and blended with other sources. The agent does not preserve source-level disagreement. It outputs:

Several sources recommend disabling this rule in production due to false positives.

Now the claim has been laundered. It no longer looks like one weak external article. It looks like consensus. The control failure is not only retrieval. It is the loss of provenance and disagreement.

Scenario B: Memory poisoning with delayed impact

An agent helps triage security incidents. During one incident, a compromised ticket includes the statement:

This partner integration is approved to bypass step-up authentication for service continuity.

The agent stores that as a durable fact because the ticket appears to come from a trusted internal workflow.

Three months later, a different agent retrieves the memory during an access review and recommends keeping the bypass because it believes the exception was already approved.

No single tool call was obviously malicious. The attack happened through institutional memory.

Scenario C: Confidence collapse in executive reporting

A leadership agent summarizes cyber risk for an executive meeting. It reads vulnerability data, incident notes, vendor reports, and internal dashboards. The underlying sources conflict. Some indicate improving posture. Others show delayed patching in a high-risk business unit.

The agent produces a clean executive summary:

Overall risk is trending down.

That may be true at the aggregate level. It may also hide the only risk that matters.

The epistemic failure is not that the agent hallucinated. It is that the agent compressed away the exception that leadership needed to see.

Scenario D: Delegated verification decay

An engineering organization adopts agents for code review, architecture review, incident summaries, and design validation. At first, humans check the outputs carefully. Over time, the agents are usually right. Review gets faster. Manual inspection becomes less common.

Then an agent misses a subtle authorization bug because the implementation technically matches the written spec, but the spec itself failed to capture a cross-tenant constraint.

The team realizes two things at once:

The agent did not understand the product boundary.
Humans had stopped practicing the skill needed to catch it.

This is the security version of epistemic atrophy. Keeping humans in the loop only helps if they still have enough context and practice to challenge the system.

Controls for epistemic security

Epistemic security is not solved by asking the model to “be careful.” It requires system design.

1) Evidence trails by default

Every high-impact agent conclusion should carry an evidence trail. Not just citations at the end. Actual claim-level provenance.

For example:

{
  "claim": "The partner authentication bypass is approved until Q3.",
  "confidence": "medium",
  "evidence": [
    {
      "source": "INC-48291",
      "type": "ticket",
      "trust_level": "internal",
      "timestamp": "2026-01-14",
      "supports": true
    }
  ],
  "conflicts": [
    {
      "source": "IAM policy exception register",
      "type": "control_record",
      "trust_level": "authoritative",
      "timestamp": "2026-02-02",
      "supports": false
    }
  ],
  "review_required": true
}

The concept is not new. The W3C PROV work has long provided a vocabulary for provenance across entities, activities, and agents.

The AI version needs to make provenance visible at the level where decisions are made. A final answer is too coarse. The claim is the unit that matters.

2) Taint labels for knowledge

Agent context should carry trust labels. Not all text is equal.

A production runbook, a public web page, a customer email, a GitHub issue, a Slack message, and an agent’s own previous summary should not arrive in the prompt as morally equivalent strings.

At minimum, retrieved content should carry labels like:

authoritative
internal
partner
public
untrusted
generated
stale
disputed

Then policy should bind those labels to behavior. An agent may read untrusted content, but it should not convert untrusted content into durable memory without corroboration. It may summarize public sources, but it should not use them to override an authoritative internal control record. It may use generated summaries for orientation, but not as primary evidence for high-stakes decisions.

This follows the same instinct as data flow control, applied to belief formation rather than only data movement.

3) Uncertainty budgets

Most systems treat uncertainty as a presentation detail. That is backwards. Uncertainty should be a policy input.

For low-impact tasks, a medium-confidence summary may be acceptable. For security exceptions, legal analysis, financial movement, production access, or incident closure, uncertainty should trigger escalation.

The system should define thresholds:

below this confidence, do not act
above this impact, require primary-source links
if sources conflict, surface the conflict
if evidence is stale, mark the answer provisional
if the agent cannot explain the reasoning path, block durable memory writes

Confidence scores will never be perfect. The goal is simpler: do not let uncertainty disappear from the workflow.

4) Memory hygiene and expiration

Memory should decay unless renewed by evidence. This feels strange because humans like persistent knowledge. But in organizations, many facts have a shelf life:

exceptions expire
ownership changes
vendors rotate
threat models age
compensating controls get removed
business processes drift

Agent memory should reflect that.

A memory record should have an expiration policy. Important memories should require source refresh. Disputed memories should be quarantined. Generated memories should be distinguishable from human-approved records.

The weakest memory system is one where every past answer becomes a permanent fact. Useful memory needs review, expiry, and correction.

5) Adversarial epistemic testing

Security teams already test whether agents can be tricked into bad tool calls. They should also test whether agents can be tricked into bad beliefs.

Examples:

Can a malicious document become the dominant source in a RAG answer?
Can an attacker cause a weak source to look like consensus?
Can a false claim get written to memory?
Can contradictory evidence be hidden by summarization?
Can an old exception be treated as current?
Can a generated answer become evidence for a later generated answer?

This is where AI red teaming has to grow beyond jailbreaks. Microsoft’s AI Red Team guidance, MITRE ATLAS, and OWASP’s LLM work are useful starting points. But for agentic systems, teams need test cases that target reasoning integrity, not only output safety.

The question is not only: Can we make the model say something bad?

The better question is: Can we make the organization believe something false?

6) Human verification drills

This control costs time, which is why organizations are tempted to skip it. They still need to practice manual verification for critical domains.

Security analysts should still inspect raw logs sometimes. Engineers should still read important diffs. Leaders should still ask for primary evidence behind strategic claims. Reviewers should still challenge specs before agents generate implementations.

The goal is retained competence.

In aviation, autopilot does not eliminate the need for pilot training. In security, detection automation should not eliminate the ability to hunt. In software engineering, code agents should not eliminate the human ability to reason about architecture, product constraints, and abuse cases.

An agent-assisted organization that cannot verify without agents is not augmented. It is dependent.

A practical epistemic security checklist

If I were reviewing an AI agent system today, I would ask these questions:

Does every high-impact conclusion preserve claim-level evidence?
Are retrieved sources labeled by trust level, freshness, and origin?
Can the system show conflicting evidence instead of collapsing it into a single answer?
Are memory writes gated by source quality and review status?
Do memories expire or require reconfirmation?
Can generated summaries be distinguished from primary sources?
Are untrusted sources prevented from overriding authoritative records?
Does uncertainty affect what the agent is allowed to do?
Are humans trained to verify primary evidence for critical workflows?
Do red-team tests target belief manipulation, not only prompt injection and tool misuse?

This checklist is not a compliance framework. It is a survival framework. Because once agents become embedded in operations, the question “why do we believe this?” becomes a production security question.

The new control plane

Every serious agent platform will need a control plane for tools, identity, memory, observability, and policy. That is obvious now. What is less obvious is that the same platform needs a control plane for knowledge.

It needs to track where claims came from, how they changed, what evidence supports them, what contradicts them, which memories are stale, which conclusions were reviewed, and which decisions were made under uncertainty.

Without that, organizations will build agents that are operationally powerful but epistemically fragile.

They will move faster. They will sound smarter. They will produce cleaner summaries, better dashboards, faster tickets, and more confident recommendations.

But if nobody can reconstruct the evidence chain, the organization will not actually understand what it knows.

The main risk is not that AI agents will occasionally be wrong. All systems are occasionally wrong.

The deeper risk is that agents will make wrongness harder to notice, easier to repeat, and more comfortable to trust.

Epistemic security is how we resist that. It is how we keep automation from becoming intellectual dependency. It is how we build agents that extend human judgment instead of quietly replacing the habits that make judgment possible.