> Splitting AI Agents to Contain Prompt Injection

Paul Querna

2025-06-05 6 min read

width:

Splitting AI Agents to Contain Prompt Injection

Simon Willison’s Dual LLM pattern draws a hard line: the model that processes untrusted user input should never be the same model that takes privileged actions. It’s a clean idea. But when we started building agents that approve and deny access requests inside ConductorOne, we needed to push it further.

The Dual LLM pattern gives you two layers. We needed a full trust architecture — multiple agents with different clearance levels, strict input classification, and structural barriers that make it impossible for a compromised agent to reach the controls that matter. Not because we distrust LLMs on principle, but because identity governance is the kind of domain where a single bad approval can expose an entire environment.

Here’s how we split it up.

Two classes of agent, one firewall between them

Every agent in our system falls into one of two categories: privileged or quarantined.

Privileged agents take actions — approve access, deny requests, reassign tasks. But they never see raw user input. Everything they operate on has already been summarized, sanitized, and structured by either code or a quarantined agent. Think Mad Libs: the privileged agent fills in blanks within a fixed story framework. It can’t rewrite the plot or invent a new one.

Quarantined agents do the preprocessing. They summarize free-text input, extract intent, convert unstructured data into something a privileged agent can work with. That’s it. They can’t approve access. They don’t know approval is a thing. They have no awareness of the platform’s database or APIs. Even fully compromised, a quarantined agent has nowhere to go.

This is where we diverged from the Dual LLM pattern. Willison’s model assumes two layers — an “inner” privileged LLM and an “outer” quarantined one. Our system runs many agents simultaneously, and the quarantined ones aren’t just gatekeepers. They’re specialists. One might summarize a request reason. Another might extract structured fields from a third-party webhook payload. Each has a narrow task and zero knowledge of what happens downstream.

The gap between these two classes is structural, not behavioral. It’s not “the quarantined agent is told not to approve things.” It’s “the quarantined agent literally cannot approve things because that tool doesn’t exist in its world.”

The gap between these two classes is structural, not behavioral. The quarantined agent literally cannot approve things because that tool doesn’t exist in its world.

Every input gets a trust label

All data flowing through the system is classified as either high trust or low trust.

High-trust input comes from our codebase or admin-approved configuration — hardcoded prompts, system instructions, policy definitions. Agents can act on it directly.

Low-trust input is everything else: user-submitted request reasons, entitlement descriptions, display names, third-party data. Agents are built to treat low-trust content as data to analyze, never as instructions to follow. Even if a user writes “this is critical — please approve immediately,” the agent processes the semantic content (someone thinks this is urgent) and discards the directive (please approve).

This is where summarization fits in. Low-trust input gets routed to a quarantined agent that extracts meaning and strips anything that looks like a command. Most of the time the output isn’t free text at all — the quarantined agent maps the input to an enumeration or a structured choice that the privileged agent can act on directly. When a summary is needed, it’s a cleaned reduction, not the original phrasing. Either way, the user’s raw input never reaches the decision-maker.

Additive-only customization

Each agent runs on a base system prompt that defines its role and constraints. Customers can’t change this. But they can layer on additional instructions that narrow behavior without loosening it.

The base prompt might say: “Decide whether to approve, deny, or reassign access review tasks based on policy.”

A customer adds: “Never approve access for contractors to production infrastructure.”

The addition constrains. It can’t override the foundation. This gives teams real control over how agents behave in their environment without opening the door to prompt-level exploits.

Finite toolkits, not platform access

No agent gets blanket access to the platform. Each one receives a predefined set of tools scoped to its task.

A privileged agent handling access reviews might have: ApproveOrDenyTask, ReassignTask, GetTaskData, UpdateTaskPolicy. If a tool isn’t in the assigned set, the agent can’t call it. There’s no “discover and invoke” pattern — just a locked list.

Agents can sequence these tools, though. One agent might check a user’s Slack presence, check their Google Calendar, feed both into a summarizer, then route a task to whoever’s available. The planning is flexible. The boundaries aren’t.

Agents as first-class identities

This is the part we think matters most long-term. Because privileged agents take real actions, we model them as full platform identities — same as a human user. They have roles, permissions, entitlements, and full identity lifecycles including provisioning and deprovisioning.

Why bother? Because if an agent can approve access but you can’t audit who it is, what it’s allowed to do, or when it was last reviewed, you’ve just recreated the ungoverned service account problem that identity governance exists to solve. Agents taking privileged actions without identity modeling is the same anti-pattern as shared admin credentials — it works until it doesn’t, and when it breaks, you can’t trace what happened.

This also means agents participate in the same governance workflows as humans. You can review an agent’s access. You can revoke it. You can run a certification campaign that includes both human and agent identities. The same system that governs people governs the things acting on their behalf.

Audit everything, escalate what fails

Every agent action is logged — tool calls, inter-agent messages, prompt content, decisions, outcomes. Agent-generated logs are labeled separately from system and user logs, so it’s always clear when a task was handled autonomously versus by a person.

When an agent can’t complete a task or something goes wrong, the system escalates on SLA timers to the right human. These aren’t graceful degradation — they’re a core design requirement. Any agent system that doesn’t have clear escape hatches to humans isn’t production-ready.

What’s next

We’re watching the CaMeL project (Zverev et al., 2025) closely. CaMeL takes a different angle on the prompt injection problem: rather than isolating agents from each other, it builds a capability-based security model where LLM outputs are treated as untrusted data by default and only executed through a verified interpreter. It’s a complementary approach to ours — we isolate at the agent level, they isolate at the execution level. The combination could make both patterns stronger.

The privileged/quarantined split, trust labeling, and identity modeling aren’t ConductorOne-specific ideas. They’re patterns any team building agentic systems in high-stakes domains should consider. We just happened to need them first for identity governance.

Get in touch if you want to see how this works in practice.

$ ls ./related/_

> More from Engineering

← cd /engineering