Agent Native·2026

Agentic SaaS Playbook 2026

Agent Native

18 parts · Last updated: Feb 2026

1. Introduction

Pocket guide for shipping agentic SaaS products

Disclaimer

I'm not affiliated with any of the companies or products mentioned in this book. Information is accurate to the best of my knowledge as of February 2026, but products and policies change.

If you're looking for a specific playbook, additional research, or hands-on support building your product, you can write me at agentnativedev (at) gmail (dot) com.

For a long time, I was looking for serious resources on building agentic products: resources for people who actually want to ship products that work and are loved by users. I paid for courses, books, communities, and programs hoping one of them would go deep enough.

A lot of courses, books or content show you a clever workflow, maybe a framework, and then quietly skip the hard parts: identity, permissions, billing, retries, monitoring, data ownership, and the boring realities of operating software for customers.

And that gap is where most of the real work lives.

Users will stress-test your edge cases, your costs, your uptime, and your trust boundaries, and they will hold you accountable.

That's the part I needed help with when I started.
That's the part I couldn’t find.
So that's the book I wrote.

If you follow the sections at your own pace and build alongside the repository, you'll understand what it takes to design, implement, and operate a SaaS product with agentic capabilities.

You'll build a Deep Research Agent that runs inside a real SaaS product: a public landing page, an SEO-optimized blog, a protected workspace, billing, recurring automation, and a backend that enforces identity and ownership. It's something you can actually put in front of customers.

Throughout the book, I'll use planes as a practical way to reason about the system: UX, orchestration, runtime, memory, data, integrations, security, and observability.

Each plane is where specific failure modes show up, and where specific investments pay off.

Here's what you'll end up with:

A public SaaS surface (landing, docs, pricing) that transitions cleanly into a protected workspace.
A protected product area where users launch runs, approve steps, ask follow-ups, and download reports.
A full run lifecycle: search → collect sources → summarize → approval gates → memory indexing → PDF report.
Monitors that schedule recurring research runs like a lightweight operations layer.
Stripe-based entitlement gating so runtime access is enforced server-side, not just in the UI.
A mock vs live mode switch so you can demo and test fast without breaking production contracts.
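
One way to picture the mock vs live switch is a single interface with two implementations chosen by configuration. This is a minimal sketch, not the repository's code: `SearchProvider`, `MockSearch`, `LiveSearch`, `make_search_provider`, and the `RESEARCH_MODE` environment variable are all hypothetical names used for illustration.

```python
from __future__ import annotations

import os
from typing import Protocol


class SearchProvider(Protocol):
    """Contract that both modes must satisfy (hypothetical name)."""

    def search(self, query: str) -> list[dict[str, str]]: ...


class MockSearch:
    """Returns canned results so demos and tests need no network access."""

    def search(self, query: str) -> list[dict[str, str]]:
        return [{"title": f"Mock result for {query}", "url": "https://example.com/mock"}]


class LiveSearch:
    """Placeholder for a real integration; deliberately unimplemented here."""

    def search(self, query: str) -> list[dict[str, str]]:
        raise NotImplementedError("wire a real search API here")


def make_search_provider(mode: str | None = None) -> SearchProvider:
    # RESEARCH_MODE is an assumed variable name; default to mock for safety.
    mode = (mode or os.environ.get("RESEARCH_MODE", "mock")).lower()
    if mode == "live":
        return LiveSearch()
    if mode == "mock":
        return MockSearch()
    raise ValueError(f"unknown mode: {mode!r}")
```

Because callers only see `SearchProvider`, the rest of the run lifecycle stays identical in both modes, which is what keeps production contracts intact.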

This book, and the assets that come with it, are for people building a product that works, that people trust, and that lasts.

I hope you enjoy the freely available sections and deep dives. They're not light previews, they're genuinely substantial, and I put as much emphasis on them as I did on the gated sections. They're absolutely worth studying and practicing.

And if you're serious about taking things a lot further, I'd love for you to join The Product Foundry. It's an exclusive membership for builders who want to ship products the right way, with depth and discipline. I hope to see you inside.

2. Architectural Planes of Agentic SaaS

To keep things concrete, we'll use a mental model I'll call planes: layers of a deployable system that each carry their own failure modes, costs, and design constraints.

The planes below are a way to keep structure legible. Each plane is where different constraints dominate and different build vs buy decisions make sense.

UX: Where trust is won or lost, what users see, approve, and download.
Control: Deterministic execution substrate: routing, state, tool contracts, gates.
Runtime: Execution environments, queues, workers, timeouts, and cost control.
Memory: What you store, how you index it, and how you ground Q&A.
Data: Persistence, schemas, ownership, and multi-tenant boundaries.
Integrations: Web search, scraping, tools, external APIs, and their drift over time.
Security: Identity, authorization, policy checks, input/output guardrails.
Observability: Logs, traces, metrics, evaluation, monitoring, debugging, cost attribution.

Knowledge Check

Which plane most directly reduces the “articulation burden” of chat-only products?

User Experience Plane (UI/UX)

Where trust is won or lost, what users see, approve, and download.

Most agentic products fail because the experience makes the smart thing hard to access, hard to trust, or hard to justify.

In practice, UX breakdowns compound very quickly.

Small friction points across onboarding, mental models, guardrails, and handoffs add up to a slow-motion collapse.

That's why the UI/UX plane is your strategy for adoption, cost control, and credibility.

SaaS framing in this blueprint

In this blueprint, we are building a Deep Research Agent as a SaaS product. That raises the engineering bar: each run must be user-scoped, credit-gated, observable, and restart-safe.

We are designing a deployable SaaS product with a full user journey: acquisition, authentication, protected workspace, run lifecycle visibility, and billing transparency.

A good rule of thumb for 2026: start with the user, not the algorithm.

Before you choose model stacks or “agent frameworks,” look at how people do the job today:

What are they trying to accomplish?
What feels slow, risky, or annoying?
  Slow: too many steps
  Risky: easy to make mistakes
  Annoying: context switching
Where do they already live (Slack, email, CRM, ticketing systems)?
Start with the smallest change that helps
  Even if that change is non-AI

It's tempting to reach for the newest tech by default, but most products don't need it.

And even when AI is involved, you should sell and design the product around user outcomes, not model magic. More often than you'd expect, a lightweight automation will solve the problem better (and cheaper) than wrapping everything in a heavy LLM call.

Borrowed surfaces beat bespoke UIs early

In the earliest stages, frontend tax is real and building bespoke UIs can delay the launch by months.

So teams often meet users where they already work: Slack, Microsoft Teams, email, or the system-of-record (e.g., an ITSM tool, CRM, or ticketing platform). You get instant distribution, familiar interaction patterns, and you can iterate quickly.

Example: Atomicwork

A good example is Atomicwork, whose agentic service-management platform launched with Slack and Microsoft Teams integrations. Employees interact through Slack while the service management layer and agentic workflows run behind the scenes.

Source: Atomicwork – Modern ITSM solution

Example: GTM Buddy

Another example is GTM Buddy: agents meet end-users where they work by embedding into Salesforce, Gmail, Outlook, Slack, and Teams—so users don't have to toggle between tools.

Source: GTM Buddy

This pattern is especially sensible if you're racing toward a funding milestone: you can prove value without building a UI fortress first.

How we applied this in our MVP

We intentionally shipped a hybrid UX from the start: strong public surface for positioning and education, plus a protected workspace for governed operations.

UX surface split (implemented)

// public UX
/ -> landing
/blog -> thought leadership
/docs -> implementation orientation
/pricing -> conversion

// protected UX
/login -> auth funnel
/research (and /admin alias) -> deep research workspace
/billing -> subscription + credits visibility

UX trap

Chat is an amazing entry point but a weak information architecture. Users must (1) know what the system can do and (2) express it well. That “articulation burden” is where ROI quietly dies.

People either don't ask, ask the wrong thing, or don't trust what they get back.

Borrowed surfaces also come with a hidden invoice:

You inherit the platform’s interaction model, which is great for quick intake but weak for complex work
You inherit the platform’s constraints. Data retention rules, UI limitations, API policy changes.
You risk becoming “a bot” instead of “a product.” Users don’t build trust in bots the way they build trust in tools.

Here are some design moves that can reduce early-stage failure without building a full UI:

Assist + handoff: Don't try to “replace experts.” Draft the response, propose next steps, show reasoning, ask for approval.
Make cost visible: Add friction where it matters (confirmations, approvals, scoped actions) and remove it where it doesn't (quick replies).
Micro-interactions: Buttons, forms, menus, and “suggested next steps” become lightweight IA inside chat for repeatable tasks.

A mature pattern is hybrid: you keep Slack/Teams as the fast first touchpoint, and provide a dedicated web/mobile experience as the trustworthy “control center” for governance, persistence, multi-step workflows, and differentiation.

Beyond bots: dashboards, workflows, and trust for growth

Chat can handle intake but operations require structure.

Support workflows are the clearest example, e.g. Zendesk's Slack integration lets teams create tickets inside Slack via shortcuts/actions but resolution still lives inside the structured system behind it.

This is why, as startups grow and move into Series B, they often develop dedicated agent interfaces. The bottleneck shifts from “can we ship?” to “can users reliably get value every day?”

Once you're there, richer patterns become worth the effort:

Multi-step workflows with checkpoints
Persistent histories and task states.
Approvals, audit trails, and governance
Personalization and role-based views
Multi-agent coordination (handoffs between agents and humans)

Implementation note

As you will see later, these patterns also appear in the product we are building:

Persistent histories: run list + events + report artifacts in /research.
Governance: approval pause flow (waiting_approval) + resume/reject actions.
Observability UX: runtime inspector with task DAG, snapshots, and memory facts.
Role workflows: datasets/methodologies/monitors side panels as structured operations.
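
The approval pause flow can be pictured as a small state machine in which a run can only leave `waiting_approval` through an explicit resume or reject. This is a sketch under assumptions: only the `waiting_approval` state name comes from the text above, and the other states and the `transition` helper are hypothetical.

```python
from __future__ import annotations

from enum import Enum


class RunState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    WAITING_APPROVAL = "waiting_approval"  # the only state name taken from the text
    COMPLETED = "completed"
    REJECTED = "rejected"
    FAILED = "failed"


# Allowed transitions; anything not listed here is refused.
TRANSITIONS: dict[RunState, set[RunState]] = {
    RunState.QUEUED: {RunState.RUNNING},
    RunState.RUNNING: {RunState.WAITING_APPROVAL, RunState.COMPLETED, RunState.FAILED},
    RunState.WAITING_APPROVAL: {RunState.RUNNING, RunState.REJECTED},  # resume / reject
    RunState.COMPLETED: set(),
    RunState.REJECTED: set(),
    RunState.FAILED: set(),
}


def transition(current: RunState, target: RunState) -> RunState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the transition table as data makes it easy to render in the UI and to assert in tests that a paused run can never jump straight to completed.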

Example: Relevance AI

Relevance AI is one example of the “agent OS” direction: dedicated surfaces plus integrations, with agents positioned as managed workforce components. The team raised a US$24M Series B round in 2025.

Source: Relevance AI

You'll also see predictive onboarding, chat-based dashboards, and voice assistants in areas such as fintech and healthcare. You can go well beyond simple Slack bots and build dashboards that display agent state, analytics, and contextually appropriate actions.

Source: Tableau

Caveat: accuracy metrics don’t equal value

If UX causes unnecessary escalations, redundant work, or endless back-and-forth, the human cost swamps the automation benefit, like a fraud system that flags everyone and creates a manual-review backlog.

The interface has to shape demand (good defaults, constrained choices, clear escalation paths), not just answer questions.

Dedicated UI is control over:

Persistence
Governance
Differentiation
Mental model users learn

Two UX practices that pay off

Storyboard first: Before you pixel-push, sketch the flow: who's the protagonist, what triggers the interaction, what success looks like, and where the agent must not operate autonomously. It's the fastest way to catch wrong-use-case problems early.
Agent narrative: Design the story users experience: what the agent can do, what it's doing now, what it needs from them, and what happens next. Without a narrative, even a capable agent feels random.

Platform dependency risk

A practical reason to invest in your own UI at Series-B is dependency risk. The more your product's core value depends on someone else's UI + data pipes, the more your roadmap inherits their constraints (policy changes, API limits, data-handling rules, UI affordances).

Hybrid wins at Series-C and enterprise

By Series-C (and certainly in enterprise), the winning pattern is almost always hybrid:

Keep: Slack/Teams for speed, reach, and convenience
Add: A dedicated web/mobile experience for depth, governance, and differentiation

How this maps to our current SaaS UX architecture

Fast door: public pages and login funnel reduce acquisition and onboarding friction.
Control center: protected /research workspace with run launch, runtime visibility, and artifacts.
Governance center: /billing page exposes subscription state, credits, and recent events.
Route protection: middleware + backend auth keep private operations inaccessible to anonymous traffic.
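
The route-protection rule can be sketched as a pure function over the path and the session state. The paths come from the UX surface split earlier in this section; the function names and the `"redirect:/login"` return convention are assumptions for illustration, and the backend must still enforce auth on its own.

```python
from __future__ import annotations

# Protected prefixes from the UX surface split; everything else is public.
PROTECTED_PREFIXES = ("/research", "/admin", "/billing")
LOGIN_PATH = "/login"


def is_protected(path: str) -> bool:
    """True when the path is a protected route or lives under one."""
    return any(path == p or path.startswith(p + "/") for p in PROTECTED_PREFIXES)


def route_decision(path: str, authenticated: bool) -> str:
    """Return "allow", or a redirect target for anonymous traffic."""
    if is_protected(path) and not authenticated:
        return f"redirect:{LOGIN_PATH}"
    return "allow"
```

Matching on whole path segments avoids accidentally protecting (or exposing) a sibling route that merely shares a prefix.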

Example: Moveworks

Moveworks is a well-known example of an agentic IT support surface that works through enterprise messaging tools like Microsoft Teams and Slack for convenience, while still supporting richer, structured experiences for ticketing, workflows, and analytics.

Custom web and mobile dashboard for IT support tickets

Source: Moveworks

Example: Intercom + Slack

Established SaaS products like Intercom show the same maturity curve. Intercom's support system, powered by its Fin AI agent, connects Slack channels so that support agents see tickets in both Intercom and Slack, with real-time sync of statuses and conversation histories.

Source: Intercom — Connect your Slack channel

What changes in UX emphasis (late stage)

Expectations: Users need a simple, explicit contract, i.e. what the agent can do, what it can't, and what it will ask before acting.
Transparency: Show sources, assumptions, and the “because” behind recommendations to earn trust.
Error handling: Safe retries, graceful fallbacks, clear escalation, and “undo” matter more than cleverness.
Autonomy: Start with suggestions, then move to guarded execution, then expand autonomy as trust and reliability prove out.
Data restraint: Collect less by default, ask permission when you must, and be explicit about retention and access. Users notice.
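
The autonomy progression can be made explicit as a small policy function. This is only a sketch: the three level names and the reversible/irreversible split are my assumptions about how "suggestions, guarded execution, expanded autonomy" might be encoded.

```python
from __future__ import annotations

from enum import IntEnum


class Autonomy(IntEnum):
    SUGGEST = 1      # agent proposes, a human executes
    GUARDED = 2      # agent executes reversible actions, asks before risky ones
    AUTONOMOUS = 3   # agent executes without asking (assumed top level)


def next_step(level: Autonomy, reversible: bool) -> str:
    """Decide what the agent may do with a proposed action."""
    if level is Autonomy.SUGGEST:
        return "suggest"
    if level is Autonomy.GUARDED and not reversible:
        return "ask_approval"
    return "execute"
```

Storing the level per tenant or per workflow lets you expand autonomy gradually instead of flipping it on for everyone at once.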

UX maturity curve (typical evolution)

Start with borrowed surfaces (Slack/Teams/CRM) to validate value quickly.
Add lightweight IA inside chat (structured actions, suggested flows, guardrails) to reduce articulation burden and misuse.
Build a dedicated agent hub for persistence, multi-step workflows, governance, and brand differentiation.
Land on a hybrid model where chat is the fast “front door,” and your UI is the trustworthy “control center.”

If you're building agentic SaaS, the key is not picking “Slack bot vs bespoke UI” as a permanent identity. It's recognizing where you are on the maturity spectrum, then designing the smallest UX system that makes:

Value discoverable
Outcomes trustworthy
Human cost aligned with business benefit

Developer experience is part of UI/UX

If you're building an agentic platform (not just an app) where developers consume your services, your “UI plane” includes developer-facing surfaces too:

SDKs
CLIs
Self-service panels
Workflows
Admin controls
Diagnostics
Rollout tooling
Support

This collection of developer surfaces makes the system feel predictable. Without it, the platform behaves like a black box, adoption dies, and support can't scale after the initial rollout.

You often optimize for four developer outcomes: time-to-first-success (onboarding), time-to-confidence (predictability), time-to-debug (observability), and time-to-recover (safe change + rollback). The companies below win because they compress those timelines aggressively.

DX patterns that correlate with adoption

Integration contract: Make the platform's behavior legible: inputs/outputs, tool scopes, permissioning, rate limits, cost signals, and failure modes. Developers ship faster when the “contract” is clear enough to reason about and test.
Traceability: Give developers run traces they can debug: tool calls, retrieved context, state transitions, approvals, retries, and where the agent asked for clarification. If developers can't explain an outcome, they won't trust it in production.
Safe sandboxes: Provide environments where teams can test with real-ish data without real-world blast radius: replay, simulation, dry-run modes, and one-click rollback. The goal is “learn fast” without “break prod.”
Opinionated primitives: Ship reusable UI + API building blocks for “agent status,” approvals, citations, human handoff, feedback, and undo, so every team doesn't reinvent unsafe patterns with slightly different failure modes.

These patterns reduce the cognitive load of shipping agentic behavior. The goal is to make the “right” path the easiest path, and make risky moves feel obviously risky before they hit production.

Let's have a look at a few examples.

Example: Vercel previews as the contract

Vercel turned deployments into a default workflow. Every branch gets a preview environment. That's a DX pattern agent platforms should steal. Make “safe testing” the path of least resistance, and tie it to the habits developers already have (Git → environment → feedback → merge).

Source: Vercel Git integrations

This removes the hidden tax of “setting up a place to test.” Preview environments compress feedback loops and eliminate coordination overhead (no shared staging fights, fewer “works on my machine” debates). It also boosts developer happiness because progress becomes visible. A URL you can share is instant social proof, and it aligns engineering with product/design review without extra ceremony.

In agentic platforms, previews matter even more because behavior is probabilistic. If every iteration requires a full production-like release, developers become conservative and slow. Previews let teams explore safely: they can tune prompts, adjust tool scopes, and refine guardrails, all with realistic integration context.

Example: Supabase schema as source-of-truth

Supabase leans into Postgres as the contract: local dev workflows and migrations keep behavior consistent across environments, and generated types make mismatches show up early. You can similarly treat “what's allowed” as a first-class artifact.

Source: Supabase Local development

This effectively reduces onboarding friction. When the “contract” is your schema, the platform becomes teachable: new developers can infer behavior from types, tables, and migrations instead of reading Slack threads.

For agents, the schema analogy extends beyond data. You want “behavior schemas” too: tool input/output definitions, allowed action scopes, escalation rules, and approval boundaries. The more of that you can represent as structured artifacts (and validate in CI), the less your platform depends on hero engineers.
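
A behavior schema can be as simple as a typed tool contract plus a validator that a CI job runs over every registered tool. The sketch below is illustrative: `ToolContract`, `ALLOWED_SCOPES`, and the rule that a delete scope requires approval are assumptions, not a real platform's schema.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed scope vocabulary for this sketch.
ALLOWED_SCOPES = {"read", "write", "delete"}


@dataclass(frozen=True)
class ToolContract:
    name: str
    input_fields: tuple[str, ...]
    scopes: frozenset[str]
    requires_approval: bool = False


def validate_contract(contract: ToolContract) -> list[str]:
    """Return a list of problems; an empty list means the contract passes."""
    problems: list[str] = []
    if not contract.name:
        problems.append("name is empty")
    unknown = contract.scopes - ALLOWED_SCOPES
    if unknown:
        problems.append(f"unknown scopes: {sorted(unknown)}")
    if "delete" in contract.scopes and not contract.requires_approval:
        problems.append("delete scope requires approval")
    return problems
```

A CI step could fail the build whenever `validate_contract` returns anything, so unsafe tool definitions are caught in review rather than in production.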

Example: AWS permissioning you can test

AWS ships an IAM policy simulator so teams can validate and troubleshoot authorization rules before rolling changes out.

Source: AWS IAM policy simulator

Permissioning is where agentic platforms either become enterprise-grade or become a toy. Developers don't fear complexity, they fear invisible complexity. Testable permissions reduce anxiety because teams can answer “who can do what, under which conditions” without guessing.
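
Testable permissions boil down to a function you can call with "who, what" and get a decision plus the reason. The sketch below is loosely modeled on the familiar "explicit deny wins, otherwise default deny" evaluation order; it is not the AWS API, and `Rule` and `simulate` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    effect: str   # "allow" or "deny"
    role: str
    action: str   # exact action name, or "*" for any action


def simulate(rules: list[Rule], role: str, action: str) -> str:
    """Explicit deny wins, then allow, otherwise default deny."""
    matches = [r for r in rules if r.role == role and r.action in (action, "*")]
    if any(r.effect == "deny" for r in matches):
        return "deny (explicit)"
    if any(r.effect == "allow" for r in matches):
        return "allow"
    return "deny (default)"
```

Returning the reason alongside the decision is what turns "who can do what" from guesswork into something a developer can assert in a test.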

This also sets up your traceability story. Once you have a clear contract and enforceable boundaries, the next productivity bottleneck is debugging: when something goes wrong, can a developer see exactly where the contract was violated (or where the world didn't match expectations)?

Example: Stripe Request logs as debugging UX

Stripe makes request logs a first-class developer surface. You can inspect what was sent, what was returned, and what failed. That's also how you should treat run traces.

Source: Stripe Request logs

Logs reduce time-to-debug. More importantly, they reduce onboarding time because they teach developers how the platform behaves in the real world rather than on an idealized “happy path.” In agent platforms, logs should show not just errors, but also intent, e.g. tool calls attempted, inputs used, scopes applied, and what guardrail blocked an action.

Still, logs alone often answer “what happened,” not “why did it happen.” That's where traces and causal timelines become the difference between a confident developer and a frustrated one.

Example: Sentry traces + breadcrumbs for causality

Sentry's Trace View and breadcrumbs are designed to answer the only question that matters during incidents: “what happened, in what order, and why?” You need the same ergonomics, e.g. timelines of tool calls, state changes, user approvals, and failures so teams can debug behavior.

Source: Sentry Trace View

This is also where developer happiness shows up as a measurable operational outcome: lower MTTR, fewer escalations, and less “psychological load” during incidents. A good trace UI turns debugging into navigation.

Developers stop asking “is the model broken?” and start answering “the tool call failed because scope X blocked it,” or “retrieval returned stale context,” or “approval step was skipped due to misconfiguration.”
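
A causal timeline needs little more than ordered events with a kind, a detail, and a success flag. This is a minimal in-memory sketch with hypothetical names (`TraceEvent`, `RunTrace`); a real system would persist events and attach timestamps and identifiers.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TraceEvent:
    seq: int
    kind: str      # e.g. "tool_call", "approval", "guardrail_block"
    detail: str
    ok: bool = True


@dataclass
class RunTrace:
    run_id: str
    events: list[TraceEvent] = field(default_factory=list)

    def record(self, kind: str, detail: str, ok: bool = True) -> None:
        """Append an event; sequence numbers preserve ordering."""
        self.events.append(TraceEvent(len(self.events) + 1, kind, detail, ok))

    def first_failure(self) -> TraceEvent | None:
        """The earliest failing event, which is usually where debugging starts."""
        return next((e for e in self.events if not e.ok), None)

    def timeline(self) -> list[str]:
        """Human-readable lines, one per event, in order."""
        return [f"{e.seq:02d} {'ok ' if e.ok else 'ERR'} {e.kind}: {e.detail}" for e in self.events]
```

With this shape, "the tool call failed because scope X blocked it" becomes a single line in the timeline rather than an inference from scattered logs.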

Once you can explain behavior, the next constraint becomes iteration speed. The best debugging tools in the world won't help if every fix requires a risky production deploy. That's why developer-first companies obsess over safe, realistic sandboxes.

Example: Cloudflare preview URLs + local dev loops

Cloudflare Workers supports preview URLs and local dev workflows so developers can iterate quickly without pushing risky changes straight to production. You should give teams a tight feedback loop and safe promotion paths.

Source: Cloudflare Previews

Sandboxes increase velocity because they turn experimentation into a default behavior. When developers can replay inputs, simulate tool failures, and test different guardrails quickly, they converge on reliable designs faster and ship with less fear.

But even with great previews, eventually you have to ship to production. That's where rollout UX matters: you need confidence-building mechanisms that let teams deploy agent behavior changes without betting the company on a single release.

Example: AWS Lambda canary rollouts with weighted aliases

AWS Lambda weighted aliases allow gradual traffic shifting to new versions with quick rollback. This is the rollout UX agent platforms should standardize: promote behavior changes through controlled exposure, not big-bang releases.

Source: AWS Lambda alias routing

Canary rollouts reduce change failure rate and that translates directly into developer trust. Teams become willing to ship improvements because the blast radius is explicit and controllable. In agent platforms, this is especially important because behavior changes can alter cost, latency, and user trust in one move. Controlled exposure makes those tradeoffs observable before they become widespread.

Still, version-based rollout is only half the story. The other half is feature-level control, the ability to turn behaviors on/off, segment users, and iterate safely without coupling every tweak to a deployment artifact.

Example: LaunchDarkly progressive delivery as a default habit

LaunchDarkly treats percentage rollouts and staged releases as normal operating procedure. For agent systems, progressive delivery is how you prevent a new planner from doubling your escalations overnight.

Source: LaunchDarkly Percentage rollouts

When feature-level controls exist, teams can test hypotheses (“does this guardrail reduce escalation?”). It also improves developer happiness because the platform supports reversible decisions: you can experiment without fear of being trapped by a release.
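
A percentage rollout is commonly implemented by hashing a stable key into a bucket and enabling the feature for the first N buckets. The sketch below shows that general technique; it is not LaunchDarkly's algorithm or SDK, and the function names are hypothetical.

```python
from __future__ import annotations

import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag: str, user_id: str, percent: int) -> bool:
    """True if the user falls inside the first `percent` buckets."""
    return bucket(flag, user_id) < percent
```

Because the bucket depends only on the flag and user, a user who sees the new planner at 10% keeps seeing it at 50%, and rolling back is just lowering the number.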

The missing piece for many agent platforms is deterministic testing. Progressive rollouts help in production, but you still want a way to validate integration flows repeatedly without triggering real-world side effects.

Example: Twilio test credentials for deterministic outcomes

Twilio's test credentials let teams exercise integration flows without triggering real-world side effects. That's the sandbox pattern agent platforms should copy, i.e. preserve the shape of production behavior while making consequences safe and repeatable.

Source: Twilio Test credentials

Determinism reduces onboarding friction because it makes learning reproducible: new developers can run the same scenario and see the same outcomes. It also reduces operational risk because teams can build strong regression tests around “known bad” cases, exactly what agent platforms need when behavior depends on tool availability, permissions, or shifting context.
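
In the spirit of test credentials, a tool's test double can map special inputs to fixed outcomes while recording calls instead of performing them. This is an illustrative sketch: `FakeNotifier` and the "magic" recipients are hypothetical and do not correspond to any vendor's actual test values.

```python
from __future__ import annotations


class FakeNotifier:
    """Test double: the outcome depends only on the recipient; nothing is sent."""

    # Hypothetical "magic" recipients that force a specific outcome.
    OUTCOMES = {
        "ok@test.invalid": "delivered",
        "bounce@test.invalid": "bounced",
        "slow@test.invalid": "timeout",
    }

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def send(self, recipient: str, body: str) -> str:
        # Record the call so tests can assert on what the agent attempted.
        self.calls.append((recipient, body))
        return self.OUTCOMES.get(recipient, "rejected: unknown test recipient")
```

A regression test for a "known bad" case then reads naturally: send to the bounce address, and assert that the agent escalates instead of retrying forever.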

Once you can test safely, you can go further and make “preview before action” a platform primitive. This is how you prevent costly or irreversible automation mistakes, and it's a huge trust builder for both developers and operators.

Example: Terraform dry-run diffs as a primitive

Terraform's plan makes “show me the diff before you apply it” the standard workflow. For agent platforms, this maps to previews of actions: what will change, what tools will run, what data will be touched, and what the rollback path is.

Source: Terraform plan

Previews increase developer confidence because they turn execution into an informed decision. In agent systems, “plan mode” is how you make autonomy legible. You can show the proposed tool calls, what data will be written, what permissions are required, and which steps require approval. This reduces incident volume and makes it easier for developers to defend the platform internally.
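
A "plan mode" can start as a list of proposed actions rendered for review before anything executes. The sketch below assumes a hypothetical `PlannedAction` shape and output format; it illustrates the idea and is not Terraform's plan format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedAction:
    tool: str
    target: str
    writes: bool
    needs_approval: bool = False


def render_plan(actions: list[PlannedAction]) -> str:
    """Describe what would happen without executing anything."""
    lines = []
    for i, action in enumerate(actions, start=1):
        mode = "write" if action.writes else "read"
        gate = " [approval required]" if action.needs_approval else ""
        lines.append(f"{i}. {action.tool} -> {action.target} ({mode}){gate}")
    writes = sum(a.writes for a in actions)
    lines.append(f"Plan: {len(actions)} actions, {writes} writes")
    return "\n".join(lines)
```

The same `PlannedAction` list can later feed the executor, so what the user approved and what runs are the same object rather than two descriptions that can drift apart.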

Now zoom out and notice that all of the above assumes developers can get started quickly. But many platforms lose adoption before they even reach debugging or rollout, because initial setup feels like homework. Developer-first companies treat onboarding as a first-class performance problem.

This is velocity through momentum. When teams can go from “new project” to “working demo” quickly, they develop attachment to the platform. It increases developer happiness because it respects their time and it increases adoption because stakeholders see results early. For agent platforms, a prebuilt scaffold might include tracing enabled, safe sandboxes configured, starter guardrails, and a “hello world” tool call that demonstrates approvals + undo.

Finally, none of this matters if platform evolution is painful. Developers stick with platforms that let them upgrade without dread and that requires versioning discipline. You want change to feel intentional, observable, and reversible.

Example: Stripe — versioning to prevent breaking integrators

Stripe's webhook versioning guidance is a model for platform evolution without chaos. Agent platforms need the same discipline, i.e. behavior changes should be versioned, observable, and migratable so adopters move on their schedule, not yours.

Source: Stripe Webhook versioning

Developer trust is earned through predictable change. When platform behavior is versioned and observable, upgrades stop being high-stakes events and become routine maintenance. That's how you preserve velocity over time, and it's a major driver of developer satisfaction: fewer surprises, less firefighting, and more forward progress.

DX maturity curve (how platforms usually evolve)

What “good DX” looks like as you scale

Start with a minimal contract (SDK + clear limits + observability) so early adopters can ship without guessing.
Add first-class traces and safe sandboxes so debugging and iteration become routine, not heroic.
Introduce progressive delivery (canaries, rollbacks, versioning) so behavior changes don't become support incidents.
Standardize primitives (approvals, citations, handoff, undo) so every integration doesn't reinvent unsafe UX.

The point isn't to copy these companies verbatim. It's to copy their underlying move. Treat developer workflows as product UX. When developers can predict, test, observe, and recover from agent behavior changes, they ship the platform and they keep it shipped.

This should give you enough perspective on approaching UI/UX for your own solution, platform, or product. In the next section, we will dive deep into the control plane.

Control Plane

The governance layer that decides what agents can do, what they cannot, and who is watching

The control plane is the nerve center of an agentic SaaS system. It does not run agent logic; that belongs to the runtime plane. Instead, it decides whether an agent step should execute, which tools an agent can reach, how much it is allowed to spend, and what happens when something goes wrong.

If the runtime plane is the engine, the control plane is the cockpit: instrument panel, throttle levers, and circuit breakers.

The agent control plane is an emerging architectural layer, distinct from the build plane (frameworks, models) and the orchestration plane (workflow engines). Its job is to provide unified visibility, governance, and management across a heterogeneous agent estate. Enterprises already rely on out-of-band control planes in other domains (Airbnb's experimentation guardrails, JPMorgan's model risk governance), and agents demand similar independent oversight.

Over the next 12-24 months, I expect this to solidify into a distinct market with dedicated vendors.

SaaS framing: the control plane is your trust contract

In a multi-tenant SaaS product, the control plane is what lets you promise customers: "Your agent will never exceed your token budget, will never call an unapproved tool, and every action is logged for audit."

This blueprint implements a control plane that separates policy from execution. Policies are declared as data (YAML/Cedar), evaluated at the middleware layer before every tool call, and enforced deterministically.

The same pattern powers AWS Bedrock AgentCore's governance layer, which enforces hard constraints at the infrastructure level rather than relying on prompt-level instructions.
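
"Policies declared as data" can be pictured with a plain dictionary standing in for a parsed YAML file. The structure below, the rule fields, and the first-match-wins semantics are assumptions for illustration; they are not Cedar syntax and not the blueprint's actual policy format.

```python
from __future__ import annotations

# Stand-in for a parsed YAML policy file; the structure is an assumption.
POLICY = {
    "default": "DENY",
    "rules": [
        {"tool": "web_search", "roles": ["analyst", "admin"], "decision": "ALLOW"},
        {"tool": "send_email", "roles": ["admin"], "decision": "REQUIRE_APPROVAL"},
    ],
}


def evaluate(policy: dict, role: str, tool: str) -> str:
    """First matching rule wins; otherwise fall back to the policy default."""
    for rule in policy["rules"]:
        if rule["tool"] == tool and role in rule["roles"]:
            return rule["decision"]
    return policy["default"]
```

Because the policy is data, it can be reviewed, diffed, and versioned separately from the code that enforces it, and a default of deny means a missing rule fails closed.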

Plane separation: why the control plane must live outside the runtime

A recurring architectural mistake is embedding governance logic inside agent prompts or runtime code. "Don't call the delete API" in a system prompt is a suggestion that can be overridden by jailbreaks, prompt injections, or model updates.

Real governance must sit in a separate process that intercepts agent actions before they execute and applies deterministic rules the agent cannot bypass.

This mirrors how service meshes (Istio, Envoy) enforce network policies without modifying application code, and how Kubernetes admission controllers validate resources before the API server persists them.

The Deep Research blueprint organizes governance across four planes. Each has a distinct responsibility and a separate failure domain:

Plane	Responsibility	Industry analogues	Blueprint implementation
Control	Policy, identity, tool registry, budget enforcement, progressive delivery	K8s API server, Istio control plane, AWS Bedrock AgentCore governance layer	Policy middleware, tool registry YAML, budget tracker, feature flags
Runtime	Execute agent steps: LLM calls, tool invocations, code sandboxes	Lambda, Temporal, E2B, Modal	Workload lanes, durable workers, sandbox broker
Memory	Persist context across sessions: conversation history, knowledge, embeddings	Mem0, Zep, LangGraph checkpointing	Session store, vector DB, episodic memory
Data	Source-of-truth storage: documents, structured data, indexes	PostgreSQL, S3, Pinecone, Elasticsearch	Document pipeline, search index, metadata DB

control_plane/middleware.py - policy enforcement as middleware

python

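from dataclasses import dataclass, field
from typing import Any


# ToolCallRequest and PolicyDecision are referenced but not defined in this
# excerpt; the minimal shapes below are assumptions so the listing stands alone.
@dataclass(frozen=True)
class ToolCallRequest:
    tenant_id: str
    tool_name: str
    resource: str
    estimated_tokens: int
    context: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class PolicyDecision:
    outcome: str      # "ALLOW", "DENY", or "REQUIRE_APPROVAL"
    reason: str = ""

    @classmethod
    def allow(cls) -> "PolicyDecision":
        return cls("ALLOW")

    @classmethod
    def deny(cls, reason: str) -> "PolicyDecision":
        return cls("DENY", reason)

    @classmethod
    def require_approval(cls, reason: str) -> "PolicyDecision":
        return cls("REQUIRE_APPROVAL", reason)


class ControlPlaneMiddleware:
    """
    Sits between the agent loop and the runtime.
    Every tool call passes through here BEFORE execution.
    The agent cannot bypass this; it is architecturally impossible.
    """

    def __init__(self, policy_engine, tool_registry, budget_tracker, audit_log):
        self.policy_engine = policy_engine      # Cedar / OPA / YAML rules
        self.tool_registry = tool_registry      # allowed tools + schemas
        self.budget_tracker = budget_tracker    # token / dollar limits
        self.audit_log = audit_log              # immutable event stream

    async def evaluate(self, agent_id: str, action: ToolCallRequest) -> PolicyDecision:
        """
        Called before every tool invocation.
        Returns ALLOW, DENY, or REQUIRE_APPROVAL, and audits every outcome.
        """
        decision = await self._decide(agent_id, action)
        # Record denials and approval requests too, not only allowed calls.
        await self.audit_log.record(agent_id, action, decision.outcome)
        return decision

    async def _decide(self, agent_id: str, action: ToolCallRequest) -> PolicyDecision:
        # 1. Is this tool registered and enabled? (Tenant scoping is assumed
        #    to live inside the registry.)
        tool = self.tool_registry.resolve(action.tool_name)
        if not tool or not tool.enabled:
            return PolicyDecision.deny(f"Tool '{action.tool_name}' not in registry")

        # 2. Does the agent's identity have permission for this action?
        authz = await self.policy_engine.evaluate(
            principal=agent_id,
            action=action.tool_name,
            resource=action.resource,
            context=action.context,
        )
        if authz.decision == "DENY":
            return PolicyDecision.deny(authz.reason)

        # 3. Would this call exceed the tenant's budget?
        budget_check = self.budget_tracker.check(
            tenant_id=action.tenant_id,
            estimated_cost=action.estimated_tokens,
        )
        if budget_check.exceeded:
            return PolicyDecision.deny(f"Budget exhausted: {budget_check.remaining} tokens left")

        # 4. Does this action require human approval?
        if authz.decision == "REQUIRE_APPROVAL":
            return PolicyDecision.require_approval(authz.reason)

        # 5. All checks passed.
        return PolicyDecision.allow()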

System prompts are suggestions to a probabilistic model. They can be overridden by prompt injection, ignored during multi-step reasoning, or invalidated when you swap models. A middleware-based control plane is deterministic: it runs as code that the agent cannot influence.

AWS Bedrock AgentCore makes this explicit: policies like "this agent cannot delete objects in the production S3 bucket" are enforced at the infrastructure layer, not embedded in prompts. For auditors, this is the difference between "we hope the AI behaves" and "we can prove it can't misbehave."
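
The budget check in step 3 of the middleware can be pictured with a tiny in-memory tracker. This is a sketch under assumptions: only the `check(tenant_id, estimated_cost)` call and the `exceeded` / `remaining` fields come from the listing above; the `commit` method and the storage model are hypothetical, and a real tracker would need persistent, atomic updates.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetCheck:
    exceeded: bool
    remaining: int


class BudgetTracker:
    """In-memory per-tenant token budget; a real system would persist this."""

    def __init__(self, limits: dict[str, int]) -> None:
        self._limits = dict(limits)
        self._used: dict[str, int] = {}

    def check(self, tenant_id: str, estimated_cost: int) -> BudgetCheck:
        # Unknown tenants get a limit of zero, so they fail closed.
        remaining = self._limits.get(tenant_id, 0) - self._used.get(tenant_id, 0)
        return BudgetCheck(exceeded=estimated_cost > remaining, remaining=remaining)

    def commit(self, tenant_id: str, actual_cost: int) -> None:
        """Record actual spend after a call completes (hypothetical method)."""
        self._used[tenant_id] = self._used.get(tenant_id, 0) + actual_cost
```

Checking against an estimate before the call and committing the actual cost afterwards keeps enforcement deterministic even when token usage is only known once the call returns.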