AgentOps: Operating Discipline for Managing AI Agents at Scale

Introduction
It's 4 PM on a Friday. Let me tell you how your weekend is about to go.
Your AI sales agent just offered your largest Fortune 500 client a 50% discount on the enterprise contract without authorization. The agent was running autonomously, had read-access to your CRM, and decided, based on the customer's churn risk score, that this was the right move.
It was not the right move. The deal is now technically binding under the terms your legal team spent six months negotiating.
This scenario, documented by Composio in their 2025 incident report, is not hypothetical anymore. It's the kind of thing that happens when you deploy an autonomous agent into production without the operational scaffolding to constrain what it can do, log what it did, and let a human intervene before the blast radius expands.
Happy Friday.
What is AgentOps, really
Andrej Karpathy has a useful framing here: we have the LLM kernel, but no operating system. The model itself is maybe 5% of what you actually need to ship. That also tracks with what Google found in their foundational "Hidden Technical Debt in Machine Learning Systems" paper.
At Google scale, the actual ML model code represents less than 5% of a production AI system. The other 95% is data pipelines, feature stores, serving infrastructure, monitoring, testing, and all the unglamorous scaffolding that actually makes it work when nobody's watching.
AgentOps is that 95%. It's the operating discipline, not a product or a platform, that brings platform engineering rigor, distributed systems reliability, human-centered governance, and FinOps discipline into a single framework built specifically for the unique failure modes of autonomous agents.
This matters most for long-running tasks, such as a multi-hour campaign optimization workflow, where durable execution, checkpointing, and failure recovery are essential. The agent persists state at each phase, enabling crash recovery from the last checkpoint instead of restarting from zero.
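A minimal sketch of that checkpointing pattern (all names here are hypothetical, and a real system would use a durable store rather than a local file): state is persisted after every phase, and a restart skips phases that already completed.

```python
import json
import os
import tempfile  # used in the usage example below

# Hypothetical sketch: persist workflow state after each phase so a crash
# resumes from the last completed checkpoint instead of restarting from zero.
class CheckpointStore:
    def __init__(self, path):
        self.path = path

    def load(self):
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return {"completed": [], "state": {}}

    def save(self, checkpoint):
        # Write-then-rename so a crash mid-write never corrupts the checkpoint.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(checkpoint, f)
        os.replace(tmp, self.path)

def run_campaign(store, phases):
    cp = store.load()
    for name, fn in phases:
        if name in cp["completed"]:
            continue  # completed before the crash; skip on resume
        cp["state"][name] = fn(cp["state"])
        cp["completed"].append(name)
        store.save(cp)  # checkpoint after every phase
    return cp["state"]
```

The write-then-rename in `save` is the detail that matters: a crash during the write leaves the previous checkpoint intact, so recovery always has a consistent state to resume from.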

AgentOps should cover the full lifecycle: design, deployment, governance, scaling, and the operational practices that determine whether your agents are an asset or a liability.
The mental model I use: treat an agent like a new employee with perfect recall, no judgment, infinite speed, and zero accountability. You'd never onboard a hundred of those without HR policies, performance management, audit trails, and someone who can fire them when they go rogue.
AgentOps is the HR department, the compliance team, and the SRE org for your agent fleet.
Why you need this urgently now
The adoption curve on agentic AI is steep, and it's not waiting for your governance process to catch up. The market is projected to expand from $3.81 billion in 2025 to $71.91 billion by 2033, roughly a 44% CAGR.
More immediately: according to a 2026 CrewAI survey, 65% of enterprises already have agents running in production, and 81% are actively scaling deployments. It's already in your org chart, whether your platform team knows about it or not.
Most of these deployments are operating in the dark: ungoverned, unmonitored, and quietly billed to someone's departmental credit card. And yes, many organizations will say they have “agent observability” in place, but if you've looked under the hood, you already know how thin that often is. A small minority may be genuinely prepared. For everyone else, incidents are just waiting for the right moment.
The real productivity bottleneck isn't model quality anymore. And with model prices falling, cost is no longer the central question either. We've moved past “Can we afford to run this?” The question now is much more important: “Can we trust what it's doing?”
93% of AI agent projects failed before reaching production in 2025. Before you dismiss that as FUD, here's the engineering reason: teams optimized for demo success, not operational readiness.
A PoC that works in a notebook with hand-crafted inputs and a fresh API key tells you almost nothing about whether the system will behave correctly at 10x volume with real user inputs, stale context, downstream APIs, and a token budget that someone set without understanding what multi-step reasoning actually costs.
It is not a bold prediction that most agentic AI projects will be abandoned by 2027 because the operational discipline isn't there to run them. The gap between "it works in the demo" and "it works in production" is exactly the gap AgentOps fills.
The four ways current approaches fail
I've reviewed production agent deployments across dozens of organizations, and the failure patterns are remarkably consistent. Most teams get this wrong in one of four ways:
1. The proof of concept that can't ship
- Demo works beautifully with curated inputs and a patient demo audience
- No observability layer: when it breaks in production, no one can explain why
- No reliability engineering: retries, circuit breakers, and idempotency are all absent
- 60-70% of PoCs expand in scope, then hit the wall when someone asks 'how do we run this at scale?'

2. Shadow agent deployments
- Business units are already deploying agents with consumer-grade tooling and minimal security review
- IT and security teams have limited red-teaming knowledge and little visibility until an incident surfaces
- Each rogue deployment is an unaudited autonomous decision-maker with access to real systems
- By the time you find them, they've already made consequential decisions you can't undo

3. The observability gap
- Traditional APM (Datadog, New Relic) was built for deterministic request/response systems
- It cannot trace multi-step reasoning chains, tool-call sequences, or causal logic
- When an agent makes a bad decision, teams can't reconstruct what happened
- This is running a distributed system with no distributed tracing: you're flying blind

4. Runaway economics
- Multi-step reasoning loops with tool calls can burn 100x what a simple prompt costs
- Retry loops on tool calls compound the problem
- Total cost routinely exceeds initial estimates by 3-5x once agents hit real-world inputs
- No one has established FinOps practices for agent workloads; you're making it up as you go
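As a concrete illustration of the guardrail that's usually missing, here is a minimal hard token budget (class and field names are hypothetical) that stops a loop before retries compound into a runaway bill:

```python
# Illustrative sketch (names hypothetical): a hard token budget that stops a
# reasoning loop before retries compound into a runaway bill.
class BudgetExceeded(Exception):
    pass

class TokenBudget:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens):
        self.spent += tokens
        if self.spent > self.max_tokens:
            raise BudgetExceeded(
                f"spent {self.spent} of {self.max_tokens} token budget"
            )

def agent_loop(budget, steps):
    results = []
    for step in steps:
        budget.charge(step["tokens"])  # account for cost before using the output
        results.append(step["output"])
    return results
```

In production you would catch `BudgetExceeded` and degrade gracefully (return a partial result, escalate to a human) rather than hard-fail, but the core idea is the same: the cap is enforced in the loop, not discovered on the invoice.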
What I'd build if I were starting a platform team from scratch
Here's how I think about the differentiation. Most approaches treat agent deployment like application deployment: ship it, monitor it, patch it when it breaks.
That mental model is wrong.
Agents are better understood as distributed systems with non-deterministic components that make autonomous decisions with real-world consequences.
That means you need four disciplines working together:
Platform engineering
- Standardized runtime environments: same container, same config, same secrets management everywhere
- CI/CD pipelines for agents, prompts, and tools (yes, prompts need versioning and regression tests)
- Infrastructure-as-code for agent deployments so you can reproduce and roll back
- Internal developer platform so teams aren't reinventing the scaffolding every time

Reliability engineering
- Idempotency on every tool call: if a network blip causes a retry, you cannot send the email twice
- Circuit breakers on every external dependency: your agent should degrade gracefully, not fail catastrophically
- Saga patterns for multi-step workflows that touch multiple systems with real-world side effects
- Blast radius containment: scope what each agent can access, so a misconfigured prompt can't take down everything

Human-centered governance
- Graduated autonomy: not every decision should be fully automated; high-stakes actions need a human in the loop
- Approval workflows with SLAs so human review doesn't become a bottleneck that kills the value prop
- Full decision audit trails: who triggered it, what reasoning was used, what tools were called, what was changed
- Trust-building UX that shows operators what the agent is doing before the action is taken

FinOps discipline
- Token-level cost attribution: which agent, which task, which team is spending what
- Model routing optimization: you don't run GPT-5 on a task that Claude Haiku handles fine
- Budget guardrails that cut off runaway loops before they drain your API quota
- Cost-performance trade-off modeling: the right model at the right price for each task class
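To make the reliability point concrete, here is a minimal idempotency-key sketch (class and function names are hypothetical): a retried tool call returns the cached result instead of re-executing the side effect.

```python
import hashlib
import json

# Hedged sketch: dedupe tool calls by an idempotency key derived from the
# call's semantic content, so a network-blip retry cannot send the email twice.
class ToolExecutor:
    def __init__(self, send_fn):
        self.send_fn = send_fn
        self.seen = {}  # idempotency key -> cached result

    def call(self, tool, args):
        # Canonical JSON (sorted keys) makes the key stable across retries.
        key = hashlib.sha256(
            json.dumps({"tool": tool, "args": args}, sort_keys=True).encode()
        ).hexdigest()
        if key in self.seen:
            return self.seen[key]  # retry: cached result, no second side effect
        result = self.send_fn(tool, args)
        self.seen[key] = result
        return result
```

A production version would persist `seen` in a shared store with a TTL so retries survive process restarts, but the contract is identical: the same logical call executes exactly once.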
Without this operational foundation, you are running autonomous systems with the infrastructure maturity of a hackathon project.
That's not great when the agent has write access to your CRM, your email system, or your payment processor.
The question is whether you build it yourself from first principles (expensive, slow) or engage a team that has already mapped the failure modes and built the patterns.
This is the same choice engineering orgs faced with Kubernetes in 2017: roll your own or adopt battle-tested patterns. Most teams that tried to roll their own container orchestration in 2017 quietly migrated to Kubernetes by 2019.
Market Narrative & Urgency
Three software eras, the autonomous employee paradigm, and why Klarna's rollback is the cautionary case study
The three eras of enterprise AI through Karpathy's lens
Andrej Karpathy's Software 1.0/2.0/3.0 framework is the clearest mental model I've found for explaining why agentic AI demands a completely different operational approach.
The short version: Software 1.0 is code a human writes. Software 2.0 is behavior learned from data. Software 3.0 is systems directed by natural language. Each era introduced new failure modes that required new operational disciplines and most enterprises are still applying 1.0 operating models to 3.0 systems.
| Dimension | Software 1.0: Traditional ML/AI | Software 2.0: Standard GenAI | Software 3.0: Agentic AI |
|---|---|---|---|
| Execution model | Deterministic, stateless, explicit code paths | Single model inference, bounded context window, always human-initiated | Autonomous multi-step reasoning, persistent state, tool use, multi-agent coordination |
| Failure blast radius | One bad prediction fails in isolation. Blast radius = one data point | Response quality degrades, user notices, user retries | Cascading failures across tool calls, hand-offs, and multi-agent workflows. |
| Cost model | Predictable compute per inference | Token-based, moderately predictable with prompt engineering discipline | Highly variable. Reasoning depth, retry loops, and multi-agent message passing make costs 3-5x more volatile than estimates |
| Testing surface | Standard ML metrics: accuracy, precision, recall, F1. Reproducible | Human evaluation, red-teaming, prompt regression | Combinatorial explosion: reasoning paths x tool interactions x agent coordination x adversarial inputs. You cannot enumerate all failure modes |
| Compliance requirements | Model registry, data lineage, training data documentation | Prompt logging, basic audit trail, output filtering | Full decision audit: who authorized, what tools were invoked, what data was accessed, what real-world actions were taken and by whose authority |
| Operating model | MLOps: data pipelines, model registry, A/B testing, feature stores | Prompt engineering, LLMOps, output monitoring | AgentOps: all of the above plus orchestration, HITL governance, FinOps for variable workloads, and reliability engineering for non-deterministic systems |
The operating model has to evolve in lockstep with the software paradigm.
You wouldn't run a microservices architecture with the deployment practices you used for a monolith.
You wouldn't run a distributed database with the backup strategy you used for a single Postgres instance.
Yet most enterprises are trying to run Software 3.0 systems (agents that make autonomous multi-step decisions with real-world consequences) with the operating practices they built for Software 1.0.
Agents as autonomous employees
In February 2026, Karpathy ran an experiment he called nanochat: he set up eight agents (four running Claude, four running Codex), each with its own GPU, organized as a pseudo "research org" and tasked with doing ML research.
His summary: "The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at."
The agents were genuinely good at implementing well-scoped tasks. They were terrible at creative ideation and experiment design.
One agent "discovered" that increasing hidden size improves loss, a completely spurious result that would have sent a human researcher down the wrong path for weeks.
His key insight from the experiment: "You are now programming an organization. The 'source code' is the collection of prompts, skills, tools, and processes."
A daily standup is now "org code." And if agents are an organization, then deploying them without governance structures, performance management, and audit mechanisms is a management problem.
And management problems are harder to fix after the fact than before.
The Klarna rollback
The Klarna story is the case study I recommend every engineering leader read before they tell me their agent deployment is fine.
Klarna deployed an AI customer service agent that handled 2.3 million conversations per month, equivalent to 700 full-time human agents.
Resolution time dropped from 11 minutes to 2 minutes. The numbers looked spectacular. Then Klarna reversed course.
After the initial deployment success, Klarna's customer satisfaction scores began to slip. The feedback was consistent: "frustrated customers began voicing that the AI-driven service felt too rigid and soulless." The agent was fast and accurate on routine queries. It was incapable of the kind of empathetic, contextual judgment that a good customer service rep applies when a customer is genuinely distressed. Klarna ended up rehiring humans.
Their conclusion: "AI gives us speed. Talent gives us empathy. Together, we can deliver service that's fast when it should be, and empathic and personal when it needs to be." The operational implication: you need a graduated autonomy model that routes interactions based on complexity, emotional valence, and risk and not a binary "automate or don't automate" decision. That's a Human-in-the-Loop design problem, not a model quality problem.
Customer service is typically the #1 agent use case, followed by research and analysis. For large enterprises (10k+ employees), internal productivity is the top use case. These are high-stakes workflows, not the place to learn your operational lessons.
The real risks of not investing early
I'll be direct about this: the risk of moving too slowly is just as real as the risk of moving recklessly. The teams that get this right in 2026 will have compounding advantages by 2028, both in operational capability and in the organizational knowledge of how to run these systems safely.
The teams that wait for a governance framework to materialize on its own will be playing catch-up on both fronts simultaneously.
What actually triggers organizations to invest
In my experience, there are six events that reliably create urgency. Most of them involve something having already gone wrong. I'd rather talk to you before the incident than after.
1. The stuck PoC
- The PoC worked and everyone was excited, but production blocked on 'how do we actually run this?'
- Stakeholders want to scale, but the team is staring at a missing operational layer they don't know how to build
- Usually surfaces 6-8 weeks after the demo, when someone asks for a runbook

2. The incident
- An agent makes an incorrect decision with material impact: financial loss, customer harm, compliance flag
- The post-mortem can't reconstruct what happened because there's no observability layer
- Legal wants a paper trail that doesn't exist

3. The compliance deadline
- EU AI Act high-risk classification deadlines approaching
- Internal audit flags ungoverned AI systems, often discovered during a broader IT audit
- A regulator inquiry triggers a scramble to document what the agents are actually doing

4. The cost spike
- Cloud costs spike unexpectedly, traced to agent retry loops or unbounded reasoning chains
- Finance wants attribution and forecast models that engineering can't currently provide
- Usually the event that gets a VP Engineering to escalate the conversation

5. The executive mandate
- CEO or board mandates an AI strategy with measurable outcomes by a specific date
- Creates urgency without operational readiness, which is the worst combination
- Governance and risk management get added to the AI initiative scope post-hoc

6. The sprawl discovery
- IT leadership discovers five different teams using five different agent frameworks with no shared infrastructure
- Someone does the math on duplicated tooling, redundant integrations, and inconsistent security postures
- AgentOps becomes part of a broader platform rationalization initiative
Enterprises are shifting budgets toward scaled AI deployment with growing emphasis on ROI and scalable enterprise-grade solutions. Most AI PoCs expand into larger programs, but procurement and decision cycles are lengthening as enterprises demand measurable outcomes.
The hype cycle is compressing. The deals that are closing now are going to teams that can show operational credibility, not just a compelling demo.
Work Definition
Who feels this pain, what the actual problems are, and what we believe about how to solve them
What we believe
Here's the honest version of our vision statement: every company that deploys autonomous agents into production deserves the same operational discipline that serious engineering organizations apply to mission-critical infrastructure.
We believe the current state of "ship an agent, watch it with some logging, fix it when it breaks" is the equivalent of running a distributed database without monitoring, without failover, and without a documented runbook.
It works fine until it doesn't. And when it doesn't, the blast radius is large.
The scope of what we're building covers the full agent lifecycle: strategy and readiness assessment (do you know what you're actually trying to do?), architecture design (is the technical foundation sound?), production deployment (is it operationally ready?), ongoing operations (is it running safely and economically?), and organizational enablement (can your team maintain and evolve it without us?).
We're cloud-agnostic, framework-agnostic, and model-agnostic because the patterns that matter here transcend any specific technology choice, and vendor lock-in is a risk we actively help clients avoid.
Google's foundational paper on ML systems found that actual model code is less than 5% of a production AI system. The other 95% is data pipelines, feature stores, serving infrastructure, monitoring, testing, and governance, which is where the real engineering happens. This is even more true for agentic systems, where the "model" is one node in a larger orchestration graph, and the work should address that 95%.
Who actually feels this pain
Let's just be direct about who's having which conversation in their organization right now:
| Role | The 3 AM thought keeping them up | What they actually need |
|---|---|---|
| CIO | "We have five teams building five different agent stacks with no shared platform, no consistent security posture, and no idea what they're collectively spending. I'm going to find out about the first incident in a board meeting." | Platform strategy that consolidates fragmented efforts, predictable TCO, governance framework that scales, and a roadmap they can present to the board with confidence |
| CTO | "Our PoC sprawl is creating massive technical debt. Every team is building their own retry logic, their own observability shim, their own auth layer. Nobody's using the same framework. We're going to spend six months undoing this." | A sound reference architecture they can standardize on, engineering best practices they can enforce, and a platform that gives developers leverage without constraining them |
| CISO | "Agents have read/write access to production systems, external APIs, and customer data. They can be prompt-injected. They bypass our standard access control policies because nobody wrote policies for 'autonomous AI system.' This is not a small problem." | Agentic threat model, least-privilege access patterns, prompt injection defenses, data exfiltration controls, and compliance documentation for EU AI Act / NIST AI RMF |
| VP Engineering | "My team knows how to build features. They don't know how to operate non-deterministic systems at scale. When an agent breaks, nobody knows how to debug it. Our incident response playbooks don't cover this. I'm training people in real-time during production incidents." | Developer experience that makes the right patterns the easy path, incident taxonomies and runbooks, observability tooling that actually works for agent workflows, and team training that transfers knowledge |
| COO | "The automation project was supposed to reduce costs by 40%. Six months in, costs are actually up because we had three major rework cycles and our customer satisfaction scores dropped when the agent got it wrong. I need this to actually work, not just demo well." | Process automation that delivers measurable ROI, operational resilience that holds up under real workloads, and quality consistency that doesn't require constant human correction |
The real problem statements
These aren't theoretical problems. I've seen every one of these surface in production environments:
1. The demo-to-production gap
- Demo runs on curated data with patient observers. Production runs on adversarial inputs with users who will find every edge case in the first week.
- Missing: idempotency on tool calls, circuit breakers on external dependencies, observability on reasoning chains, cost guardrails on token consumption
- Closing the gap requires building the operational layer the agent runs on
- By the time teams realize this, they've usually already committed to a launch date

2. The governance and audit gap
- Agents make consequential decisions (financial, operational, customer-facing) without the audit trails that compliance and legal require
- When something goes wrong, post-mortems can't reconstruct the decision chain because nothing was logged at the right level of granularity
- EU AI Act, SEC guidance on AI-assisted financial decisions, and state-level regulations are creating concrete compliance requirements that current deployments don't satisfy
- The fix is building governance into the architecture from the start

3. Unpredictable economics
- Multi-step reasoning loops, retry cycles on flaky tool calls, and multi-agent message passing all multiply token costs in ways that initial estimates don't capture
- No established FinOps practices exist for agent workloads; teams are adapting cloud cost management patterns that don't map cleanly
- Cost allocation across teams and use cases is nearly impossible without purpose-built attribution tooling
- Budget alerts and spend caps that work for cloud compute don't translate directly to token-level economics

4. Framework fragmentation
- Different teams adopt LangGraph, LlamaIndex, AutoGen, and custom frameworks, none of them interoperable
- Every team rebuilds the same scaffolding: auth, retry logic, observability shims, deployment config
- No shared model evaluation infrastructure means different teams make model selection decisions with inconsistent data
- The longer this goes on, the more expensive the consolidation becomes, and someone will eventually have to consolidate
How we engage and why each mode exists
Different organizations are at different points in the same journey. We've structured engagement modes to match where you are, not where we'd like you to be:
| Engagement | Duration | When to use it | What you get |
|---|---|---|---|
| Advisory | 2–4 weeks | You have a mandate to build agent capabilities but no clear strategy or technical direction yet. You need to make the case internally before committing budget. | Strategy alignment, maturity assessment across 8 dimensions, prioritized roadmap, and executive-ready framing of what it will take to do this right |
| Assessment | 3–6 weeks | You already have agent deployments (or PoCs) and want an honest technical audit before you scale. "Are we building this right?" is the question. | Technical deep-dive, gap analysis against production-grade standards, architecture review, and risk assessment with specific remediation steps |
| Pilot Acceleration | 6–12 weeks | You have a PoC that works in controlled settings and need to get it to production-grade operations without building the entire platform from scratch. | A PoC transformed into a production-ready workload with full operational scaffolding: observability, reliability, HITL workflows, security hardening, and cost baseline |
| Platform Build | 12–24 weeks | You're ready to invest in a proper agent platform that multiple teams and use cases can run on. You want to build this once, correctly, and not redo it in two years. | A production-grade AgentOps platform: runtime environment, observability stack, governance layer, FinOps tooling, developer experience, and the operational runbooks to go with it |
| Managed Operations | Ongoing | You've built the platform and want operational excellence without standing up an internal AgentOps function from scratch. SLA-backed, with knowledge transfer built in. | 24/7 monitoring, incident management, continuous optimization, SLA-backed support, monthly operational reviews, and a quarterly maturity cycle to keep improving |
Service Lines & Modules
Eleven engineering initiatives, not consulting deliverables — here's what each one actually builds
I've deliberately structured these as engineering initiatives rather than consulting deliverables, because the distinction matters.
A consulting deliverable is a document that describes what you should build. An engineering initiative is a set of working artifacts, e.g. architecture decisions, deployed infrastructure, runbooks, and tested patterns, that your team inherits and can operate independently.
That's what these modules produce.
How the modules fit together
The eleven modules are composable.
Most organizations don't need all eleven at once, but they do need the three or four that directly address their current pain.
The dependency graph is roughly: M1 (strategy) and M2 (architecture) inform everything else. M3 (productionization) is the fastest path to unblocking a stuck PoC. M4 (reliability) and M7 (FinOps) are almost always deployed together because the same observability infrastructure serves both purposes. M5 (HITL) and M6 (security) are often bundled for compliance-driven engagements. M8 (orchestration) becomes relevant once you have more than two agents coordinating. M9-M11 are the maturity layer for organizations scaling past initial production deployments.
Strategy and architecture modules
These two modules are the foundation.
M1: Strategy and readiness assessment
- Maturity assessment across 8 dimensions (platform, observability, governance, security, FinOps, HITL, developer experience, organizational enablement), each scored on 5 levels
- Stakeholder interviews and alignment workshops: gets the CIO, CTO, CISO, and VP Engineering in the same room about what this actually requires
- Gap analysis with a prioritized remediation plan sorted by risk, impact, and implementation effort
- Deliverable: AgentOps Maturity Scorecard and a strategic roadmap that your leadership team can present and defend

M2: Architecture design
- Production-grade architecture across 15 reference layers: ingestion, routing, orchestration, tool registry, state management, observability, security, FinOps, and more
- Technology selection matrix with documented rationale: why this framework over that one, what the trade-offs are, what the exit paths look like
- Integration architecture for the enterprise systems your agents will actually touch: CRMs, ERPs, data warehouses, internal APIs
- Architecture Decision Records (ADRs) for key trade-offs, so future engineers understand why decisions were made, not just what was decided
Production and reliability modules
This is where Uber's Michelangelo lesson applies:
Uber's internal ML platform standardized the lifecycle across dozens of models because the alternative (each team building its own training, serving, and monitoring infrastructure) was both wasteful and brittle.
Their monitoring layer was explicitly "not optional" because models degrade silently unless you actively watch for it. That's doubly true for agents, where the failure modes are non-deterministic and the downstream consequences are real-world actions.
M3: PoC productionization
- Takes a working PoC and builds the operational layer around it: the difference between a demo that impresses and a system you can hand to an on-call engineer at 2 AM
- Reliability hardening: idempotent tool calls, exponential backoff with jitter, circuit breakers on external dependencies, timeout contracts
- Observability stack: distributed tracing of reasoning chains and tool calls, structured logging, alerting thresholds, evaluation pipelines for offline and online quality monitoring
- Security hardening and cost baseline: least-privilege access, prompt injection defenses, token budget constraints, and a documented cost model for the workload
M4: Reliability and observability engineering
- SLIs, SLOs, and SLAs adapted for non-deterministic workloads (harder than it sounds, because standard availability metrics don't capture quality degradation)
- Distributed tracing with OpenTelemetry extended for agentic workflows: trace propagation across agent hand-offs, tool calls, and multi-agent coordination
- Evaluation pipelines: offline regression testing for prompt and tool changes, online monitoring for production quality drift
- Incident taxonomy and response playbooks: classification of agent-specific failure modes (reasoning failures, tool call failures, coordination failures, cost blowouts) with documented response procedures
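To show what "tracing a reasoning chain" means in practice, here is a minimal, dependency-free stand-in for the span model (a real deployment would use the OpenTelemetry SDK; the `Tracer` class and its field names are invented for illustration). The point is the parent/child link: a tool call can always be traced back to the reasoning step that triggered it.

```python
import time
from contextlib import contextmanager

# Minimal stand-in for OpenTelemetry-style spans (no external dependency):
# record parent/child relationships so a tool call can be traced back to the
# reasoning step that triggered it.
class Tracer:
    def __init__(self):
        self.spans = []   # completed spans, innermost first
        self._stack = []  # currently open spans

    @contextmanager
    def span(self, name, **attrs):
        record = {
            "name": name,
            "parent": self._stack[-1]["name"] if self._stack else None,
            "attrs": attrs,
            "start": time.monotonic(),
        }
        self._stack.append(record)
        try:
            yield record
        finally:
            record["duration"] = time.monotonic() - record["start"]
            self._stack.pop()
            self.spans.append(record)
```

With real OpenTelemetry the same structure comes from context propagation across process boundaries, which is what makes multi-agent hand-offs reconstructable after the fact.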
M8: Orchestration engineering
- Workflow topologies with engineering trade-offs documented: sequential (simple, debuggable), parallel (faster, harder to coordinate), hierarchical (scalable, complex), and event-driven (responsive, operationally intensive)
- Durable state and checkpointing for long-running workflows, so an agent that fails 80% of the way through a multi-hour task can resume from the last checkpoint, not restart from zero
- Saga patterns for multi-step workflows with real-world side effects: compensating transactions when a downstream step fails after upstream steps have already committed
- Chaos engineering and load testing: deliberately injecting failures into the orchestration layer to find the blast radius before production traffic does
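The saga pattern mentioned above can be sketched minimally (function and step names are hypothetical): each committed step registers a compensating action, and a failure unwinds completed steps in reverse order.

```python
# Sketch of the saga pattern (names hypothetical): each step registers a
# compensating action; on failure, committed steps are undone in reverse order.
def run_saga(steps):
    """steps: list of (name, action, compensate) triples."""
    done = []  # (name, compensate) for every committed step
    try:
        for name, action, compensate in steps:
            action()
            done.append((name, compensate))
    except Exception:
        for name, compensate in reversed(done):
            compensate()  # undo real-world side effects already committed
        raise
    return [name for name, _ in done]
```

The reverse order matters: compensations must unwind the workflow the way a stack unwinds, so a refund is issued before a reservation is released, mirroring the commit order.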
Netflix's recommendation system, which drives 80% of what people actually watch, runs a three-stage pipeline: candidate generation, ranking, re-ranking.
Each stage is progressively more expensive, which means you don't run your most expensive model on everything.
This principle applies directly to agent orchestration: route simple tasks to fast, cheap models and complex reasoning to expensive frontier models.
The M4 and M8 modules build the infrastructure to implement this kind of tiered execution. Without it, you're running GPT-5 on tasks that a smaller model handles fine and paying 10x for the privilege.
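A tiered routing policy of that kind fits in a few lines. This is a hedged sketch: the model names, complexity thresholds, and per-token prices are illustrative assumptions, not real price sheets.

```python
# Hedged sketch: tiered model routing. Cheap model for routine tasks, frontier
# model only when complexity or risk justifies the cost. Model names and
# prices below are illustrative assumptions.
ROUTES = [
    # (max_complexity, model, cost_per_1k_tokens)
    (3, "small-fast-model", 0.0005),
    (7, "mid-tier-model", 0.003),
    (10, "frontier-model", 0.03),
]

def route(task_complexity, high_risk=False):
    """Pick a model tier for a task scored 0-10 on complexity."""
    if high_risk:
        return "frontier-model"  # risk overrides cost optimization
    for max_complexity, model, _cost in ROUTES:
        if task_complexity <= max_complexity:
            return model
    return "frontier-model"  # off-scale complexity falls through to the top tier
```

The `high_risk` override encodes the same lesson as the Klarna case: cost optimization is subordinate to risk classification, never the other way around.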
Governance, FinOps, and security modules
The Klarna story is the right frame here:
Speed alone isn't a product and governance is the layer that determines when the agent should act autonomously, when it should escalate to a human, and when it should refuse to proceed.
Get this wrong and you get the rigid, soulless experience that drove Klarna to rehire humans. Get it right and you get the "fast when it should be, empathic when it needs to be" system they were aiming for.
M5: Human-in-the-loop governance
- 4-tier autonomy model: Full Auto (routine, low-risk, high-confidence), Assisted (agent proposes, human approves), Supervised (human monitors in real-time), Manual (human executes, agent assists)
- Routing logic based on confidence score, task risk classification, and customer segment
- Approval checkpoint design with SLAs: human review that takes 8 hours kills the latency advantage of automation. We build the UX and process to keep review cycles under 15 minutes for most classes of decisions
- Escalation paths (L1-L4) with documented SLAs and reviewer calibration programs to maintain consistency
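The 4-tier routing logic can be sketched as a small decision function. The thresholds and risk categories here are assumptions for illustration; in practice they would be calibrated per task class and customer segment.

```python
# Illustrative sketch of the 4-tier autonomy model: route each decision to a
# tier based on confidence and risk. Thresholds below are assumptions.
def autonomy_tier(confidence, risk):
    """confidence: 0.0-1.0; risk: 'low' | 'medium' | 'high'."""
    if risk == "high":
        return "manual"        # human executes, agent assists
    if risk == "medium" or confidence < 0.7:
        return "assisted"      # agent proposes, human approves
    if confidence < 0.9:
        return "supervised"    # human monitors in real time
    return "full_auto"         # routine, low-risk, high-confidence
```

Note the asymmetry: risk gates before confidence, so a highly confident agent still cannot act autonomously on a high-risk decision. That ordering is the whole point of graduated autonomy.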
M6: Security engineering
- Agentic AI threat modeling: prompt injection (the SQL injection of agentic systems), tool call abuse, data exfiltration via reasoning traces, and multi-agent coordination attacks
- Least-privilege access architecture: every agent gets the minimum permissions required for its task scope
- Policy-as-code for agent authorization: guardrails that are version-controlled, testable, and auditable
- Compliance mapping and documentation for EU AI Act, NIST AI RMF, and ISO 42001, plus red-team exercises that stress-test the defenses before an external auditor does
- Token-level cost attribution by agent, task, team, and use case: the foundation for chargeback, budgeting, and ROI calculation
- Model routing optimization: a decision framework for which model class to use for which task type, with documented cost/quality/latency trade-offs
- Budget guardrails and spend caps: hard limits that cut off runaway loops before they drain the API quota, with graceful degradation rather than hard failures
- Forecast modeling: given your current agent fleet and usage patterns, here's what next quarter costs, with confidence intervals
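The budget-guardrail bullet is worth making concrete. A minimal spend-cap sketch, assuming an in-process per-agent tracker (`BudgetGuard` and its thresholds are illustrative; a real system persists spend centrally and alerts on state transitions):

```python
class BudgetGuard:
    """Hard cap with graceful degradation before the hard stop."""

    def __init__(self, daily_cap_usd: float, degrade_at: float = 0.8):
        self.cap = daily_cap_usd
        self.degrade_at = degrade_at  # fraction of cap where degradation starts
        self.spent = 0.0

    def record(self, cost_usd: float) -> None:
        self.spent += cost_usd

    def mode(self) -> str:
        """full -> degraded (cheaper model, reduced scope) -> halted."""
        if self.spent >= self.cap:
            return "halted"    # hard cut-off before a runaway loop drains quota
        if self.spent >= self.cap * self.degrade_at:
            return "degraded"  # graceful degradation rather than a hard failure
        return "full"
```

The two-stage design matters: agents that flip straight from "full" to "halted" fail loudly mid-task, while a degraded mode buys time for a human to look at why spend spiked.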
UX, operating model, and managed services
Engineers now need to be generalists, not specialists.
The operating model for agent systems blurs the lines between platform engineering, ML engineering, SRE, and product engineering. The last three modules are about making that transition manageable rather than chaotic.
- Operator cockpit: a fleet-level view of agent health, active workflows, cost burn rate, and SLO status: the thing your platform team needs to run agents at scale without being glued to a terminal
- Reasoning trace visualization: making the agent's decision chain legible to non-engineers, so product managers and business users can understand what happened when something goes wrong
- Intervention UX: pause, resume, cancel, and parameter override: designed for the specific latency constraints and cognitive load of operator workflows
- Trust signal design: confidence indicators, source attribution, and uncertainty communication: the UX patterns that build user trust rather than erode it
- Center of Excellence charter with new role definitions: AgentOps Engineer, Agent SRE, Prompt Engineer, and how they interact with existing platform and product engineering teams
- Training curriculum and certification paths are required because 'we'll figure it out as we go' produces inconsistent operational quality and high knowledge concentration risk
- Change management: the resistance to agent adoption is real, and it comes from engineers who are worried about their jobs, operators who don't trust the outputs, and executives who don't understand the risks
- Maturity model for continuous improvement: quarterly assessment cycles that measure progress against the 8-dimension framework and identify the next investment priorities
- 24/7 monitoring, incident management, and continuous optimization with SLA-backed support (99.5% platform availability)
- On-call coverage for agent-specific incidents: the failure modes are different enough from standard application incidents that you need specialists who've seen them before
- Monthly operational reviews with actionable FinOps reporting: not just what was spent, but why, and what the optimization opportunities are
- Quarterly maturity assessment and improvement plan, ensuring the platform doesn't drift back toward the operational debt it was built to replace
1. Eleven modules, composable: start with M1 (strategy) and M3 (productionization) if you're stuck between PoC and production
2. Each module produces working engineering artifacts, not just documentation
3. M4 (reliability) and M7 (FinOps) share infrastructure and are most effective when deployed together
4. M5 (HITL) and M6 (security) are the governance layer, essential for any deployment touching regulated decisions or sensitive data
5. The dependency graph flows from strategy through production through governance toward scale
Reference Architecture
Hyperscaler platforms, open standards, and the patterns we actually use in production
I've spent time with all four major hyperscaler platforms, and here's my honest assessment: none of them is dominant in all dimensions, and the choice usually comes down to where your data gravity already is. If your entire data estate lives in Databricks Delta Lake, fighting that to deploy on AWS is a tax you'll pay forever. If your org runs on M365 and Entra ID, Azure's 1,400+ connector ecosystem is genuinely compelling. Start from the constraint, not the features list.
Hyperscaler platform comparison
Each platform has a genuine differentiator, and each has a genuine blind spot. The table below is the version I'd actually put in a design doc.
| Capability | AWS AgentCore | Azure AI Foundry | GCP Vertex AI | Databricks Mosaic |
|---|---|---|---|---|
| Agent Runtime | MicroVM isolation, 8hr sessions | Cosmos DB state | Sessions + Memory Bank | Model Serving + Agent Bricks |
| Orchestration | Strands SDK, Bedrock Agents | Semantic Kernel, AutoGen | ADK (7M+ downloads), Genkit | ChatAgent, LangChain |
| Guardrails | Guardrails + Cedar Policy | Content Safety + Prompt Shields | Model Armor + SCC | AI Gateway (PII, content) |
| Evaluation | 13 built-in evaluators | AI Eval SDK + Red Teaming | User Simulator + Eval Service | MLflow judges, ALHF |
| Low-Code | Amazon Q (40+ connectors) | Copilot Studio (1,400+) | Agent Builder Console | AI Playground |
AWS leads in runtime isolation and deterministic policy enforcement: Cedar Policy evaluated outside the LLM reasoning loop is the right call, and MicroVM sandboxing is genuinely hard to replicate.
Azure leads in enterprise integration breadth: if your stack is already M365 + Entra ID + Dynamics, the 1,400+ Copilot Studio connectors are a legitimate accelerator, not lock-in.
GCP leads in open-source pedigree: ADK and the A2A protocol it seeded at the Linux Foundation reflect the "build the ecosystem first" strategy Google has used well before.
Databricks leads in data gravity: if your features and training data live in Unity Catalog Delta Lake, running inference on the same platform eliminates an entire class of data movement bugs. Pick based on where your data already lives, not on the feature matrix.
Open standards and protocol adoption
Remember when every team at a mid-size company had their own bespoke REST adapter for each internal service? You had the Salesforce adapter team, the SAP adapter team, and the "miscellaneous glue code" team that everyone quietly hated.
We spent a decade learning that lesson with service-oriented architecture. MCP and A2A exist because we're not going to learn it again. Build on standards from day one, or you'll be the person who has to migrate fifteen custom adapters when your third AI vendor gets acquired.
Model Context Protocol (MCP)
- Created by Anthropic, open specification, already the de facto standard
- How agents discover and invoke tools without custom adapters
- Adopted by SAP, Snowflake, AWS, Azure, GCP, critical mass achieved
- Think of it as OpenAPI for agents: one spec, universal tooling
Agent2Agent (A2A)
- Created by Google, now Linux Foundation governed
- How agents delegate tasks and collaborate across platform boundaries
- 100+ companies including AWS, Microsoft, SAP, Salesforce
- Without this, every multi-cloud agent system becomes a custom integration nightmare
OpenTelemetry (OTEL)
- CNCF standard; it won the observability format war, so use it
- GenAI semantic conventions now cover LLM calls and agent spans natively
- Supported by all hyperscalers + MLflow, LangSmith, Arize, Honeycomb
- If you build a proprietary trace format, you'll regret it within 6 months
Apache Iceberg
- Open table format, the Parquet of the data lake generation
- Supported across AWS, Azure, GCP, Databricks, Snowflake simultaneously
- Agents querying data through Iceberg work regardless of which compute they run on
- Delta Sharing for cross-platform data federation without moving bytes
Architecture pattern templates
These three patterns emerged from squinting at every agent deployment I've seen work at scale.
Pattern 1: Enterprise transaction agent
- Queries CRM, ERP, ITSM; executes transactions with idempotency keys
- Supervisor-collaborator topology: one planner, multiple specialist executors
- PII guardrails at ingress AND egress
- HITL escalation for anything with financial impact > $X or regulatory surface
Pattern 2: Data analytics agent
- Text-to-SQL with query validation before execution
- Calculations, visualizations, executive summaries
- Unity Catalog or Snowflake RBAC: column-level access control, not just table-level
- Batch inference for large datasets, i.e. don't use streaming inference where you don't need it
Pattern 3: Cross-cloud document processing pipeline
- Ingest via email, upload, or SoR events: async from the start
- Extraction → Validation → Action pipeline with checkpoint at each stage boundary
- A2A protocol for cross-cloud agent coordination, i.e. no custom RPC or XML
- OTEL tracing unified across all clouds
Nobody's running a single-provider shop anymore. Build your routing layer with that assumption baked in from the start. The cost of retrofitting model routing is roughly proportional to how much you regret not doing it initially.
Engineering Best Practices
The patterns that actually matter in production
Reliability patterns
Every one of these patterns has an origin story that involves a production incident. Let me save you some pagers and incident reports.
- Every agent operation must be safely retryable; this is non-negotiable
- Idempotency keys for all tool calls with side effects (writes, sends, transactions)
- Deduplication at the orchestration layer, not just at the tool level
- If you can't replay it safely, you can't retry it safely
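Idempotency keys are simpler than they sound. A minimal sketch, assuming a key derived from workflow ID, step name, and payload, with an in-memory dict standing in for what should be a durable deduplication store with TTLs:

```python
import hashlib
import json

_results: dict[str, object] = {}  # stand-in for a durable dedup store

def idempotency_key(workflow_id: str, step: str, payload: dict) -> str:
    # Canonical JSON so the same logical payload always yields the same key
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(f"{workflow_id}:{step}:{body}".encode()).hexdigest()

def call_tool_once(workflow_id: str, step: str, payload: dict, tool) -> object:
    """Retries with identical inputs return the cached result
    instead of re-executing the side effect (send, write, transaction)."""
    key = idempotency_key(workflow_id, step, payload)
    if key in _results:
        return _results[key]
    result = tool(payload)
    _results[key] = result
    return result
```

The important design choice is that deduplication lives at the orchestration layer, exactly as the bullet above says: the tool itself never has to know whether it's being retried.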
- Timeouts with exponential backoff for all external calls
- Circuit breakers trip after N consecutive failures, half-open probes before reset
- Backpressure signals prevent queue saturation from cascading into OOM
- I've seen agents burn $2-3K in tokens in 4 hours because a circuit breaker was missing
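Here is what the missing circuit breaker looks like when it exists. A sketch of the trip/half-open/reset cycle described above; the thresholds are illustrative and should be tuned per dependency:

```python
import time

class CircuitBreaker:
    """Trips open after N consecutive failures, then half-opens
    after a cooldown to let a single probe through before resetting."""

    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.max_failures:
            return True  # closed: normal operation
        # open: only allow a half-open probe once the cooldown elapses
        return time.monotonic() - self.opened_at >= self.reset_after_s

    def on_success(self) -> None:
        self.failures = 0  # probe succeeded: close the circuit

    def on_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
```

Wrap every external call in `allow()` / `on_success()` / `on_failure()` and a runaway retry loop stops costing tokens the moment the dependency goes down.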
- Model fallback chains: primary → secondary → cached response
- Reduce scope rather than fail entirely, a partial answer beats a 503
- Pre-define degraded-mode behavior per agent before you deploy, not after an incident
- Feature flags let you disable expensive capabilities under load without a deploy
- Isolate agents into failure domains, one agent's crash cannot cascade to the fleet
- Per-tenant and per-workflow isolation: noisy neighbor protection is table stakes
- The blast radius question to ask before every design: 'if this breaks completely, what else breaks?'
- MicroVM isolation (AWS AgentCore approach) is the right answer for untrusted agent code
I keep seeing teams invest months fine-tuning their prompts and picking the right model, then deploy with no retry logic, no circuit breakers, and no meaningful observability. The model can be swapped out in an afternoon but bad infrastructure architecture cannot.
Workflow and state patterns
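The core pattern here is durable execution: persist state at each phase boundary so a crashed workflow resumes from the last checkpoint instead of restarting from scratch. A minimal sketch, assuming a local JSON file as the checkpoint store and caller-supplied `(name, fn)` phases; a production system would use a durable workflow engine or database instead:

```python
import json
import os

def run_workflow(workflow_id: str, phases: list, store_dir: str = ".") -> dict:
    """Run phases in order, checkpointing after each one.

    Completed phases are skipped on re-run, so a crash mid-workflow
    resumes from the last checkpoint rather than re-executing side effects."""
    path = os.path.join(store_dir, f"{workflow_id}.ckpt.json")
    state = {"completed": [], "data": {}}
    if os.path.exists(path):  # crash recovery: load the last checkpoint
        with open(path) as f:
            state = json.load(f)
    for name, fn in phases:
        if name in state["completed"]:
            continue  # already done before the crash
        state["data"][name] = fn(state["data"])
        state["completed"].append(name)
        with open(path, "w") as f:  # persist after every phase boundary
            json.dump(state, f)
    return state
```

Each phase receives the accumulated data from earlier phases, which is what makes resumption meaningful: the recovered run picks up not just position but context.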
Deployment and release patterns
Netflix runs thousands of concurrent A/B experiments across their recommendation systems. They didn't get there by deploying features to 100% of users and hoping for the best. The same graduated rollout discipline applies to agents, maybe more so, because a bad prompt change can produce subtly wrong outputs for weeks before you detect it through quality metrics alone.
- Semantic versioning for agent, prompt, and tool changes
- Immutable deployment artifacts: no 'edit the prompt in the console' in production
- Prompt versioning with diff tracking: you need to know exactly what changed between v1.3 and v1.4
- Route 1-5% of traffic to new agent version; watch quality metrics
- Shadow runs: new version processes real requests in parallel without surfacing results
- Staged rollout: canary → 10% → 50% → 100% with explicit quality gates at each step
- Full trace replay from any production execution
- Deterministic replay for regression testing: fix a bug, prove you fixed it, prevent it forever
- Seed-based reproducibility where possible; document exactly where non-determinism is intentional vs accidental
Observability standards
- OpenTelemetry GenAI semantic conventions
- Spans for: LLM call, tool invocation, decision point, HITL checkpoint, cost accrual
- Custom attributes: token count, model version, confidence score, cost-per-call
- Trace correlation across multi-agent workflows
- Agent success rate: end-to-end task completion, not just 'did it return a 200'
- Latency p50/p95/p99 per agent step type, because LLM calls and tool calls have very different distributions
- Tool call reliability and error classification: transient vs permanent failures are handled differently
- Cost per successful agent execution: the SLO nobody writes down until they get the AWS bill
Here's what a minimal but complete OTEL trace configuration looks like for an agent workflow. This is the baseline I'd expect to see in any production deployment:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Wire up OTEL once at startup — every agent run inherits this context
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agentops.workflow")

def run_agent_step(agent_id: str, step_type: str, input_data: dict):
    with tracer.start_as_current_span(f"agent.{step_type}") as span:
        span.set_attribute("agent.id", agent_id)
        span.set_attribute("agent.step_type", step_type)
        span.set_attribute("agent.model", "claude-3-7-sonnet")
        # Record cost BEFORE the call — you need this even if the call fails
        span.set_attribute("agent.input_tokens_estimate", estimate_tokens(input_data))
        try:
            result = execute_step(agent_id, step_type, input_data)
            # Quality signal — not just latency
            span.set_attribute("agent.confidence_score", result.confidence)
            span.set_attribute("agent.output_tokens", result.token_usage.output)
            span.set_attribute("agent.cost_usd", result.cost)
            span.set_attribute("agent.step_success", True)
            return result
        except AgentToolError as e:
            # Tool failures and LLM failures are different SLIs
            span.set_attribute("agent.step_success", False)
            span.set_attribute("agent.error_type", "tool_failure")
            span.set_attribute("agent.error_class", type(e).__name__)
            span.record_exception(e)
            raise
        except LLMRateLimitError as e:
            # Rate limit = retriable; record separately for SLO math
            span.set_attribute("agent.error_type", "rate_limit")
            span.set_attribute("agent.retriable", True)
            span.record_exception(e)
            raise
```

(`estimate_tokens`, `execute_step`, `AgentToolError`, and `LLMRateLimitError` are application-specific pieces you supply; the OTEL wiring is the part that carries over.)

Quality is the #1 production blocker, ahead of latency and cost (cost is now a smaller concern as model prices continue falling). Quality problems are invisible without the right instrumentation. Your trace schema needs confidence scores and task completion signals built in from day one.
Human-in-the-Loop Operating Patterns
Graduated autonomy, approval checkpoints, escalation paths
Andrej Karpathy described Cursor AI's progression as an "autonomy slider": Tab completion → Cmd+K (edit on selection) → Cmd+L (chat with context) → Cmd+I (full agent mode).
Notice what he's describing: a graduated series of partial autonomy levels where each step hands off a bit more control. Rather than chasing full automation dreams, advocate for graduated levels of AI assistance that match what the system has actually demonstrated it can do reliably.
Risk-tiered autonomy model
Think of this like Tesla's Autopilot levels.
Level 2-3 automation, where the human remains responsible and in position to take over, is dramatically more practical at scale than jumping straight to Level 5.
Level 5 sounds better in a press release. Level 2-3 is what actually ships and stays shipped.
The graduated autonomy model assigns each agent action to a tier based on business impact, reversibility, and demonstrated confidence.
Critically: agents earn autonomy upgrades through measured performance.
| Tier | Mode | Human Role | Example |
|---|---|---|---|
| 1 | Full Auto | Post-hoc audit only | FAQ responses, data lookups, status checks |
| 2 | Supervised | Spot-check sampling (10–20%) | Email drafts, report generation, internal updates |
| 3 | Assisted | Pre-approval required | Customer communications, small transactions, schedule changes |
| 4 | Manual | Human executes with agent support | Large financial transactions, compliance decisions, patient care |
The HITL model is a deliberate product decision about where automation adds value and where human judgment remains the product.
Approval checkpoints and confidence thresholds
You'd let a brilliant intern draft the email. You would not let them send it without review if it's going to a regulator.
Approval checkpoints trigger on three signals: confidence score (below calibrated threshold → review), risk classification (high-impact actions always require approval, regardless of confidence), and policy rules (compliance-mandated gates that are non-negotiable).
Most teams start with thresholds that are too permissive, get burned once, and then overcorrect to requiring review on everything. The right answer is calibrated per agent and updated quarterly as you accumulate performance data.
- Thresholds calibrated per agent type and action class
- Calibration data: compare predicted confidence against actual outcome over 1,000+ runs
- Low-confidence flag triggers review; outcome fed back to calibration
- Overridden decisions are labeled training data
- High-risk actions require approval regardless of confidence score — no exceptions
- Risk classification runs outside the LLM loop
- Cedar Policy or Rego for action-level enforcement: if policy says no, the answer is no
- Compliance-mandated gates: EU AI Act, SOX, HIPAA each have their own non-negotiable checkpoints
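The three trigger signals combine into a single gate. A minimal sketch; the policy set, risk classes, and 0.9 threshold are illustrative placeholders, and in a real deployment the policy check runs outside the LLM loop via an engine like Cedar or Rego rather than in Python:

```python
# Compliance-mandated action classes that always require human review.
# Illustrative names — the real set comes from your policy-as-code repo.
POLICY_GATES = {"regulated_communication", "financial_transaction"}

def needs_human_approval(action: str, risk: str, confidence: float,
                         threshold: float = 0.9) -> bool:
    """Approval checkpoint: policy rules, then risk class, then confidence."""
    if action in POLICY_GATES:
        return True                # compliance-mandated: non-negotiable
    if risk == "high":
        return True                # high impact: approval regardless of confidence
    return confidence < threshold  # calibrated per agent and action class
```

The ordering encodes the principle from the bullets above: confidence only gets a vote after policy and risk have had theirs.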
Escalation paths
Design escalation paths before you need them. The SLA-driven auto-escalation is the part most teams forget: if L1 hasn't responded in 30 minutes, the system escalates automatically, not when someone notices the queue is backed up.
Escalation paths with SLA-driven auto-escalation — silence is not approval
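A sketch of that auto-escalation logic, assuming cumulative per-tier SLA budgets; the 30/60/120/240-minute values are illustrative, not a recommendation:

```python
# Per-tier SLA budgets in minutes (illustrative values)
SLA_MINUTES = {"L1": 30, "L2": 60, "L3": 120, "L4": 240}
LEVELS = ["L1", "L2", "L3", "L4"]

def current_level(minutes_waiting: float) -> str:
    """Which escalation tier owns a pending review right now.

    Once a tier's SLA budget is exhausted, the item moves up
    automatically; silence is never treated as approval."""
    elapsed = minutes_waiting
    for level in LEVELS:
        if elapsed < SLA_MINUTES[level]:
            return level
        elapsed -= SLA_MINUTES[level]
    return LEVELS[-1]  # all budgets exhausted: stays at the top tier
```

Running this on a scheduler against the pending-review queue is the piece most teams forget to build; the function itself is trivial, the discipline of wiring it up is not.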
How HITL evolves over time
HITL is smart training wheels, the kind that know when to come off.
The maturity path moves from 100% human review → statistical sampling → exception-only → full automation with audit.
Each transition requires crossing data-driven thresholds, not a calendar date or a management mandate. "We've processed 10,000 runs with 99.2% accuracy and zero compliance violations" is an example threshold.
The thing most teams miss: HITL isn't just a safety mechanism, it's your highest-quality training data pipeline.
Every time a human reviews, overrides, or corrects an agent decision, that's labeled ground truth.
Treat it as such. Log it, store it, use it to calibrate thresholds, and eventually use it to fine-tune. The teams that instrument their HITL workflows well end up with dramatically better agents in year two.
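Treating reviews as labeled data can be as simple as logging a structured record per decision. A sketch under that assumption; the record schema and helper names are illustrative:

```python
import time

def log_review(log: list, agent_id: str, prediction: str,
               confidence: float, human_decision: str) -> dict:
    """Append one human review as a labeled training example."""
    record = {
        "ts": time.time(),
        "agent_id": agent_id,
        "prediction": prediction,
        "confidence": confidence,
        "label": human_decision,                   # ground truth
        "overridden": prediction != human_decision,
    }
    log.append(record)
    return record

def override_rate(log: list) -> float:
    """Quarterly input to threshold calibration: how often humans disagree."""
    return sum(r["overridden"] for r in log) / len(log) if log else 0.0
```

Even this toy version gives you the two things the year-two payoff depends on: per-decision ground truth for fine-tuning, and an override rate to feed back into confidence-threshold calibration.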