AgentOps: Operating Discipline for Managing AI Agents at Scale

Introduction
It's 4 PM on a Friday. Let me tell you how your weekend is about to go.
Your AI sales agent just offered your largest Fortune 500 client a 50% discount on the enterprise contract without authorization. The agent was running autonomously, had read-access to your CRM, and decided, based on the customer's churn risk score, that this was the right move.
It was not the right move. The deal is now technically binding under the terms your legal team spent six months negotiating.
This scenario, documented by Composio in their 2025 incident report, is not hypothetical anymore. It's the kind of thing that happens when you deploy an autonomous agent into production without the operational scaffolding to constrain what it can do, log what it did, and let a human intervene before the blast radius expands.
Happy Friday.
What is AgentOps, really
Andrej Karpathy has a useful framing here: we have the LLM kernel, but no operating system. The model itself is maybe 5% of what you actually need to ship. That also tracks with what Google found in their foundational "Hidden Technical Debt in Machine Learning Systems" paper.
At Google scale, the actual ML model code represents less than 5% of a production AI system. The other 95% is data pipelines, feature stores, serving infrastructure, monitoring, testing, and all the unglamorous scaffolding that actually makes it work when nobody's watching.
AgentOps is that 95%. It's the operating discipline, not a product or a platform, that brings platform engineering rigor, distributed systems reliability, human-centered governance, and FinOps discipline into a single framework built specifically for the unique failure modes of autonomous agents.
This matters most for long-running tasks, such as a multi-hour campaign optimization workflow, where durable execution, checkpointing, and failure recovery are essential. The agent persists state at each phase, enabling crash recovery from the last checkpoint instead of restarting from zero.
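A minimal sketch of that checkpointing pattern (all names here are hypothetical, and a real system would use a durable store rather than a local file): state is persisted after every phase, and a restart skips phases that already completed.

```python
import json
import os
import tempfile  # used in the usage example below

# Hypothetical sketch: persist workflow state after each phase so a crash
# resumes from the last completed checkpoint instead of restarting from zero.
class CheckpointStore:
    def __init__(self, path):
        self.path = path

    def load(self):
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return {"completed": [], "state": {}}

    def save(self, checkpoint):
        # Write-then-rename so a crash mid-write never corrupts the checkpoint.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(checkpoint, f)
        os.replace(tmp, self.path)

def run_campaign(store, phases):
    cp = store.load()
    for name, fn in phases:
        if name in cp["completed"]:
            continue  # completed before the crash; skip on resume
        cp["state"][name] = fn(cp["state"])
        cp["completed"].append(name)
        store.save(cp)  # checkpoint after every phase
    return cp["state"]
```

The write-then-rename in `save` is the detail that matters: a crash during the write leaves the previous checkpoint intact, so recovery always has a consistent state to resume from.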

AgentOps should cover the full lifecycle: design, deployment, governance, scaling, and the operational practices that determine whether your agents are an asset or a liability.
The mental model I use: treat an agent like a new employee with perfect recall, no judgment, infinite speed, and zero accountability. You'd never onboard a hundred of those without HR policies, performance management, audit trails, and someone who can fire them when they go rogue.
AgentOps is the HR department, the compliance team, and the SRE org for your agent fleet.
Why you need this urgently now
The adoption curve on agentic AI is steep, and it's not waiting for your governance process to catch up. The market is projected to expand from $3.81 billion in 2025 to $71.91 billion by 2033, roughly a 44% CAGR.
More immediately: according to a 2026 CrewAI survey, 65% of enterprises already have agents running in production, and 81% are actively scaling deployments. It's already in your org chart, whether your platform team knows about it or not.
Most of these deployments are operating in the dark: ungoverned, unmonitored, and quietly billed to someone's departmental credit card. And yes, many organizations will say they have “agent observability” in place, but if you've looked under the hood, you already know how thin that often is. A small minority may be genuinely prepared. For everyone else, incidents are just waiting for the right moment.
The real productivity bottleneck isn't model quality anymore. And with model prices falling, cost is no longer the central question either. We've moved past “Can we afford to run this?” The question now is much more important: “Can we trust what it's doing?”
93% of AI agent projects failed before reaching production in 2025. Before you dismiss that as FUD, here's the engineering reason: teams optimized for demo success, not operational readiness.
A PoC that works in a notebook with hand-crafted inputs and a fresh API key tells you almost nothing about whether the system will behave correctly at 10x volume with real user inputs, stale context, downstream APIs, and a token budget that someone set without understanding what multi-step reasoning actually costs.
It is not a bold prediction that most agentic AI projects will be abandoned by 2027 because the operational discipline isn't there to run them. The gap between "it works in the demo" and "it works in production" is exactly the gap AgentOps fills.
The four ways current approaches fail
I've reviewed production agent deployments across dozens of organizations, and the failure patterns are remarkably consistent. Most teams get this wrong in one of four ways:
1. The proof of concept that can't ship
- Demo works beautifully with curated inputs and a patient demo audience
- No observability layer: when it breaks in production, no one can explain why
- No reliability engineering: retries, circuit breakers, and idempotency are all absent
- 60-70% of PoCs expand in scope, then hit the wall when someone asks 'how do we run this at scale?'

2. Shadow agent deployments
- Business units are already deploying agents with consumer-grade tooling and minimal security review
- IT and security teams have limited red-teaming knowledge and little visibility until an incident surfaces
- Each rogue deployment is an unaudited autonomous decision-maker with access to real systems
- By the time you find them, they've already made consequential decisions you can't undo

3. The observability gap
- Traditional APM (Datadog, New Relic) was built for deterministic request/response systems
- It cannot trace multi-step reasoning chains, tool-call sequences, or causal logic
- When an agent makes a bad decision, teams can't reconstruct what happened
- This is running a distributed system with no distributed tracing: you're flying blind

4. Runaway economics
- Multi-step reasoning loops with tool calls can burn 100x what a simple prompt costs
- Retry loops on tool calls compound the problem
- Total cost routinely exceeds initial estimates by 3-5x once agents hit real-world inputs
- No one has established FinOps practices for agent workloads; you're making it up as you go
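As a concrete illustration of the guardrail that's usually missing, here is a minimal hard token budget (class and field names are hypothetical) that stops a loop before retries compound into a runaway bill:

```python
# Illustrative sketch (names hypothetical): a hard token budget that stops a
# reasoning loop before retries compound into a runaway bill.
class BudgetExceeded(Exception):
    pass

class TokenBudget:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens):
        self.spent += tokens
        if self.spent > self.max_tokens:
            raise BudgetExceeded(
                f"spent {self.spent} of {self.max_tokens} token budget"
            )

def agent_loop(budget, steps):
    results = []
    for step in steps:
        budget.charge(step["tokens"])  # account for cost before using the output
        results.append(step["output"])
    return results
```

In production you would catch `BudgetExceeded` and degrade gracefully (return a partial result, escalate to a human) rather than hard-fail, but the core idea is the same: the cap is enforced in the loop, not discovered on the invoice.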
What I'd build if I were starting a platform team from scratch
Here's how I think about the differentiation. Most approaches treat agent deployment like application deployment: ship it, monitor it, patch it when it breaks.
That mental model is wrong.
Agents are better understood as distributed systems with non-deterministic components that make autonomous decisions with real-world consequences.
That means you need four disciplines working together:
Platform engineering
- Standardized runtime environments: same container, same config, same secrets management everywhere
- CI/CD pipelines for agents, prompts, and tools (yes, prompts need versioning and regression tests)
- Infrastructure-as-code for agent deployments so you can reproduce and roll back
- Internal developer platform so teams aren't reinventing the scaffolding every time

Reliability engineering
- Idempotency on every tool call: if a network blip causes a retry, you cannot send the email twice
- Circuit breakers on every external dependency: your agent should degrade gracefully, not fail catastrophically
- Saga patterns for multi-step workflows that touch multiple systems with real-world side effects
- Blast radius containment: scope what each agent can access, so a misconfigured prompt can't take down everything

Human-centered governance
- Graduated autonomy: not every decision should be fully automated; high-stakes actions need a human in the loop
- Approval workflows with SLAs so human review doesn't become a bottleneck that kills the value prop
- Full decision audit trails: who triggered it, what reasoning was used, what tools were called, what was changed
- Trust-building UX that shows operators what the agent is doing before the action is taken

FinOps discipline
- Token-level cost attribution: which agent, which task, which team is spending what
- Model routing optimization: you don't run GPT-5 on a task that Claude Haiku handles fine
- Budget guardrails that cut off runaway loops before they drain your API quota
- Cost-performance trade-off modeling: the right model at the right price for each task class
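To make the reliability point concrete, here is a minimal idempotency-key sketch (class and function names are hypothetical): a retried tool call returns the cached result instead of re-executing the side effect.

```python
import hashlib
import json

# Hedged sketch: dedupe tool calls by an idempotency key derived from the
# call's semantic content, so a network-blip retry cannot send the email twice.
class ToolExecutor:
    def __init__(self, send_fn):
        self.send_fn = send_fn
        self.seen = {}  # idempotency key -> cached result

    def call(self, tool, args):
        # Canonical JSON (sorted keys) makes the key stable across retries.
        key = hashlib.sha256(
            json.dumps({"tool": tool, "args": args}, sort_keys=True).encode()
        ).hexdigest()
        if key in self.seen:
            return self.seen[key]  # retry: cached result, no second side effect
        result = self.send_fn(tool, args)
        self.seen[key] = result
        return result
```

A production version would persist `seen` in a shared store with a TTL so retries survive process restarts, but the contract is identical: the same logical call executes exactly once.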
Without this operational foundation, you are running autonomous systems with the infrastructure maturity of a hackathon project.
That's not great when the agent has write access to your CRM, your email system, or your payment processor.
The question is whether you build it yourself from first principles (expensive, slow) or engage a team that has already mapped the failure modes and built the patterns.
This is the same choice engineering orgs faced with Kubernetes in 2017: roll your own or adopt battle-tested patterns. Most teams that tried to roll their own container orchestration in 2017 quietly migrated to Kubernetes by 2019.
Market Narrative & Urgency
Three software eras, the autonomous employee paradigm, and why Klarna's rollback is the cautionary case study
The three eras of enterprise AI through Karpathy's lens
Andrej Karpathy's Software 1.0/2.0/3.0 framework is the clearest mental model I've found for explaining why agentic AI demands a completely different operational approach.
The short version: Software 1.0 is code a human writes. Software 2.0 is behavior learned from data. Software 3.0 is systems directed by natural language. Each era introduced new failure modes that required new operational disciplines and most enterprises are still applying 1.0 operating models to 3.0 systems.
| Dimension | Software 1.0: Traditional ML/AI | Software 2.0: Standard GenAI | Software 3.0: Agentic AI |
|---|---|---|---|
| Execution model | Deterministic, stateless, explicit code paths | Single model inference, bounded context window, always human-initiated | Autonomous multi-step reasoning, persistent state, tool use, multi-agent coordination |
| Failure blast radius | One bad prediction fails in isolation. Blast radius = one data point | Response quality degrades, user notices, user retries | Cascading failures across tool calls, hand-offs, and multi-agent workflows. |
| Cost model | Predictable compute per inference | Token-based, moderately predictable with prompt engineering discipline | Highly variable. Reasoning depth, retry loops, and multi-agent message passing make costs 3-5x more volatile than estimates |
| Testing surface | Standard ML metrics: accuracy, precision, recall, F1. Reproducible | Human evaluation, red-teaming, prompt regression | Combinatorial explosion: reasoning paths x tool interactions x agent coordination x adversarial inputs. You cannot enumerate all failure modes |
| Compliance requirements | Model registry, data lineage, training data documentation | Prompt logging, basic audit trail, output filtering | Full decision audit: who authorized, what tools were invoked, what data was accessed, what real-world actions were taken and by whose authority |
| Operating model | MLOps: data pipelines, model registry, A/B testing, feature stores | Prompt engineering, LLMOps, output monitoring | AgentOps: all of the above plus orchestration, HITL governance, FinOps for variable workloads, and reliability engineering for non-deterministic systems |
The operating model has to evolve in lockstep with the software paradigm.
You wouldn't run a microservices architecture with the deployment practices you used for a monolith.
You wouldn't run a distributed database with the backup strategy you used for a single Postgres instance.
Yet most enterprises are trying to run Software 3.0 systems (agents that make autonomous multi-step decisions with real-world consequences) with the operating practices they built for Software 1.0.
Agents as autonomous employees
In February 2026, Karpathy ran an experiment he called nanochat: he set up eight agents (four running Claude, four running Codex), each with its own GPU, organized as a pseudo "research org" and tasked with doing ML research.
His summary: "The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at."
The agents were genuinely good at implementing well-scoped tasks. They were terrible at creative ideation and experiment design.
One agent "discovered" that increasing hidden size improves loss, a completely spurious result that would have sent a human researcher down the wrong path for weeks.
His key insight from the experiment: "You are now programming an organization. The 'source code' is the collection of prompts, skills, tools, and processes."
A daily standup is now "org code." And if agents are an organization, then deploying them without governance structures, performance management, and audit mechanisms is a management problem.
And management problems are harder to fix after the fact than before.
The Klarna rollback
The Klarna story is the case study I recommend every engineering leader read before they tell me their agent deployment is fine.
Klarna deployed an AI customer service agent that handled 2.3 million conversations per month, equivalent to 700 full-time human agents.
Resolution time dropped from 11 minutes to 2 minutes. The numbers looked spectacular. Then Klarna reversed course.
After the initial deployment success, Klarna's customer satisfaction scores began to slip. The feedback was consistent: "frustrated customers began voicing that the AI-driven service felt too rigid and soulless." The agent was fast and accurate on routine queries. It was incapable of the kind of empathetic, contextual judgment that a good customer service rep applies when a customer is genuinely distressed. Klarna ended up rehiring humans.
Their conclusion: "AI gives us speed. Talent gives us empathy. Together, we can deliver service that's fast when it should be, and empathic and personal when it needs to be." The operational implication: you need a graduated autonomy model that routes interactions based on complexity, emotional valence, and risk and not a binary "automate or don't automate" decision. That's a Human-in-the-Loop design problem, not a model quality problem.
Customer service is typically the #1 agent use case, followed by research and analysis. For large enterprises (10k+ employees), internal productivity is the top use case. These are high-stakes workflows, not the place to learn your operational lessons.
The real risks of not investing early
I'll be direct about this: the risk of moving too slowly is just as real as the risk of moving recklessly. The teams that get this right in 2026 will have compounding advantages by 2028, both in operational capability and in the organizational knowledge of how to run these systems safely.
The teams that wait for a governance framework to materialize on its own will be playing catch-up on both fronts simultaneously.
What actually triggers organizations to invest
In my experience, there are six events that reliably create urgency. Most of them involve something having already gone wrong. I'd rather talk to you before the incident than after.
1. The stuck PoC
- The PoC worked and everyone was excited, but production blocked on 'how do we actually run this?'
- Stakeholders want to scale, but the team is staring at a missing operational layer they don't know how to build
- Usually surfaces 6-8 weeks after the demo, when someone asks for a runbook

2. The incident
- An agent makes an incorrect decision with material impact: financial loss, customer harm, compliance flag
- The post-mortem can't reconstruct what happened because there's no observability layer
- Legal wants a paper trail that doesn't exist

3. The compliance deadline
- EU AI Act high-risk classification deadlines approaching
- Internal audit flags ungoverned AI systems, often discovered during a broader IT audit
- A regulator inquiry triggers a scramble to document what the agents are actually doing

4. The cost spike
- Cloud costs spike unexpectedly, traced to agent retry loops or unbounded reasoning chains
- Finance wants attribution and forecast models that engineering can't currently provide
- Usually the event that gets a VP Engineering to escalate the conversation

5. The executive mandate
- CEO or board mandates an AI strategy with measurable outcomes by a specific date
- Creates urgency without operational readiness, which is the worst combination
- Governance and risk management get added to the AI initiative scope post-hoc

6. The sprawl discovery
- IT leadership discovers five different teams using five different agent frameworks with no shared infrastructure
- Someone does the math on duplicated tooling, redundant integrations, and inconsistent security postures
- AgentOps becomes part of a broader platform rationalization initiative
Enterprises are shifting budgets toward scaled AI deployment with growing emphasis on ROI and scalable enterprise-grade solutions. Most AI PoCs expand into larger programs, but procurement and decision cycles are lengthening as enterprises demand measurable outcomes.
The hype cycle is compressing. The deals that are closing now are going to teams that can show operational credibility, not just a compelling demo.
Work Definition
Who feels this pain, what the actual problems are, and what we believe about how to solve them
What we believe
Here's the honest version of our vision statement: every company that deploys autonomous agents into production deserves the same operational discipline that serious engineering organizations apply to mission-critical infrastructure.
We believe the current state of "ship an agent, watch it with some logging, fix it when it breaks" is the equivalent of running a distributed database without monitoring, without failover, and without a documented runbook.
It works fine until it doesn't. And when it doesn't, the blast radius is large.
The scope of what we're building covers the full agent lifecycle: strategy and readiness assessment (do you know what you're actually trying to do?), architecture design (is the technical foundation sound?), production deployment (is it operationally ready?), ongoing operations (is it running safely and economically?), and organizational enablement (can your team maintain and evolve it without us?).
We're cloud-agnostic, framework-agnostic, and model-agnostic because the patterns that matter here transcend any specific technology choice, and vendor lock-in is a risk we actively help clients avoid.
Google's foundational paper on ML systems found that actual model code is less than 5% of a production AI system. The other 95% is data pipelines, feature stores, serving infrastructure, monitoring, testing, and governance, which is where the real engineering happens. This is even more true for agentic systems, where the "model" is one node in a larger orchestration graph, and the work should address that 95%.
Who actually feels this pain
Let's just be direct about who's having which conversation in their organization right now:
| Role | The 3 AM thought keeping them up | What they actually need |
|---|---|---|
| CIO | "We have five teams building five different agent stacks with no shared platform, no consistent security posture, and no idea what they're collectively spending. I'm going to find out about the first incident in a board meeting." | Platform strategy that consolidates fragmented efforts, predictable TCO, governance framework that scales, and a roadmap they can present to the board with confidence |
| CTO | "Our PoC sprawl is creating massive technical debt. Every team is building their own retry logic, their own observability shim, their own auth layer. Nobody's using the same framework. We're going to spend six months undoing this." | A sound reference architecture they can standardize on, engineering best practices they can enforce, and a platform that gives developers leverage without constraining them |
| CISO | "Agents have read/write access to production systems, external APIs, and customer data. They can be prompt-injected. They bypass our standard access control policies because nobody wrote policies for 'autonomous AI system.' This is not a small problem." | Agentic threat model, least-privilege access patterns, prompt injection defenses, data exfiltration controls, and compliance documentation for EU AI Act / NIST AI RMF |
| VP Engineering | "My team knows how to build features. They don't know how to operate non-deterministic systems at scale. When an agent breaks, nobody knows how to debug it. Our incident response playbooks don't cover this. I'm training people in real-time during production incidents." | Developer experience that makes the right patterns the easy path, incident taxonomies and runbooks, observability tooling that actually works for agent workflows, and team training that transfers knowledge |
| COO | "The automation project was supposed to reduce costs by 40%. Six months in, costs are actually up because we had three major rework cycles and our customer satisfaction scores dropped when the agent got it wrong. I need this to actually work, not just demo well." | Process automation that delivers measurable ROI, operational resilience that holds up under real workloads, and quality consistency that doesn't require constant human correction |
The real problem statements
These aren't theoretical problems. I've seen every one of these surface in production environments:
1. The demo-to-production gap
- Demo runs on curated data with patient observers. Production runs on adversarial inputs with users who will find every edge case in the first week.
- Missing: idempotency on tool calls, circuit breakers on external dependencies, observability on reasoning chains, cost guardrails on token consumption
- Closing the gap requires building the operational layer the agent runs on
- By the time teams realize this, they've usually already committed to a launch date

2. The governance and audit gap
- Agents make consequential decisions (financial, operational, customer-facing) without the audit trails that compliance and legal require
- When something goes wrong, post-mortems can't reconstruct the decision chain because nothing was logged at the right level of granularity
- EU AI Act, SEC guidance on AI-assisted financial decisions, and state-level regulations are creating concrete compliance requirements that current deployments don't satisfy
- The fix is building governance into the architecture from the start

3. Unpredictable economics
- Multi-step reasoning loops, retry cycles on flaky tool calls, and multi-agent message passing all multiply token costs in ways that initial estimates don't capture
- No established FinOps practices exist for agent workloads; teams are adapting cloud cost management patterns that don't map cleanly
- Cost allocation across teams and use cases is nearly impossible without purpose-built attribution tooling
- Budget alerts and spend caps that work for cloud compute don't translate directly to token-level economics

4. Framework fragmentation
- Different teams adopt LangGraph, LlamaIndex, AutoGen, and custom frameworks, none of them interoperable
- Every team rebuilds the same scaffolding: auth, retry logic, observability shims, deployment config
- No shared model evaluation infrastructure means different teams make model selection decisions with inconsistent data
- The longer this goes on, the more expensive the consolidation becomes, and someone will eventually have to consolidate
How we engage and why each mode exists
Different organizations are at different points in the same journey. We've structured engagement modes to match where you are, not where we'd like you to be:
| Engagement | Duration | When to use it | What you get |
|---|---|---|---|
| Advisory | 2–4 weeks | You have a mandate to build agent capabilities but no clear strategy or technical direction yet. You need to make the case internally before committing budget. | Strategy alignment, maturity assessment across 8 dimensions, prioritized roadmap, and executive-ready framing of what it will take to do this right |
| Assessment | 3–6 weeks | You already have agent deployments (or PoCs) and want an honest technical audit before you scale. "Are we building this right?" is the question. | Technical deep-dive, gap analysis against production-grade standards, architecture review, and risk assessment with specific remediation steps |
| Pilot Acceleration | 6–12 weeks | You have a PoC that works in controlled settings and need to get it to production-grade operations without building the entire platform from scratch. | A PoC transformed into a production-ready workload with full operational scaffolding: observability, reliability, HITL workflows, security hardening, and cost baseline |
| Platform Build | 12–24 weeks | You're ready to invest in a proper agent platform that multiple teams and use cases can run on. You want to build this once, correctly, and not redo it in two years. | A production-grade AgentOps platform: runtime environment, observability stack, governance layer, FinOps tooling, developer experience, and the operational runbooks to go with it |
| Managed Operations | Ongoing | You've built the platform and want operational excellence without standing up an internal AgentOps function from scratch. SLA-backed, with knowledge transfer built in. | 24/7 monitoring, incident management, continuous optimization, SLA-backed support, monthly operational reviews, and a quarterly maturity cycle to keep improving |
Service Lines & Modules
Eleven engineering initiatives, not consulting deliverables — here's what each one actually builds
I've deliberately structured these as engineering initiatives rather than consulting deliverables, because the distinction matters.
A consulting deliverable is a document that describes what you should build. An engineering initiative is a set of working artifacts, e.g. architecture decisions, deployed infrastructure, runbooks, and tested patterns, that your team inherits and can operate independently.
That's what these modules produce.
How the modules fit together
The eleven modules are composable.
Most organizations don't need all eleven at once, but they do need the three or four that directly address their current pain.
The dependency graph is roughly: M1 (strategy) and M2 (architecture) inform everything else. M3 (productionization) is the fastest path to unblocking a stuck PoC. M4 (reliability) and M7 (FinOps) are almost always deployed together because the same observability infrastructure serves both purposes. M5 (HITL) and M6 (security) are often bundled for compliance-driven engagements. M8 (orchestration) becomes relevant once you have more than two agents coordinating. M9-M11 are the maturity layer for organizations scaling past initial production deployments.
Strategy and architecture modules
These two modules are the foundation.
M1: Strategy and readiness assessment
- Maturity assessment across 8 dimensions (platform, observability, governance, security, FinOps, HITL, developer experience, organizational enablement), each scored on 5 levels
- Stakeholder interviews and alignment workshops: gets the CIO, CTO, CISO, and VP Engineering in the same room about what this actually requires
- Gap analysis with a prioritized remediation plan sorted by risk, impact, and implementation effort
- Deliverable: AgentOps Maturity Scorecard and a strategic roadmap that your leadership team can present and defend

M2: Architecture design
- Production-grade architecture across 15 reference layers: ingestion, routing, orchestration, tool registry, state management, observability, security, FinOps, and more
- Technology selection matrix with documented rationale: why this framework over that one, what the trade-offs are, what the exit paths look like
- Integration architecture for the enterprise systems your agents will actually touch: CRMs, ERPs, data warehouses, internal APIs
- Architecture Decision Records (ADRs) for key trade-offs, so future engineers understand why decisions were made, not just what was decided
Production and reliability modules
This is where Uber's Michelangelo lesson applies:
Uber's internal ML platform standardized the lifecycle across dozens of models because the alternative (each team building its own training, serving, and monitoring infrastructure) was both wasteful and brittle.
Their monitoring layer was explicitly "not optional" because models degrade silently unless you actively watch for it. That's doubly true for agents, where the failure modes are non-deterministic and the downstream consequences are real-world actions.
M3: PoC productionization
- Takes a working PoC and builds the operational layer around it: the difference between a demo that impresses and a system you can hand to an on-call engineer at 2 AM
- Reliability hardening: idempotent tool calls, exponential backoff with jitter, circuit breakers on external dependencies, timeout contracts
- Observability stack: distributed tracing of reasoning chains and tool calls, structured logging, alerting thresholds, evaluation pipelines for offline and online quality monitoring
- Security hardening and cost baseline: least-privilege access, prompt injection defenses, token budget constraints, and a documented cost model for the workload
M4: Reliability and observability engineering
- SLIs, SLOs, and SLAs adapted for non-deterministic workloads (harder than it sounds, because standard availability metrics don't capture quality degradation)
- Distributed tracing with OpenTelemetry extended for agentic workflows: trace propagation across agent hand-offs, tool calls, and multi-agent coordination
- Evaluation pipelines: offline regression testing for prompt and tool changes, online monitoring for production quality drift
- Incident taxonomy and response playbooks: classification of agent-specific failure modes (reasoning failures, tool call failures, coordination failures, cost blowouts) with documented response procedures
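To show what "tracing a reasoning chain" means in practice, here is a minimal, dependency-free stand-in for the span model (a real deployment would use the OpenTelemetry SDK; the `Tracer` class and its field names are invented for illustration). The point is the parent/child link: a tool call can always be traced back to the reasoning step that triggered it.

```python
import time
from contextlib import contextmanager

# Minimal stand-in for OpenTelemetry-style spans (no external dependency):
# record parent/child relationships so a tool call can be traced back to the
# reasoning step that triggered it.
class Tracer:
    def __init__(self):
        self.spans = []   # completed spans, innermost first
        self._stack = []  # currently open spans

    @contextmanager
    def span(self, name, **attrs):
        record = {
            "name": name,
            "parent": self._stack[-1]["name"] if self._stack else None,
            "attrs": attrs,
            "start": time.monotonic(),
        }
        self._stack.append(record)
        try:
            yield record
        finally:
            record["duration"] = time.monotonic() - record["start"]
            self._stack.pop()
            self.spans.append(record)
```

With real OpenTelemetry the same structure comes from context propagation across process boundaries, which is what makes multi-agent hand-offs reconstructable after the fact.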
M8: Orchestration engineering
- Workflow topologies with engineering trade-offs documented: sequential (simple, debuggable), parallel (faster, harder to coordinate), hierarchical (scalable, complex), and event-driven (responsive, operationally intensive)
- Durable state and checkpointing for long-running workflows, so an agent that fails 80% of the way through a multi-hour task can resume from the last checkpoint, not restart from zero
- Saga patterns for multi-step workflows with real-world side effects: compensating transactions when a downstream step fails after upstream steps have already committed
- Chaos engineering and load testing: deliberately injecting failures into the orchestration layer to find the blast radius before production traffic does
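The saga pattern mentioned above can be sketched minimally (function and step names are hypothetical): each committed step registers a compensating action, and a failure unwinds completed steps in reverse order.

```python
# Sketch of the saga pattern (names hypothetical): each step registers a
# compensating action; on failure, committed steps are undone in reverse order.
def run_saga(steps):
    """steps: list of (name, action, compensate) triples."""
    done = []  # (name, compensate) for every committed step
    try:
        for name, action, compensate in steps:
            action()
            done.append((name, compensate))
    except Exception:
        for name, compensate in reversed(done):
            compensate()  # undo real-world side effects already committed
        raise
    return [name for name, _ in done]
```

The reverse order matters: compensations must unwind the workflow the way a stack unwinds, so a refund is issued before a reservation is released, mirroring the commit order.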
Netflix's recommendation system, which drives 80% of what people actually watch, runs a three-stage pipeline: candidate generation, ranking, re-ranking.
Each stage is progressively more expensive, which means you don't run your most expensive model on everything.
This principle applies directly to agent orchestration: route simple tasks to fast, cheap models and complex reasoning to expensive frontier models.
The M4 and M8 modules build the infrastructure to implement this kind of tiered execution. Without it, you're running GPT-5 on tasks that a smaller model handles fine and paying 10x for the privilege.
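A tiered routing policy of that kind fits in a few lines. This is a hedged sketch: the model names, complexity thresholds, and per-token prices are illustrative assumptions, not real price sheets.

```python
# Hedged sketch: tiered model routing. Cheap model for routine tasks, frontier
# model only when complexity or risk justifies the cost. Model names and
# prices below are illustrative assumptions.
ROUTES = [
    # (max_complexity, model, cost_per_1k_tokens)
    (3, "small-fast-model", 0.0005),
    (7, "mid-tier-model", 0.003),
    (10, "frontier-model", 0.03),
]

def route(task_complexity, high_risk=False):
    """Pick a model tier for a task scored 0-10 on complexity."""
    if high_risk:
        return "frontier-model"  # risk overrides cost optimization
    for max_complexity, model, _cost in ROUTES:
        if task_complexity <= max_complexity:
            return model
    return "frontier-model"  # off-scale complexity falls through to the top tier
```

The `high_risk` override encodes the same lesson as the Klarna case: cost optimization is subordinate to risk classification, never the other way around.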
Governance, FinOps, and security modules
The Klarna story is the right frame here:
Speed alone isn't a product and governance is the layer that determines when the agent should act autonomously, when it should escalate to a human, and when it should refuse to proceed.
Get this wrong and you get the rigid, soulless experience that drove Klarna to rehire humans. Get it right and you get the "fast when it should be, empathic when it needs to be" system they were aiming for.
M5: Human-in-the-loop governance
- 4-tier autonomy model: Full Auto (routine, low-risk, high-confidence), Assisted (agent proposes, human approves), Supervised (human monitors in real-time), Manual (human executes, agent assists)
- Routing logic based on confidence score, task risk classification, and customer segment
- Approval checkpoint design with SLAs: human review that takes 8 hours kills the latency advantage of automation. We build the UX and process to keep review cycles under 15 minutes for most classes of decisions
- Escalation paths (L1-L4) with documented SLAs and reviewer calibration programs to maintain consistency
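The 4-tier routing logic can be sketched as a small decision function. The thresholds and risk categories here are assumptions for illustration; in practice they would be calibrated per task class and customer segment.

```python
# Illustrative sketch of the 4-tier autonomy model: route each decision to a
# tier based on confidence and risk. Thresholds below are assumptions.
def autonomy_tier(confidence, risk):
    """confidence: 0.0-1.0; risk: 'low' | 'medium' | 'high'."""
    if risk == "high":
        return "manual"        # human executes, agent assists
    if risk == "medium" or confidence < 0.7:
        return "assisted"      # agent proposes, human approves
    if confidence < 0.9:
        return "supervised"    # human monitors in real time
    return "full_auto"         # routine, low-risk, high-confidence
```

Note the asymmetry: risk gates before confidence, so a highly confident agent still cannot act autonomously on a high-risk decision. That ordering is the whole point of graduated autonomy.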
M6: Security engineering
- Agentic AI threat modeling: prompt injection (the SQL injection of agentic systems), tool call abuse, data exfiltration via reasoning traces, and multi-agent coordination attacks
- Least-privilege access architecture: every agent gets the minimum permissions required for its task scope
- Policy-as-code for agent authorization: guardrails that are version-controlled, testable, and auditable
- Compliance mapping and documentation for EU AI Act, NIST AI RMF, and ISO 42001, plus red-team exercises that stress-test the defenses before an external auditor does
- Token-level cost attribution by agent, task, team, and use case: the foundation for chargeback, budgeting, and ROI calculation
- Model routing optimization: a decision framework for which model class to use for which task type, with documented cost/quality/latency trade-offs
- Budget guardrails and spend caps: hard limits that cut off runaway loops before they drain the API quota, with graceful degradation rather than hard failures
- Forecast modeling: given your current agent fleet and usage patterns, here's what next quarter costs, with confidence intervals
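The budget-guardrail bullet is worth making concrete. A minimal spend-cap sketch, assuming an in-process per-agent tracker (`BudgetGuard` and its thresholds are illustrative; a real system persists spend centrally and alerts on state transitions):

```python
class BudgetGuard:
    """Hard cap with graceful degradation before the hard stop."""

    def __init__(self, daily_cap_usd: float, degrade_at: float = 0.8):
        self.cap = daily_cap_usd
        self.degrade_at = degrade_at  # fraction of cap where degradation starts
        self.spent = 0.0

    def record(self, cost_usd: float) -> None:
        self.spent += cost_usd

    def mode(self) -> str:
        """full -> degraded (cheaper model, reduced scope) -> halted."""
        if self.spent >= self.cap:
            return "halted"    # hard cut-off before a runaway loop drains quota
        if self.spent >= self.cap * self.degrade_at:
            return "degraded"  # graceful degradation rather than a hard failure
        return "full"
```

The two-stage design matters: agents that flip straight from "full" to "halted" fail loudly mid-task, while a degraded mode buys time for a human to look at why spend spiked.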
UX, operating model, and managed services
Engineers now need to be generalists, not specialists.
The operating model for agent systems blurs the lines between platform engineering, ML engineering, SRE, and product engineering. The last three modules are about making that transition manageable rather than chaotic.
- Operator cockpit: a fleet-level view of agent health, active workflows, cost burn rate, and SLO status: the thing your platform team needs to run agents at scale without being glued to a terminal
- Reasoning trace visualization: making the agent's decision chain legible to non-engineers, so product managers and business users can understand what happened when something goes wrong
- Intervention UX: pause, resume, cancel, and parameter override: designed for the specific latency constraints and cognitive load of operator workflows
- Trust signal design: confidence indicators, source attribution, and uncertainty communication: the UX patterns that build user trust rather than erode it
- Center of Excellence charter with new role definitions: AgentOps Engineer, Agent SRE, Prompt Engineer, and how they interact with existing platform and product engineering teams
- Training curriculum and certification paths are required because 'we'll figure it out as we go' produces inconsistent operational quality and high knowledge concentration risk
- Change management: the resistance to agent adoption is real, and it comes from engineers who are worried about their jobs, operators who don't trust the outputs, and executives who don't understand the risks
- Maturity model for continuous improvement: quarterly assessment cycles that measure progress against the 8-dimension framework and identify the next investment priorities
- 24/7 monitoring, incident management, and continuous optimization with SLA-backed support (99.5% platform availability)
- On-call coverage for agent-specific incidents: the failure modes are different enough from standard application incidents that you need specialists who've seen them before
- Monthly operational reviews with actionable FinOps reporting: not just what was spent, but why, and what the optimization opportunities are
- Quarterly maturity assessment and improvement plan, ensuring the platform doesn't drift back toward the operational debt it was built to replace
1. Eleven modules, composable: start with M1 (strategy) and M3 (productionization) if you're stuck between PoC and production
2. Each module produces working engineering artifacts, not just documentation
3. M4 (reliability) and M7 (FinOps) share infrastructure and are most effective when deployed together
4. M5 (HITL) and M6 (security) are the governance layer, essential for any deployment touching regulated decisions or sensitive data
5. The dependency graph flows from strategy through production through governance toward scale
Reference Architecture
Hyperscaler platforms, open standards, and the patterns we actually use in production
I've spent time with all four major hyperscaler platforms, and here's my honest assessment: none of them is dominant in all dimensions, and the choice usually comes down to where your data gravity already is. If your entire data estate lives in Databricks Delta Lake, fighting that to deploy on AWS is a tax you'll pay forever. If your org runs on M365 and Entra ID, Azure's 1,400+ connector ecosystem is genuinely compelling. Start from the constraint, not the features list.
Hyperscaler platform comparison
Each platform has a genuine differentiator, and each has a genuine blind spot. The table below is the version I'd actually put in a design doc.
| Capability | AWS AgentCore | Azure AI Foundry | GCP Vertex AI | Databricks Mosaic |
|---|---|---|---|---|
| Agent Runtime | MicroVM isolation, 8hr sessions | Cosmos DB state | Sessions + Memory Bank | Model Serving + Agent Bricks |
| Orchestration | Strands SDK, Bedrock Agents | Semantic Kernel, AutoGen | ADK (7M+ downloads), Genkit | ChatAgent, LangChain |
| Guardrails | Guardrails + Cedar Policy | Content Safety + Prompt Shields | Model Armor + SCC | AI Gateway (PII, content) |
| Evaluation | 13 built-in evaluators | AI Eval SDK + Red Teaming | User Simulator + Eval Service | MLflow judges, ALHF |
| Low-Code | Amazon Q (40+ connectors) | Copilot Studio (1,400+) | Agent Builder Console | AI Playground |
AWS leads in runtime isolation and deterministic policy enforcement: Cedar Policy evaluated outside the LLM reasoning loop is the right call, and MicroVM sandboxing is genuinely hard to replicate.
Azure leads in enterprise integration breadth: if your stack is already M365 + Entra ID + Dynamics, the 1,400+ Copilot Studio connectors are a legitimate accelerator, not lock-in.
GCP leads in open-source pedigree: ADK and the A2A protocol it seeded at the Linux Foundation reflect the "build the ecosystem first" strategy Google has used well before.
Databricks leads in data gravity: if your features and training data live in Unity Catalog Delta Lake, running inference on the same platform eliminates an entire class of data movement bugs. Pick based on where your data already lives, not on the feature matrix.
Open standards and protocol adoption
Remember when every team at a mid-size company had their own bespoke REST adapter for each internal service? You had the Salesforce adapter team, the SAP adapter team, and the "miscellaneous glue code" team that everyone quietly hated.
We spent a decade learning that lesson with service-oriented architecture. MCP and A2A exist because we're not going to learn it again. Build on standards from day one, or you'll be the person who has to migrate fifteen custom adapters when your third AI vendor gets acquired.
Model Context Protocol (MCP)
- Created by Anthropic, open specification, already the de facto standard
- How agents discover and invoke tools without custom adapters
- Adopted by SAP, Snowflake, AWS, Azure, GCP, critical mass achieved
- Think of it as OpenAPI for agents: one spec, universal tooling
Agent2Agent (A2A)
- Created by Google, now Linux Foundation governed
- How agents delegate tasks and collaborate across platform boundaries
- 100+ companies including AWS, Microsoft, SAP, Salesforce
- Without this, every multi-cloud agent system becomes a custom integration nightmare
OpenTelemetry (OTEL)
- CNCF standard; it won the observability format war, so use it
- GenAI semantic conventions now cover LLM calls and agent spans natively
- Supported by all hyperscalers + MLflow, LangSmith, Arize, Honeycomb
- If you build a proprietary trace format, you'll regret it within 6 months
Apache Iceberg
- Open table format, the Parquet of the data lake generation
- Supported across AWS, Azure, GCP, Databricks, Snowflake simultaneously
- Agents querying data through Iceberg work regardless of which compute they run on
- Delta Sharing for cross-platform data federation without moving bytes
Architecture pattern templates
These three patterns emerged from squinting at every agent deployment I've seen work at scale.
Pattern 1: Enterprise transaction agent
- Queries CRM, ERP, ITSM; executes transactions with idempotency keys
- Supervisor-collaborator topology: one planner, multiple specialist executors
- PII guardrails at ingress AND egress
- HITL escalation for anything with financial impact > $X or regulatory surface
Pattern 2: Data analytics agent
- Text-to-SQL with query validation before execution
- Calculations, visualizations, executive summaries
- Unity Catalog or Snowflake RBAC: column-level access control, not just table-level
- Batch inference for large datasets, i.e. don't use streaming inference where you don't need it
Pattern 3: Cross-cloud document processing pipeline
- Ingest via email, upload, or SoR events: async from the start
- Extraction → Validation → Action pipeline with checkpoint at each stage boundary
- A2A protocol for cross-cloud agent coordination, i.e. no custom RPC or XML
- OTEL tracing unified across all clouds
Nobody's running a single-provider shop anymore. Build your routing layer with that assumption baked in from the start. The cost of retrofitting model routing is roughly proportional to how much you regret not doing it initially.
Engineering Best Practices
The patterns that actually matter in production
Reliability patterns
Every one of these patterns has an origin story that involves a production incident. Let me save you some pagers and incident reports.
- Every agent operation must be safely retryable; this is non-negotiable
- Idempotency keys for all tool calls with side effects (writes, sends, transactions)
- Deduplication at the orchestration layer, not just at the tool level
- If you can't replay it safely, you can't retry it safely
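Idempotency keys are simpler than they sound. A minimal sketch, assuming a key derived from workflow ID, step name, and payload, with an in-memory dict standing in for what should be a durable deduplication store with TTLs:

```python
import hashlib
import json

_results: dict[str, object] = {}  # stand-in for a durable dedup store

def idempotency_key(workflow_id: str, step: str, payload: dict) -> str:
    # Canonical JSON so the same logical payload always yields the same key
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(f"{workflow_id}:{step}:{body}".encode()).hexdigest()

def call_tool_once(workflow_id: str, step: str, payload: dict, tool) -> object:
    """Retries with identical inputs return the cached result
    instead of re-executing the side effect (send, write, transaction)."""
    key = idempotency_key(workflow_id, step, payload)
    if key in _results:
        return _results[key]
    result = tool(payload)
    _results[key] = result
    return result
```

The important design choice is that deduplication lives at the orchestration layer, exactly as the bullet above says: the tool itself never has to know whether it's being retried.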
- Timeouts with exponential backoff for all external calls
- Circuit breakers trip after N consecutive failures, half-open probes before reset
- Backpressure signals prevent queue saturation from cascading into OOM
- I've seen agents burn $2-3K in tokens in 4 hours because a circuit breaker was missing
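Here is what the missing circuit breaker looks like when it exists. A sketch of the trip/half-open/reset cycle described above; the thresholds are illustrative and should be tuned per dependency:

```python
import time

class CircuitBreaker:
    """Trips open after N consecutive failures, then half-opens
    after a cooldown to let a single probe through before resetting."""

    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.max_failures:
            return True  # closed: normal operation
        # open: only allow a half-open probe once the cooldown elapses
        return time.monotonic() - self.opened_at >= self.reset_after_s

    def on_success(self) -> None:
        self.failures = 0  # probe succeeded: close the circuit

    def on_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
```

Wrap every external call in `allow()` / `on_success()` / `on_failure()` and a runaway retry loop stops costing tokens the moment the dependency goes down.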
- Model fallback chains: primary → secondary → cached response
- Reduce scope rather than fail entirely, a partial answer beats a 503
- Pre-define degraded-mode behavior per agent before you deploy, not after an incident
- Feature flags let you disable expensive capabilities under load without a deploy
- Isolate agents into failure domains, one agent's crash cannot cascade to the fleet
- Per-tenant and per-workflow isolation: noisy neighbor protection is table stakes
- The blast radius question to ask before every design: 'if this breaks completely, what else breaks?'
- MicroVM isolation (AWS AgentCore approach) is the right answer for untrusted agent code
I keep seeing teams invest months fine-tuning their prompts and picking the right model, then deploy with no retry logic, no circuit breakers, and no meaningful observability. The model can be swapped out in an afternoon but bad infrastructure architecture cannot.
Workflow and state patterns
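The core pattern here is durable execution: persist state at each phase boundary so a crashed workflow resumes from the last checkpoint instead of restarting from scratch. A minimal sketch, assuming a local JSON file as the checkpoint store and caller-supplied `(name, fn)` phases; a production system would use a durable workflow engine or database instead:

```python
import json
import os

def run_workflow(workflow_id: str, phases: list, store_dir: str = ".") -> dict:
    """Run phases in order, checkpointing after each one.

    Completed phases are skipped on re-run, so a crash mid-workflow
    resumes from the last checkpoint rather than re-executing side effects."""
    path = os.path.join(store_dir, f"{workflow_id}.ckpt.json")
    state = {"completed": [], "data": {}}
    if os.path.exists(path):  # crash recovery: load the last checkpoint
        with open(path) as f:
            state = json.load(f)
    for name, fn in phases:
        if name in state["completed"]:
            continue  # already done before the crash
        state["data"][name] = fn(state["data"])
        state["completed"].append(name)
        with open(path, "w") as f:  # persist after every phase boundary
            json.dump(state, f)
    return state
```

Each phase receives the accumulated data from earlier phases, which is what makes resumption meaningful: the recovered run picks up not just position but context.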
Deployment and release patterns
Netflix runs thousands of concurrent A/B experiments across their recommendation systems. They didn't get there by deploying features to 100% of users and hoping for the best. The same graduated rollout discipline applies to agents, maybe more so, because a bad prompt change can produce subtly wrong outputs for weeks before you detect it through quality metrics alone.
- Semantic versioning for agent, prompt, and tool changes
- Immutable deployment artifacts: no 'edit the prompt in the console' in production
- Prompt versioning with diff tracking: you need to know exactly what changed between v1.3 and v1.4
- Route 1-5% of traffic to new agent version; watch quality metrics
- Shadow runs: new version processes real requests in parallel without surfacing results
- Staged rollout: canary → 10% → 50% → 100% with explicit quality gates at each step
- Full trace replay from any production execution
- Deterministic replay for regression testing: fix a bug, prove you fixed it, prevent it forever
- Seed-based reproducibility where possible; document exactly where non-determinism is intentional vs accidental
Observability standards
- OpenTelemetry GenAI semantic conventions
- Spans for: LLM call, tool invocation, decision point, HITL checkpoint, cost accrual
- Custom attributes: token count, model version, confidence score, cost-per-call
- Trace correlation across multi-agent workflows
- Agent success rate: end-to-end task completion, not just 'did it return a 200'
- Latency p50/p95/p99 per agent step type, because LLM calls and tool calls have very different distributions
- Tool call reliability and error classification: transient vs permanent failures are handled differently
- Cost per successful agent execution: the SLO nobody writes down until they get the AWS bill
Here's what a minimal but complete OTEL trace configuration looks like for an agent workflow. This is the baseline I'd expect to see in any production deployment:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Wire up OTEL once at startup — every agent run inherits this context
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agentops.workflow")

def run_agent_step(agent_id: str, step_type: str, input_data: dict):
    with tracer.start_as_current_span(f"agent.{step_type}") as span:
        span.set_attribute("agent.id", agent_id)
        span.set_attribute("agent.step_type", step_type)
        span.set_attribute("agent.model", "claude-3-7-sonnet")
        # Record cost BEFORE the call — you need this even if the call fails
        span.set_attribute("agent.input_tokens_estimate", estimate_tokens(input_data))
        try:
            result = execute_step(agent_id, step_type, input_data)
            # Quality signal — not just latency
            span.set_attribute("agent.confidence_score", result.confidence)
            span.set_attribute("agent.output_tokens", result.token_usage.output)
            span.set_attribute("agent.cost_usd", result.cost)
            span.set_attribute("agent.step_success", True)
            return result
        except AgentToolError as e:
            # Tool failures and LLM failures are different SLIs
            span.set_attribute("agent.step_success", False)
            span.set_attribute("agent.error_type", "tool_failure")
            span.set_attribute("agent.error_class", type(e).__name__)
            span.record_exception(e)
            raise
        except LLMRateLimitError as e:
            # Rate limit = retriable; record separately for SLO math
            span.set_attribute("agent.error_type", "rate_limit")
            span.set_attribute("agent.retriable", True)
            span.record_exception(e)
            raise
```

(`estimate_tokens`, `execute_step`, `AgentToolError`, and `LLMRateLimitError` are application-specific pieces you supply; the OTEL wiring is the part that carries over.)

Quality is the #1 production blocker, ahead of latency and cost (cost is now a smaller concern as model prices continue falling). Quality problems are invisible without the right instrumentation. Your trace schema needs confidence scores and task completion signals built in from day one.
Human-in-the-Loop Operating Patterns
Graduated autonomy, approval checkpoints, escalation paths
Andrej Karpathy described Cursor AI's progression as an "autonomy slider": Tab completion → Cmd+K (edit on selection) → Cmd+L (chat with context) → Cmd+I (full agent mode).
Notice what he's describing: a graduated series of partial autonomy levels where each step hands off a bit more control. Rather than chasing full automation dreams, advocate for graduated levels of AI assistance that match what the system has actually demonstrated it can do reliably.
Risk-tiered autonomy model
Think of this like Tesla's Autopilot levels.
Level 2-3 automation, where the human remains responsible and in position to take over, is dramatically more practical at scale than jumping straight to Level 5.
Level 5 sounds better in a press release. Level 2-3 is what actually ships and stays shipped.
The graduated autonomy model assigns each agent action to a tier based on business impact, reversibility, and demonstrated confidence.
Critically: agents earn autonomy upgrades through measured performance.
| Tier | Mode | Human Role | Example |
|---|---|---|---|
| 1 | Full Auto | Post-hoc audit only | FAQ responses, data lookups, status checks |
| 2 | Supervised | Spot-check sampling (10–20%) | Email drafts, report generation, internal updates |
| 3 | Assisted | Pre-approval required | Customer communications, small transactions, schedule changes |
| 4 | Manual | Human executes with agent support | Large financial transactions, compliance decisions, patient care |
The HITL model is a deliberate product decision about where automation adds value and where human judgment remains the product.
Approval checkpoints and confidence thresholds
You'd let a brilliant intern draft the email. You would not let them send it without review if it's going to a regulator.
Approval checkpoints trigger on three signals: confidence score (below calibrated threshold → review), risk classification (high-impact actions always require approval, regardless of confidence), and policy rules (compliance-mandated gates that are non-negotiable).
Most teams start with thresholds that are too permissive, get burned once, and then overcorrect to requiring review on everything. The right answer is calibrated per agent and updated quarterly as you accumulate performance data.
- Thresholds calibrated per agent type and action class
- Calibration data: compare predicted confidence against actual outcome over 1,000+ runs
- Low-confidence flag triggers review; outcome fed back to calibration
- Overridden decisions are labeled training data
- High-risk actions require approval regardless of confidence score — no exceptions
- Risk classification runs outside the LLM loop
- Cedar Policy or Rego for action-level enforcement: if policy says no, the answer is no
- Compliance-mandated gates: EU AI Act, SOX, HIPAA each have their own non-negotiable checkpoints
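The three trigger signals combine into a single gate. A minimal sketch; the policy set, risk classes, and 0.9 threshold are illustrative placeholders, and in a real deployment the policy check runs outside the LLM loop via an engine like Cedar or Rego rather than in Python:

```python
# Compliance-mandated action classes that always require human review.
# Illustrative names — the real set comes from your policy-as-code repo.
POLICY_GATES = {"regulated_communication", "financial_transaction"}

def needs_human_approval(action: str, risk: str, confidence: float,
                         threshold: float = 0.9) -> bool:
    """Approval checkpoint: policy rules, then risk class, then confidence."""
    if action in POLICY_GATES:
        return True                # compliance-mandated: non-negotiable
    if risk == "high":
        return True                # high impact: approval regardless of confidence
    return confidence < threshold  # calibrated per agent and action class
```

The ordering encodes the principle from the bullets above: confidence only gets a vote after policy and risk have had theirs.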
Escalation paths
Design escalation paths before you need them. The SLA-driven auto-escalation is the part most teams forget: if L1 hasn't responded in 30 minutes, the system escalates automatically, not when someone notices the queue is backed up.
Escalation paths with SLA-driven auto-escalation — silence is not approval
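A sketch of that auto-escalation logic, assuming cumulative per-tier SLA budgets; the 30/60/120/240-minute values are illustrative, not a recommendation:

```python
# Per-tier SLA budgets in minutes (illustrative values)
SLA_MINUTES = {"L1": 30, "L2": 60, "L3": 120, "L4": 240}
LEVELS = ["L1", "L2", "L3", "L4"]

def current_level(minutes_waiting: float) -> str:
    """Which escalation tier owns a pending review right now.

    Once a tier's SLA budget is exhausted, the item moves up
    automatically; silence is never treated as approval."""
    elapsed = minutes_waiting
    for level in LEVELS:
        if elapsed < SLA_MINUTES[level]:
            return level
        elapsed -= SLA_MINUTES[level]
    return LEVELS[-1]  # all budgets exhausted: stays at the top tier
```

Running this on a scheduler against the pending-review queue is the piece most teams forget to build; the function itself is trivial, the discipline of wiring it up is not.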
How HITL evolves over time
HITL is smart training wheels, the kind that know when to come off.
The maturity path moves from 100% human review → statistical sampling → exception-only → full automation with audit.
Each transition requires crossing data-driven thresholds, not a calendar date or a management mandate. "We've processed 10,000 runs with 99.2% accuracy and zero compliance violations" is an example threshold.
The thing most teams miss: HITL isn't just a safety mechanism, it's your highest-quality training data pipeline.
Every time a human reviews, overrides, or corrects an agent decision, that's labeled ground truth.
Treat it as such. Log it, store it, use it to calibrate thresholds, and eventually use it to fine-tune. The teams that instrument their HITL workflows well end up with dramatically better agents in year two.
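Treating reviews as labeled data can be as simple as logging a structured record per decision. A sketch under that assumption; the record schema and helper names are illustrative:

```python
import time

def log_review(log: list, agent_id: str, prediction: str,
               confidence: float, human_decision: str) -> dict:
    """Append one human review as a labeled training example."""
    record = {
        "ts": time.time(),
        "agent_id": agent_id,
        "prediction": prediction,
        "confidence": confidence,
        "label": human_decision,                   # ground truth
        "overridden": prediction != human_decision,
    }
    log.append(record)
    return record

def override_rate(log: list) -> float:
    """Quarterly input to threshold calibration: how often humans disagree."""
    return sum(r["overridden"] for r in log) / len(log) if log else 0.0
```

Even this toy version gives you the two things the year-two payoff depends on: per-decision ground truth for fine-tuning, and an override rate to feed back into confidence-threshold calibration.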