
Architecture Pattern

Runtime Rollback Pattern for Agent Tasks (2026)

A production architecture pattern for containing damage when multi-step agent workflows fail mid-execution. Covers saga-style compensation, LangGraph checkpoints, Temporal durable execution, and EU AI Act audit requirements.

Updated 2026-03-13

Problem

Multi-step agent tasks span tools, queues, and third-party APIs in ways that can partially succeed, leaving systems in inconsistent states. In 2026, agentic frameworks like LangGraph Deep Agents and Claude Code 2.1 subagents can spawn isolated child tasks that complete successfully while the parent workflow fails, creating compensation debt the system has no built-in mechanism to detect or resolve.

Use this pattern when a task spans more than one external side-effecting system (API write, database mutation, email send, payment capture) and partial completion creates real business or data integrity risk. Apply it immediately when your agent stack uses async subagents, multi-provider LLM routing, or long-running Temporal workflows where individual activity failures do not automatically roll back preceding completed steps.

Components

  • Explicit task state machine with named phases (PENDING → RUNNING → COMPENSATING → FAILED | COMPLETED)
  • Durable checkpoint store written before and after every side-effecting step (e.g., LangGraph Platform checkpointer or Temporal Workflow History)
  • Compensation action registry mapping each forward action to its inverse or containment handler
  • Failure classifier that distinguishes retryable transient failures from non-retryable semantic failures requiring compensation
  • Saga coordinator that sequences compensation handlers in reverse-chronological order on classified failure
  • Operator escalation surface exposing the full execution trace, failed stage, and compensation outcome for unrecoverable states
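The first two components — the explicit state machine and the compensation registry — can be sketched in a few lines. This is a minimal illustration, not a production implementation; the phase names come from the list above, and the `CompensationRegistry` class and its method names are hypothetical:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

class Phase(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPENSATING = "compensating"
    FAILED = "failed"
    COMPLETED = "completed"

# Legal transitions for the task state machine; anything else is a bug.
TRANSITIONS = {
    Phase.PENDING: {Phase.RUNNING},
    Phase.RUNNING: {Phase.COMPENSATING, Phase.COMPLETED},
    Phase.COMPENSATING: {Phase.FAILED, Phase.COMPLETED},
    Phase.FAILED: set(),
    Phase.COMPLETED: set(),
}

@dataclass
class CompensationRegistry:
    """Maps each forward action name to its inverse or containment handler."""
    _handlers: dict[str, Callable[[dict], None]] = field(default_factory=dict)

    def register(self, action: str, handler: Callable[[dict], None]) -> None:
        self._handlers[action] = handler

    def handler_for(self, action: str) -> Callable[[dict], None]:
        # Failing loudly here surfaces a missing handler at design/review
        # time instead of during an incident.
        try:
            return self._handlers[action]
        except KeyError:
            raise LookupError(f"no compensation handler registered for {action!r}")
```

Making the registry a required constructor argument of the task definition forces the "missing compensation handler" failure mode to appear in code review rather than in production.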

Flow

  1. Decompose the task into explicit named stages with observable state transitions. Never model a multi-step agent task as a single long opaque prompt-plus-tools sequence.
  2. Before each side-effecting step, write a durable checkpoint capturing the current state snapshot, the proposed action payload, and the registered compensation handler for that step.
  3. Execute the forward action. On success, write a post-execution checkpoint recording the external system's response and confirming the stage is complete.
  4. On failure, invoke the failure classifier: distinguish transient failures (network timeout, rate limit) that are safe to retry with exponential backoff from semantic failures (downstream system rejected the action, business rule violation) that require compensation.
  5. For semantic failures, invoke the saga coordinator to execute compensation handlers in reverse-chronological order across all previously completed stages. Compensation handlers must be idempotent — they will be retried if the compensation step itself fails.
  6. For operations that are not reversible (e.g., an email already delivered, a payment already captured), apply damage containment: log the irrecoverable state, issue a compensating notification, and flag for operator review rather than attempting a false undo.
  7. If compensation itself fails or the state is unrecoverable, escalate to the operator surface with the full execution trace: stage name, checkpoint diff, compensation attempt log, and recommended next action.
  8. Persist the complete saga history — forward actions, compensation decisions, and final resolution — to an immutable audit log for compliance and incident reconstruction.
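The flow above can be condensed into a single coordinator loop. The sketch below is a toy model under simplifying assumptions (in-memory checkpoint log, synchronous stages, a two-class failure classifier); the exception types and function names are illustrative, not from any specific framework:

```python
class TransientFailure(Exception):
    """Network timeout, rate limit — safe to retry."""

class SemanticFailure(Exception):
    """Downstream rejected the action or a business rule was violated."""

def classify(exc: Exception) -> str:
    # Step 4: retryable transient vs. non-retryable semantic failure.
    return "retry" if isinstance(exc, TransientFailure) else "compensate"

def run_saga(stages, checkpoints, max_retries=3):
    """stages: list of (name, forward_fn, compensate_fn).
    checkpoints: append-only log standing in for a durable store."""
    completed = []
    for name, forward, compensate in stages:
        checkpoints.append(("pre", name))          # step 2: pre-action checkpoint
        attempts = 0
        while True:
            try:
                result = forward()                  # step 3: forward action
                checkpoints.append(("post", name, result))
                completed.append((name, compensate, result))
                break
            except Exception as exc:
                if classify(exc) == "retry" and attempts < max_retries:
                    attempts += 1                   # production: exponential backoff here
                    continue
                # Step 5: compensate completed stages in reverse order.
                for done_name, comp, res in reversed(completed):
                    checkpoints.append(("compensate", done_name))
                    comp(res)                       # handlers must be idempotent
                return "FAILED"
    return "COMPLETED"
```

A real coordinator would persist `checkpoints` durably before each side effect and route irreversible stages to containment handlers (step 6) rather than calling an inverse.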

Tradeoffs

Implementation cost vs. incident cost

Every side-effecting step requires a corresponding compensation handler and idempotency contract. This doubles the surface area of task design. The alternative — debugging partial failures forensically in production without checkpoints — is consistently more expensive, especially once real customer data or money is involved.

True rollback vs. damage containment

Not every operation is genuinely reversible. A sent email, a published webhook, or a captured payment cannot be perfectly undone. Design compensation handlers that document exactly which operations are containable (cancel subscription), which require human follow-up (refund request submitted), and which are permanently irrecoverable. Explicitly categorizing these at design time prevents operators from discovering the distinction during an incident.
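One way to make that design-time categorization concrete is a reversibility label attached to every registered action. The enum values and the example action names below are hypothetical placeholders:

```python
from enum import Enum

class Reversibility(Enum):
    REVERSIBLE = "reversible"        # clean inverse exists (cancel subscription)
    CONTAINABLE = "containable"      # human follow-up (refund request submitted)
    IRRECOVERABLE = "irrecoverable"  # log + compensating notification only

# Declared alongside the task definition, reviewed before deployment —
# not discovered by an operator mid-incident.
ACTION_REVERSIBILITY = {
    "create_subscription": Reversibility.REVERSIBLE,
    "capture_payment":     Reversibility.CONTAINABLE,
    "send_welcome_email":  Reversibility.IRRECOVERABLE,
}

def requires_operator_review(action: str) -> bool:
    return ACTION_REVERSIBILITY[action] is not Reversibility.REVERSIBLE
```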

Checkpoint overhead vs. recovery resolution time

Writing durable checkpoints before and after every step adds latency and storage cost. With Temporal-backed workflows or LangGraph Platform's checkpoint store, this overhead is typically sub-10ms per step but compounds in high-frequency tool call chains. The correct tradeoff depends on how long manual recovery takes without checkpoints — for most production workflows it is not a close call.

Subagent compensation ownership

When LangGraph Deep Agents or Claude Code 2.1 spawn context-isolated subagents, each subagent may complete side effects the parent agent is not aware of. The parent's compensation logic cannot roll back subagent actions it never observed. Subagent tasks must register their side effects and compensation handlers with the parent saga coordinator before execution, not after failure.
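The register-before-execute contract can be enforced structurally: the only way a subagent gets to run its forward action is through a wrapper that has already recorded the compensation handler with the parent coordinator. A minimal sketch (class and function names are hypothetical):

```python
class SagaCoordinator:
    """Parent-owned registry of every side effect any subagent may perform."""
    def __init__(self):
        self.registered = []  # (subagent_id, action, compensate_fn)

    def register_side_effect(self, subagent_id, action, compensate_fn):
        self.registered.append((subagent_id, action, compensate_fn))

def spawn_subagent(coordinator, subagent_id, action, forward_fn, compensate_fn):
    # Registration happens BEFORE execution: even if the subagent crashes
    # or the parent fails, the coordinator knows how to compensate.
    coordinator.register_side_effect(subagent_id, action, compensate_fn)
    return forward_fn()
```

Registering after success would reopen the orphaned-side-effect failure mode: a subagent that completes its write and then dies leaves the parent with no record to compensate.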

Failure Modes

  • A compensation handler is missing for a side effect that matters, discovered only during a production incident rather than at design time.
  • The failure classifier retries a semantic failure (the downstream system correctly rejected the action) instead of triggering compensation, amplifying the damage with repeated invalid calls.
  • Compensation handlers themselves modify shared state without idempotency guards, causing double-compensation when the handler is retried after a transient failure.
  • Subagent-spawned side effects complete successfully but are never registered with the parent saga coordinator, leaving orphaned external state with no compensation path.
  • The audit trail records the compensation decision but not the downstream compensation execution outcome, making compliance reconstruction and incident post-mortems incomplete.
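The double-compensation failure mode above is cheap to guard against with a dedupe key around the handler. This sketch keeps the applied-key set in memory for brevity; a real guard would store it in the same durable checkpoint store so the guarantee survives process restarts:

```python
def idempotent(handler):
    """Wrap a compensation handler so replays with the same key are no-ops."""
    applied = set()  # production: durable storage, not process memory

    def wrapper(key, payload):
        if key in applied:
            return "already-applied"
        handler(payload)
        applied.add(key)      # record only after the handler succeeds
        return "applied"

    return wrapper
```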

Implementation Notes

  • In LangGraph, implement the saga coordinator as a dedicated graph node that reads the checkpointed state store and dispatches compensation subgraphs in reverse order. Use `interrupt_before` on the compensation dispatcher node when operator confirmation is required before running irreversible containment actions.
  • In Temporal workflows, model compensation as a dedicated Saga activity sequence invoked in the Workflow's catch block. Temporal's Workflow History provides durable checkpoint semantics automatically — your responsibility is registering compensation activities in the correct reverse order before any forward activity executes.
  • Register compensation handlers at task definition time, not at failure time. A compensation registry defined alongside the task state machine makes the rollback surface explicit and reviewable, rather than assembled ad-hoc during incident response.
  • Mark compensation handlers as idempotent by design: each handler must produce the same external result when called multiple times. Use idempotency keys when calling third-party APIs for compensation (e.g., Stripe refund idempotency keys) to prevent double-execution.
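Deriving the idempotency key deterministically from the saga identity means a retried compensation call always presents the same key, so the third party deduplicates it server-side. The helper name below is hypothetical, and the commented-out call only illustrates the shape of passing such a key to a third-party client:

```python
import hashlib

def compensation_idempotency_key(saga_id: str, stage: str) -> str:
    """Same saga + stage always yields the same key, so retries dedupe."""
    raw = f"{saga_id}:{stage}:compensate".encode()
    return hashlib.sha256(raw).hexdigest()[:32]

# Illustrative usage with a third-party API (call shape shown as a comment):
# stripe.Refund.create(
#     charge=charge_id,
#     idempotency_key=compensation_idempotency_key(saga_id, "capture_payment"),
# )
```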
  • Expose the full task execution timeline — stage transitions, checkpoint diffs, compensation attempts, and final resolution — in the operator review surface as a structured timeline, not a raw log dump. Operators recovering incidents need to understand the system's state at each stage boundary to make correct decisions under pressure.