Durable Execution in Agentic Systems: Making Failures Irrelevant
Part 1 of 3. A practical guide to durable execution for production AI agents and agentic systems.
Most AI agents work in demos. Few survive in production.
A couple of years ago, I worked on an AI content platform that looked impressive in demos and was painful in production. It ran one agent per keyword, with roughly 20 steps per article and about 20 minutes and a few dollars per run. That was fine until hundreds of agents were in flight at once.
Then the usual failures showed up: timeouts, scraping errors, transient LLM errors, workers restarting mid-job. The agent did not resume. It started over, repeating expensive research, paying for the same LLM calls again, and clogging the pipeline for everyone behind that customer. Refund requests followed. The team stopped building and started firefighting.
The uncomfortable realization was simple: the agents were not broken. The architecture was.
Why This Is Part 1
I first laid out this framework at AI Council 2026 (slides PDF). This series expands on three pillars, starting with the one most teams skip:
| Pillar | The question |
|---|---|
| Durable execution | When the agent crashes, does it matter? |
| Durable autonomy | When the agent makes a confident wrong decision, does anyone catch it? |
| Durable statefulness | When the agent drifts off track, can it find its way back? |
This article is about the first pillar. Before memory, governance, or cleverer reasoning loops, the work has to survive failure. If an agent cannot resume where it left off, it is not production-ready yet.
Why Retries Are Not Enough
The first fix most teams reach for is a retry wrapper. That is not a bad instinct for rate limits and transient provider errors. In our case, LLM retries helped, but they also exposed the ceiling of naive retry logic. Three gaps remained:
- No crash survival: retry state lived in memory, so a worker restart meant starting over
- No coverage for non-LLM failures: scraping, search APIs, database writes, and file operations were not wrapped
- No error classification: transient and fatal failures were treated the same
A retry in memory is not durability. It is optimism with a loop around it.
Workflow state lived inside the running process. Once the process died, the agent had no record of completed steps and no way to reuse research from step 7 or a draft from step 10 instead of paying for them again.
What Durable Execution in Agentic Systems Actually Means
Durable execution is not about preventing failures. Failures are guaranteed: networks, APIs, workers, deployments, memory pressure.
Durable execution is about making those failures irrelevant.
In an agentic system, that means a long-running workflow can survive crashes, restarts, transient errors, and partial failures without losing progress or repeating expensive work unnecessarily.
There are three properties I look for:
- Workflow state must live outside the process. If state only exists in memory, the process boundary is the durability boundary. Once the process dies, the work dies with it.
- Fault tolerance must wrap every operation. Production agents do not only call LLMs. They search, scrape, write to databases, send messages, call APIs, enqueue jobs, transform files, and update external systems. Any one of those can fail.
- The system must distinguish retryable failures from fatal ones. A 429 rate limit should not be treated like a 400 bad request. A temporary network failure should not be treated like an invalid tool invocation. Durable execution is not infinite retrying. It is policy-driven recovery.
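To make the third property concrete, here is a minimal Python sketch of policy-driven recovery. Everything in it is illustrative rather than taken from a particular library: `HttpError`, `classify`, `RetryPolicy`, and `run_with_policy` are hypothetical names, and the set of retryable status codes is an assumed policy you would tune per provider.

```python
import random
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class FailureKind(Enum):
    RETRYABLE = "retryable"
    FATAL = "fatal"


class HttpError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(f"HTTP {status} {message}".strip())
        self.status = status


# Assumed policy: timeouts, rate limits, and 5xx are transient; other 4xx are not.
RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}


def classify(error: Exception) -> FailureKind:
    """Decide whether a failure is worth retrying."""
    if isinstance(error, HttpError):
        if error.status in RETRYABLE_STATUSES:
            return FailureKind.RETRYABLE
        return FailureKind.FATAL
    if isinstance(error, (TimeoutError, ConnectionError)):
        return FailureKind.RETRYABLE
    return FailureKind.FATAL  # unknown errors fail fast instead of looping


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 4
    base_delay_s: float = 0.5
    max_delay_s: float = 30.0


def run_with_policy(
    operation: Callable[[], T],
    policy: RetryPolicy = RetryPolicy(),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run any operation (LLM call, scrape, DB write) under one retry policy."""
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return operation()
        except Exception as error:
            out_of_attempts = attempt == policy.max_attempts
            if classify(error) is FailureKind.FATAL or out_of_attempts:
                raise  # fatal or exhausted: surface it to the caller
            # Exponential backoff capped at max_delay_s, with full jitter.
            cap = min(policy.max_delay_s, policy.base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

The attempt counter in this sketch still lives in memory, so on its own it only addresses classification and coverage. Surviving a worker restart requires the first property as well: state persisted outside the process.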
The Maturity Model: From Restarts to Resumability
Most agent systems move through the same three levels.
| Level | What it looks like | What happens in production |
|---|---|---|
| 0 | No fault tolerance: any failure restarts from scratch | Works in short, controlled demos; fails under concurrency and long runtimes |
| 1 | Retry harness around LLM or selected API calls | Cannot survive process crashes; partial protection; expensive work still gets repeated |
| 2 | Durable execution: persisted state, policy-driven retries, idempotent side effects | Work survives worker crashes and resumes from the last known good point |
The Patterns That Make Agents Durable
Durable execution is not one pattern. It is a set of boring, well-tested distributed systems ideas applied to agents.
- Checkpointing: persist state at meaningful boundaries so recovery starts from the last good point, not step 1.
- Idempotency: steps must survive being run twice; otherwise retries create duplicate charges, publishes, or notifications.
- Event sourcing: record what happened, not just latest state, for auditability and replay.
- Sagas and dead-letter queues: compensating actions when multi-step side effects fail partway through; failed work goes somewhere inspectable instead of blocking the pipeline.
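As a rough illustration of the first two patterns, the sketch below checkpoints each step's result to SQLite and hands every step an idempotency key. It is a simplified example under stated assumptions, not a production design: `CheckpointStore`, `run_workflow`, and the `run_id:step` key format are hypothetical, and step results are assumed to be JSON-serializable.

```python
import json
import sqlite3
from typing import Any, Callable

# A step receives the results so far plus an idempotency key for its side effects.
Step = Callable[[dict, str], Any]


class CheckpointStore:
    """Persists each step's result outside the process, keyed by run and step."""

    def __init__(self, path: str) -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "run_id TEXT, step TEXT, result TEXT, PRIMARY KEY (run_id, step))"
        )
        self.db.commit()

    def load(self, run_id: str, step: str) -> tuple[bool, Any]:
        """Return (found, result) so a saved None is distinguishable from a miss."""
        row = self.db.execute(
            "SELECT result FROM checkpoints WHERE run_id = ? AND step = ?",
            (run_id, step),
        ).fetchone()
        if row is None:
            return False, None
        return True, json.loads(row[0])

    def save(self, run_id: str, step: str, result: Any) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (run_id, step, json.dumps(result)),
        )
        self.db.commit()


def run_workflow(
    store: CheckpointStore, run_id: str, steps: list[tuple[str, Step]]
) -> dict:
    """Run steps in order, skipping any step that already has a checkpoint."""
    results: dict = {}
    for name, step in steps:
        found, saved = store.load(run_id, name)
        if found:
            results[name] = saved  # reuse paid-for work instead of repeating it
            continue
        # A crash after the step runs but before save() means it runs again on
        # resume, so side-effecting steps should pass this key to the downstream
        # system to deduplicate charges, publishes, or notifications.
        results[name] = step(results, f"{run_id}:{name}")
        store.save(run_id, name, results[name])
    return results
```

Calling `run_workflow` again with the same `run_id` after a failure skips every step that already has a checkpoint and picks up at the first one that does not. The window between a step finishing and its checkpoint being written is the reason checkpointing and idempotency have to travel together.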
Where Temporal, LangGraph, and Other Tools Fit
Durable execution is an architecture problem first. Two common approaches:
Temporal records activity results in a durable event log and replays on resume; completed work is not re-run, but orchestration code must stay deterministic. Best for long, predictable, expensive workflows where you want infrastructure-level guarantees.
LangGraph checkpoints typed state at each node to an external store and resumes from the last checkpoint. There is no replay, and you get more control over storage and recovery. It fits better when branching, cycles, or inspectable state matter.
Lighter options (Step Functions, Inngest, Prefect, and others) fit simpler pipelines. The real mistake is not picking the wrong framework. It is having no durability model at all.
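The replay approach is the less intuitive of the two, so here is a library-free Python sketch of the idea. This is not Temporal's SDK or API: `ReplayContext` and its `activity` method are hypothetical, and a plain list stands in for what would be a durable event log.

```python
from typing import Any, Callable


class ReplayContext:
    """Replays recorded activity results, and runs only activities not yet logged."""

    def __init__(self, event_log: list[dict[str, Any]]) -> None:
        self.event_log = event_log  # stand-in for durable, append-only storage
        self.cursor = 0

    def activity(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.cursor < len(self.event_log):
            # Replaying: return the recorded result without calling fn again.
            event = self.event_log[self.cursor]
            if event["name"] != name:
                raise RuntimeError(
                    f"non-deterministic replay: log has {event['name']!r}, "
                    f"code asked for {name!r}"
                )
            result = event["result"]
        else:
            # Past the end of the log: do the work once and record the outcome.
            result = fn()
            self.event_log.append({"name": name, "result": result})
        self.cursor += 1
        return result


def write_article(ctx: ReplayContext, keyword: str) -> str:
    """Orchestration code: must request the same activities in the same order."""
    notes = ctx.activity("research", lambda: f"notes on {keyword}")
    return ctx.activity("draft", lambda: f"draft based on {notes}")
```

Re-running `write_article` with a fresh `ReplayContext` over the same log returns the recorded research instead of redoing it. The name check hints at why orchestration code has to stay deterministic in this model: if the code asks for activities in a different order than the log recorded, replay can no longer be trusted.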
A Production-Readiness Checklist
Before calling an agent production-ready:
- Is workflow state persisted outside the process?
- Can it resume from the failure point, not step 1?
- Are all external operations protected, not just LLM calls?
- Are failures classified as retryable, fatal, or escalated?
- Are side effects idempotent?
- Can failed runs be inspected and reprocessed?
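The last two questions are where sagas and dead-letter queues earn their place. The following Python sketch shows one possible shape, with hypothetical names throughout (`SagaStep`, `DeadLetter`, `run_saga`) and an in-memory list standing in for a real dead-letter queue.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]  # undoes the action's side effect


@dataclass
class DeadLetter:
    """What an operator needs in order to inspect and reprocess a failed run."""

    run_id: str
    failed_step: str
    error: str
    compensated: list[str] = field(default_factory=list)


def run_saga(
    run_id: str, steps: list[SagaStep], dead_letters: list[DeadLetter]
) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: list[SagaStep] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception as error:
            entry = DeadLetter(run_id, step.name, repr(error))
            for done in reversed(completed):
                done.compensate()
                entry.compensated.append(done.name)
            # Park the failure somewhere inspectable instead of blocking the queue.
            dead_letters.append(entry)
            return False
    return True
```

This sketch assumes compensations themselves succeed. In practice they are side effects too, so they need the same retry policy and idempotency guarantees as any other step.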
Nothing Else Matters Until This Works
Agents are probabilistic at the reasoning layer, but execution cannot be hand-wavy. If a crash turns a 20-step workflow into a full restart, the agent is still in demo land. Durable execution is the foundation: autonomy, memory, and governance come after the work can survive failure.
We fixed the crashes. The agents stopped dying. Then we noticed something worse: they were completing successfully and still getting it wrong. Part 2 takes up durable autonomy; Part 3, durable statefulness.
Frequently asked questions
Why aren't retries enough for production AI agents?
Retries help with transient LLM API failures, but they do not survive process crashes, protect non-LLM operations, or distinguish retryable from fatal errors. If workflow state lives only in memory, a worker restart still means starting over from scratch.
What is durable execution in agentic systems?
Durable execution means a long-running agent workflow can survive crashes, restarts, and transient failures without losing progress or repeating expensive work unnecessarily. It requires persisted workflow state, fault tolerance around every operation, and policy-driven retry handling.
When should you choose Temporal vs LangGraph for agent durability?
Choose Temporal for long-running, predictable workflows where failures are expensive and you want strong execution guarantees from infrastructure. Choose LangGraph for complex, dynamic agent control flow where transparent, inspectable state and a lighter footprint matter more.