
Designing for Failure: Circuit Breakers and the Art of Resilient Agents

January 16, 2026

There's a famous story about the 2003 Northeast blackout.

A software bug in an alarm system in Ohio went unnoticed. A power line sagged into a tree. The alarm that should have alerted operators stayed silent. Within three hours, 55 million people across eight states and Canada lost power. Factories shut down. Traffic lights went dark. Hospitals switched to generators. The economic cost exceeded $6 billion.

One bug. One tree. Continental chaos.

This is the nature of complex systems: they don't fail gracefully. They cascade. A small perturbation in one corner propagates through interconnections until the entire network convulses. The more connected the system, the faster the collapse.

Now consider an agentic economy.

Thousands of autonomous agents, firing millions of API calls, managing billions of dollars in transactions, operating at machine speed without human oversight. Every agent depends on other agents. Every service depends on other services. The interconnection density makes the power grid look simple.

In this environment, failure isn't a bug to be eliminated. It's a certainty to be designed around.

The Thundering Herd

Let's walk through a scenario.

A payment gateway hiccups. Maybe a network switch rebooted. Maybe a database query took 500 milliseconds instead of 50. In human terms, imperceptible. In machine terms, an eternity.

Instantly, 5,000 agents receive timeout errors. Being diligent little robots—programmed to complete their tasks, programmed to retry on failure—they all decide to retry immediately.

Now the payment gateway, which was struggling to serve normal load, receives 5,000 simultaneous retry requests. It buckles. More timeouts. More retries. The agents, still diligent, escalate their retry frequency. Within seconds, a minor glitch has mutated into an unintentional Distributed Denial of Service attack.

The payment gateway goes down completely. But it doesn't stop there.

The billing system, which depends on the payment gateway, starts timing out. Agents waiting for billing confirmations start retrying. The authentication service, overwhelmed by the spike in failed sessions, starts rejecting legitimate requests. The cascade propagates. Services that had nothing to do with the original hiccup start failing because they share infrastructure with services that did.

Ten seconds ago, a network switch rebooted. Now the entire platform is down.

This is the thundering herd problem. And it's not theoretical—it's happened to every major distributed system at scale. Netflix. Amazon. Google. Twitter. The specifics vary; the pattern is universal.

When we built Abba Baba, we knew that happy paths were easy. Any junior engineer can build systems that work when everything works. The real engineering challenge was designing for the apocalypse.

Here's how we keep the lights on when the agents go dark.

The Circuit Breaker: Knowing When to Quit

The first line of defense is borrowed from electrical engineering.

In your home, there's a circuit breaker panel. If too much current flows through a wire—more than it can safely handle—the breaker trips. Power to that circuit cuts off instantly. This prevents the wire from overheating, melting its insulation, and burning your house down.

The breaker doesn't fix the underlying problem. It just prevents the problem from becoming catastrophic. You lose power to your kitchen; you don't lose your house.

We implemented the same pattern in software.

Every external integration—payment gateways, data providers, third-party APIs—is wrapped in a circuit breaker. The breaker monitors the health of that connection in real-time, tracking success rates, error rates, and response times.

The state machine is simple:

CLOSED (Normal Operation)

Traffic flows freely. Agents interact with the service as usual. The breaker watches, counting successes and failures, but doesn't intervene.

This is the steady state. Everything works. Life is good.

OPEN (The Trip)

If error rates cross a threshold—too many failures in too short a window—the breaker trips.

Instantly, the system cuts off access to that service. Any agent trying to use it receives an immediate "fail fast" response. No waiting for timeouts. No adding load to an already struggling service. No contributing to the cascade.

The failing service gets breathing room. The agents get clear feedback: this path is blocked, try something else or wait.

This is the crucial insight: sometimes the most helpful thing you can do is nothing. Stop retrying. Stop adding load. Stop making the problem worse. Just... stop.

HALF-OPEN (The Test)

The breaker can't stay open forever. Eventually, the service might recover. How do we know when it's safe to try again?

After a cooldown period, the breaker enters a tentative state. It allows a single request through—a probe. If the probe succeeds, the breaker resets to CLOSED. Normal traffic resumes. If the probe fails, the breaker snaps back to OPEN. The cooldown restarts.

This creates an automatic recovery cycle. The system heals itself when conditions permit, without human intervention, without false optimism, without rushing back too soon.
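Stripped to its essentials, the whole state machine fits in a few dozen lines of Python. Treat this as a sketch of the pattern rather than our production code: the class name, threshold, and cooldown below are illustrative, and a real implementation also has to worry about concurrency around the probe.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED (or back to OPEN)."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold   # failures before the breaker trips
        self.cooldown_seconds = cooldown_seconds     # how long to stay OPEN before probing
        self.state = "CLOSED"
        self.failure_count = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                self.state = "HALF_OPEN"             # cooldown over: let one probe through
            else:
                raise RuntimeError("circuit open: failing fast")

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        else:
            self._on_success()
            return result

    def _on_success(self):
        # A successful call (or probe) resets the breaker to normal operation.
        self.state = "CLOSED"
        self.failure_count = 0

    def _on_failure(self):
        if self.state == "HALF_OPEN":
            self._trip()                             # probe failed: snap back open, restart cooldown
            return
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = "OPEN"
        self.opened_at = time.monotonic()
        self.failure_count = 0
```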

The Cascade Stops Here

The circuit breaker pattern ensures that a localized failure stays localized. The payment gateway can be on fire, but the product discovery service keeps running. The billing system can be overwhelmed, but agent registration still works.
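To make that isolation concrete, give every integration its own breaker. The toy services below are hypothetical stand-ins, using the CircuitBreaker sketch above; the point is that tripping the payments breaker leaves discovery's breaker untouched.

```python
# One breaker per external integration; the services below are stand-ins.
def call_payment_gateway():
    raise TimeoutError("gateway timed out")        # simulate an outage

def call_product_discovery():
    return ["widget", "gadget"]                    # healthy service

payments = CircuitBreaker(failure_threshold=3, cooldown_seconds=30.0)
discovery = CircuitBreaker(failure_threshold=3, cooldown_seconds=30.0)

for _ in range(3):
    try:
        payments.call(call_payment_gateway)        # three failures trip this breaker
    except Exception:
        pass

print(payments.state)                              # "OPEN": payments now fails fast
print(discovery.call(call_product_discovery))      # discovery is completely unaffected
```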

Failures become isolated events, not contagious diseases.

Exponential Backoff with Jitter: The Polite Retry

Circuit breakers handle sustained failures. But what about transient ones?

Sometimes a request fails for reasons that will resolve themselves: network congestion, momentary overload, a garbage collection pause. In these cases, retrying makes sense. The question is when.

If 1,000 agents experience a failure at time T, and they all retry at time T+1, they'll crush the server again. The retry storm is just as dangerous as the original failure.

Naive retry logic creates synchronized waves of traffic that amplify problems instead of solving them.

We implemented a smarter approach: exponential backoff with jitter.

Exponential Backoff

The wait time doubles with each failure.

First retry: wait 1 second. Second retry: wait 2 seconds. Third retry: wait 4 seconds. Fourth retry: wait 8 seconds.

Spacing the retries out gives the system progressively more time to recover. If the first retry doesn't work, the problem is probably more serious than a momentary blip. Give it more time. If the second retry doesn't work, give it even more time.

The exponential growth ensures that persistent problems don't result in persistent hammering.
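As a sketch, the schedule is a one-liner. The one-second base and the cap are illustrative values; most real implementations cap the wait so a long outage doesn't push retries out to absurd delays.

```python
def backoff_delay(attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (1-based): 1, 2, 4, 8, ... up to the cap."""
    return min(cap, base * 2 ** (attempt - 1))
```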

Jitter: The Secret Sauce

Exponential backoff alone isn't enough. If 1,000 agents all start their backoff at the same moment, they'll all retry at the same moments—just less frequently. The waves are stretched out, but they're still synchronized.

Jitter adds randomness.

Instead of waiting exactly 2 seconds, Agent A waits 2.3 seconds while Agent B waits 1.8 seconds. Instead of a spike, you get a smear. Instead of 1,000 requests hitting at T+2, you get requests distributed across a window from T+1.5 to T+2.5.

This simple randomization, just a few hundred milliseconds of variance, transforms retry traffic from a battering ram into a gentle stream. The server absorbs the load gradually instead of being overwhelmed in bursts.
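In code, the variance is one extra line on top of the backoff sketch above. The half-second window here matches the example; other schemes, like full jitter, randomize across the entire delay instead, but the smearing effect is the same.

```python
import random

def jittered_delay(attempt, base=1.0, cap=60.0, jitter=0.5):
    """Backoff delay plus a random offset, so synchronized agents drift apart."""
    delay = min(cap, base * 2 ** (attempt - 1))
    return max(0.0, delay + random.uniform(-jitter, jitter))  # e.g. 2s becomes 1.8s or 2.3s
```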

It's a small thing. It makes an enormous difference.

The Mathematics of Politeness

There's a deeper principle here: in distributed systems, politeness scales.

A single agent retrying aggressively has minimal impact. But multiply that behavior by thousands of agents, and aggression becomes catastrophic. Conversely, a single agent being slightly more patient has minimal benefit. But multiply that patience by thousands of agents, and you get system-wide stability.

We've tuned our retry policies to be good citizens at scale. Our agents don't just accomplish their tasks—they accomplish them in ways that preserve the health of the shared infrastructure.

Entity Locking: The Traffic Cop

Concurrency is the silent killer of financial systems.

Here's the nightmare scenario: Agent A reads a wallet balance of $100 and begins calculating a deduction. At the exact same millisecond, Agent B reads the same wallet balance—also $100—and begins calculating an addition.

Agent A finishes first: $100 - $30 = $70. Writes to database. Agent B finishes second: $100 + $20 = $120. Writes to database.

The wallet now shows $120. But wait—it should show $90. The deduction was lost. $30 vanished into the ether.

This is a race condition, and it's notoriously hard to catch in testing. Everything works fine with one user. Everything works fine with ten users. But at scale, with thousands of concurrent operations, the collisions become inevitable. And in financial systems, collisions mean corrupted ledgers, missing money, angry customers, and regulatory nightmares.

We solved this with database-backed entity locking.

Before an agent can modify a sensitive resource—a wallet, a billing record, an inventory count—it must acquire an exclusive lock on that specific entity. The lock is recorded in the database itself, making it visible across all instances of our application.

If another agent already holds the lock, the second agent waits. Not forever—there are timeouts to prevent deadlocks—but long enough for the first agent to complete its operation cleanly.
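Concretely, one way to get this guarantee is a row-level lock held for the duration of a transaction. The sketch below uses a Postgres-style SELECT ... FOR UPDATE; the wallets table, the psycopg2-style placeholders, and the five-second lock timeout are illustrative assumptions, not a description of our exact schema or locking scheme.

```python
# Serialize wallet updates with a row-level database lock (sketch).
def adjust_balance(conn, wallet_id, delta_cents):
    with conn:  # one transaction per adjustment; committing releases the lock
        cur = conn.cursor()
        cur.execute("SET LOCAL lock_timeout = '5s'")  # bounded wait: a stuck lock fails the caller
        # Lock this wallet's row; a concurrent agent blocks here until we commit.
        cur.execute(
            "SELECT balance_cents FROM wallets WHERE id = %s FOR UPDATE",
            (wallet_id,),
        )
        (balance,) = cur.fetchone()
        cur.execute(
            "UPDATE wallets SET balance_cents = %s WHERE id = %s",
            (balance + delta_cents, wallet_id),
        )
```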

The result: no matter how fast transactions fly, no matter how many agents operate concurrently, the math always adds up. Every deduction is recorded. Every addition is captured. The ledger stays consistent.

The Cost of Correctness

Entity locking has a cost: contention. When many agents want to modify the same entity, they queue up, and throughput drops.

We've designed our data model to minimize this. Locks are as granular as possible—you lock a specific wallet, not the entire billing system. Hot paths are optimized to hold locks for the shortest possible duration. And we accept that some operations will be slower in exchange for the guarantee that they'll be correct.

In financial systems, correctness beats speed. Always.

The Dead Letter Queue: Where Bad Events Go to Die (Safely)

Sometimes, no amount of retrying will fix a problem.

Maybe the data is malformed. Maybe there's a bug in the processing logic. Maybe the event references an entity that no longer exists. Maybe the validation rules changed between when the event was created and when it's being processed.

These aren't transient failures. They're permanent ones. Retrying won't help; it'll just waste resources.

In a naive system, these events have two fates, both bad:

  1. Infinite retry loops: The event keeps failing, keeps retrying, clogs the processing queue, delays legitimate work, wastes resources forever.

  2. Silent deletion: The system gives up and discards the event. The failure is invisible. Data is lost. Problems compound without anyone noticing.

We implemented a third option: the Dead Letter Queue (DLQ).

Freeze, Don't Discard

When an event exceeds its maximum retry attempts, it isn't discarded. It's frozen. The event, along with its full context—what it was trying to do, why it failed, how many times it retried, what error messages it generated—is preserved in a special holding area.
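In outline, the write path looks something like this. A plain Python list stands in for the real dead-letter store, and the retry budget is illustrative; the essential move is that the event and its failure history are kept, not thrown away.

```python
import time

MAX_ATTEMPTS = 5   # illustrative retry budget

def process_with_dlq(event, handler, dead_letters):
    """Run the handler with retries; if the event keeps failing, freeze it with full context."""
    errors = []
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return handler(event)
        except Exception as exc:
            errors.append(f"attempt {attempt}: {exc!r}")
    # Retries exhausted: preserve the event and everything we know about the failure.
    dead_letters.append({
        "event": event,
        "attempts": MAX_ATTEMPTS,
        "errors": errors,
        "frozen_at": time.time(),
    })
```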

The DLQ is a graveyard, but it's a graveyard with excellent records.

Autopsy and Resurrection

Human engineers (and our operations systems) can examine the dead letters at leisure. Why did this fail? Is there a pattern? Do all the failures involve the same merchant, the same data format, the same edge case?

The DLQ becomes a diagnostic tool. It surfaces problems that would otherwise be invisible. It provides the evidence needed to fix bugs, adjust validation rules, and improve the system.

And when the fix is deployed? The dead letters can be resurrected. We can replay them through the corrected logic, ensuring that no data is ever permanently lost.
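Continuing the sketch above, resurrection is just feeding the frozen events back through the corrected handler and keeping any that still fail:

```python
def replay_dead_letters(dead_letters, handler):
    """Re-run frozen events through the corrected logic; keep whatever still fails."""
    still_dead = []
    for letter in dead_letters:
        try:
            handler(letter["event"])
        except Exception as exc:
            letter["errors"].append(f"replay: {exc!r}")
            still_dead.append(letter)
    dead_letters[:] = still_dead   # successfully replayed events leave the queue
```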

No Event Left Behind

This philosophy—preserve everything, discard nothing—is fundamental to how we think about reliability. Storage is cheap. Lost data is expensive. Lost financial data is catastrophic.

Every failed event is a lesson. The DLQ ensures we can learn from every lesson, even the ones that happened at 3 AM on a holiday weekend.

Resilience as a Feature

We don't view these error-handling mechanisms as boring safety rails. We don't consider them the unglamorous plumbing that no one wants to work on.

We view them as essential features of an autonomous economy.

Here's why: resilience enables aggression.

When you know your systems will handle failure gracefully, you can push harder. You can run agents faster. You can process more transactions. You can experiment with riskier optimizations. You can scale to larger volumes.

Without resilience, speed is reckless. With resilience, speed is a competitive advantage.

We built a system that assumes things will break. Services will fail. Networks will partition. Databases will slow. Queues will overflow. Locks will contend. Events will corrupt.

All of these things will happen, probably today, definitely this week.

And when they do? The circuit breakers trip. The backoffs engage. The locks serialize. The dead letters accumulate. The system bends but doesn't break. The failure stays contained. The recovery starts automatically.

Your agents keep running. Your business stays online. Your customers never notice.

That's not the absence of failure. That's the mastery of it.


Next up: Meet Claudette—the AI operator that manages itself.