Lab 44
Resilient AI Systems

Fault-Tolerant Agents — Failover, Retries & Guardrails

Agents fail in AI-specific ways: the model is rate-limited, a tool call errors, the output is hallucinated. Inject each failure and watch the same resilience patterns respond: retry with backoff, a circuit breaker that fails over to a backup model, a cached tool fallback, and a guardrail that catches bad output and regenerates. These are the same patterns as the rest of Day 3, applied to an agent.
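As a rough illustration of the model-call path described above, here is a minimal Python sketch of retry with backoff, a circuit breaker, and failover to a backup model. Every name in it (`RateLimitError`, `CircuitBreaker`, `call_with_retry`, `resilient_model_call`) and every default (three attempts, a 30-second cooldown) is an assumption made for this example, not the lab's actual implementation.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised when a model is rate-limited or down."""


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; half-opens after `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: let a trial call through once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


def call_with_retry(fn, prompt, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry `fn` with exponential backoff and full jitter; re-raise the last error."""
    for attempt in range(attempts):
        try:
            return fn(prompt)
        except RateLimitError:
            if attempt == attempts - 1:
                raise
            sleep(random.uniform(0, base_delay * 2 ** attempt))


def resilient_model_call(prompt, primary, backup, breaker, **retry_kwargs):
    """Try the primary model unless its breaker is open, then fail over to the backup."""
    if breaker.allow():
        try:
            result = call_with_retry(primary, prompt, **retry_kwargs)
            breaker.record_success()
            return result
        except RateLimitError:
            breaker.record_failure()
    return call_with_retry(backup, prompt, **retry_kwargs)
```

In this sketch the breaker counts one failure per exhausted retry sequence rather than per attempt, so a short burst of rate limiting is absorbed by the retries, while a sustained outage opens the breaker and sends traffic straight to the backup.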

Pick which failures to inject, then run the request and watch the patterns keep the agent answering.
Agent status panels: Model call (idle), Tool call (idle), Guardrail (idle).
Agent trace: empty until a request is run.
AI failure | Resilience pattern (from Day 3)
Model rate-limited / down | Retry + backoff → circuit breaker → failover to backup model
Tool call errors | Retry → fallback to cached tool result
Hallucinated / invalid output | Guardrail validation → regenerate → safe fallback answer
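The tool and guardrail rows above could be sketched in Python roughly as follows. This is an illustration under stated assumptions: the function names, the JSON output format with an `answer` field, the in-memory `cache` dict, and the wording of `SAFE_FALLBACK` are all invented for the example, and the tool retry omits backoff for brevity.

```python
import json

# Assumption: the wording and shape of the safe fallback answer are illustrative.
SAFE_FALLBACK = {"answer": "I can't answer that reliably right now.", "source": "fallback"}


def call_tool_with_fallback(tool, args, cache, retries=2):
    """Retry the tool; if every attempt fails, return the cached result if one exists."""
    key = json.dumps(args, sort_keys=True)
    for _ in range(retries + 1):
        try:
            result = tool(**args)
            cache[key] = result  # refresh the cache on every success
            return result, "live"
        except Exception:
            # Sketch only: a real agent would catch narrower, tool-specific errors.
            continue
    if key in cache:
        return cache[key], "cached"
    raise RuntimeError("tool failed and no cached result is available")


def validate(raw):
    """Hypothetical guardrail: output must be a JSON object with a non-empty 'answer' string."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if isinstance(data, dict) and isinstance(data.get("answer"), str) and data["answer"].strip():
        return data
    return None


def generate_with_guardrail(generate, prompt, max_regenerations=2):
    """Validate each output; regenerate on failure, then give the safe fallback answer."""
    for _ in range(max_regenerations + 1):
        data = validate(generate(prompt))
        if data is not None:
            return data
    return dict(SAFE_FALLBACK)
```

Returning the `"live"` or `"cached"` label alongside the tool result lets the agent trace show when a stale value was served, which matters if the cached data can age.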
Safe automation — approval, missing data & idempotency

Resilience keeps the agent answering — but an enterprise agent must also act safely. Pick the risk tier and whether finance data came back, and see the approval decision.

🚫 Finance data is unavailable — blocked. Mark finance review pending and escalate. Never auto-approve on missing critical data, whatever the risk tier.
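One way to encode the rule in that message is to check for missing critical data before looking at the risk tier at all. The Python sketch below does this. The `Decision` type, the `decide` function, and the tier names are hypothetical, and the handling of tiers when finance data is present (auto-approve only for `"low"`, a human approver otherwise) is an assumption, since the lab text only specifies the missing-data case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    action: str  # "auto_approve", "human_approval", or "blocked"
    reason: str
    escalate: bool = False


def decide(risk_tier: str, finance_data: Optional[dict]) -> Decision:
    """Return the approval decision for a request."""
    # Missing critical data is checked first, so no risk tier can bypass it.
    if finance_data is None:
        return Decision(
            "blocked",
            "finance review pending: finance data unavailable",
            escalate=True,
        )
    # Assumption: only the low tier is auto-approved; all others go to a human.
    if risk_tier == "low":
        return Decision("auto_approve", "low risk with finance data present")
    return Decision("human_approval", f"{risk_tier} risk requires a human approver")
```

Putting the missing-data check ahead of the tier logic keeps the "never auto-approve" guarantee independent of how tiers are later added or renamed.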
Now trigger the approval workflow — the call times out and retries.
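A timed-out call may still have succeeded on the server, so a blind retry risks creating a second approval. A common remedy is an idempotency key that stays the same across retries. The Python sketch below assumes a hypothetical in-memory `ApprovalService` standing in for the real workflow API; the class, its `submit` method, and the simulated timeouts are invented for illustration.

```python
import uuid


class ApprovalService:
    """Hypothetical in-memory stand-in for an approval workflow API."""

    def __init__(self, timeouts_before_success=1):
        self.requests = {}  # idempotency key -> approval record
        self.created = 0
        self._timeouts_left = timeouts_before_success

    def submit(self, idempotency_key, payload):
        # Deduplicate first: a repeated key reuses the original record.
        if idempotency_key not in self.requests:
            self.created += 1
            self.requests[idempotency_key] = {"id": self.created, "payload": payload}
        # Simulate the response being lost after the work was already done.
        if self._timeouts_left > 0:
            self._timeouts_left -= 1
            raise TimeoutError("approval workflow timed out")
        return self.requests[idempotency_key]


def trigger_approval(service, payload, attempts=3):
    """Submit an approval, retrying timeouts with the same idempotency key."""
    key = str(uuid.uuid4())  # one key per logical request, reused on every retry
    for attempt in range(attempts):
        try:
            return service.submit(key, payload)
        except TimeoutError:
            if attempt == attempts - 1:
                raise
```

Generating the key once, outside the retry loop, is the important design choice here: a fresh key per attempt would defeat the deduplication.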
What just happened