Chaos Engineering — Resilience Dashboard

The whole BookZilla system, live, with observability for p95 latency, error rate, fallback rate, circuit state, cache hit ratio, and DLQ size. Inject chaos (kill a service, add latency, drop the cache, or overload the CPU) and watch the dashboard react. With resilience on, the patterns from today's labs absorb the failure; turn them off and watch the system fall over. The goal is to prove resilience before a real failure tests it for you.
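As a rough illustration of where the request-level signals come from, the sketch below derives p95 latency, error rate, fallback rate, and cache hit ratio from a batch of request records. The `Request` record, its field names, and the `snapshot` helper are assumptions made for this example, not part of BookZilla. Circuit state and DLQ size are gauges read directly from the breaker and the queue, so they are left out.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical per-request record; field names are illustrative."""
    latency_ms: float
    ok: bool         # False if the caller saw an error
    fallback: bool   # True if a fallback response was served
    cache_hit: bool  # True if the answer came from the cache


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def snapshot(requests):
    """Aggregate one window of requests into dashboard-style numbers."""
    n = len(requests)
    return {
        "p95_latency_ms": p95([r.latency_ms for r in requests]),
        "error_rate": sum(not r.ok for r in requests) / n,
        "fallback_rate": sum(r.fallback for r in requests) / n,
        "cache_hit_ratio": sum(r.cache_hit for r in requests) / n,
    }


# Made-up window of 20 requests with latencies 10, 20, ..., 200 ms.
sample = [
    Request(latency_ms=10.0 * i, ok=(i != 20), fallback=(i % 5 == 0), cache_hit=(i % 4 != 0))
    for i in range(1, 21)
]
stats = snapshot(sample)
```

A real dashboard would compute these over a sliding time window rather than a fixed list, but the ratios are the same.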

p95 latency: 120 ms
Error rate: 0.5%
Fallback rate: (not shown)
Cache hit ratio: 88%
DLQ size: (not shown)
Flight circuit: CLOSED

✓ All healthy — no chaos injected. This is the easy part; production-readiness is about the next line.

Kill the flight provider with resilience ON: the circuit goes OPEN, fallback rate jumps, error rate barely moves. Toggle resilience OFF with the same chaos and watch latency and errors spike — same failure, very different outcome.
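The behavior described above, where the circuit opens and fallbacks replace errors, can be sketched with a small circuit breaker. This is a minimal illustration under stated assumptions, not the lab's actual implementation: the `CircuitBreaker` class, its thresholds, and the `flight_provider` and `cached_flights` functions are all hypothetical.

```python
import time


class CircuitBreaker:
    """Minimal breaker: CLOSED -> OPEN after repeated failures, then HALF_OPEN probe."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = None

    def call(self, primary, fallback):
        """Return (result, used_fallback). The caller never sees the primary's exception."""
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback(), True  # fail fast: do not touch the dead provider
            self.state = "HALF_OPEN"     # cool-down elapsed: allow one probe call
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            return fallback(), True
        self.failures = 0
        self.state = "CLOSED"
        return result, False


calls = {"primary": 0}


def flight_provider():
    """Hypothetical provider call with chaos injected: it is always down."""
    calls["primary"] += 1
    raise ConnectionError("flight provider is down")


def cached_flights():
    """Hypothetical fallback: serve stale but usable data."""
    return ["BZ101 (cached)"]


breaker = CircuitBreaker(failure_threshold=3, reset_after_s=30.0)
outcomes = [breaker.call(flight_provider, cached_flights) for _ in range(5)]
```

After the third failure the breaker opens, so the last two calls skip the provider entirely. Every caller still gets an answer, which is why the fallback rate jumps while the error rate barely moves.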

What just happened

▹Chaos engineering deliberately breaks things in a controlled way — kill a service, add latency, drop the cache — to prove the system survives BEFORE real failure does it for you.
▹With the resilience patterns on (timeouts, circuit breakers, fallbacks, bulkheads, DLQ), injected chaos turns into graceful degradation: error rate stays low, the fallback rate rises, latency stays bounded, and the DLQ drains. The system bends instead of breaking.
▹Turn the patterns off and the same chaos spikes errors and latency, and the DLQ grows without bound. The observability signals (latency, error rate, fallback rate, circuit state, cache hit ratio, DLQ size) are exactly what you watch to tell which of the two is happening.
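To make the "latency stays bounded" point concrete, here is a small sketch of latency injection paired with a timeout and a fallback. The helper names (`with_injected_latency`, `call_with_timeout`, `search_flights`, `cached_flights`) and the delay and timeout values are assumptions chosen for illustration, not the dashboard's real code.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Small fixed pool so slow calls cannot consume unlimited threads.
_pool = ThreadPoolExecutor(max_workers=4)


def with_injected_latency(fn, delay_s):
    """Chaos wrapper: return a version of fn that sleeps before answering."""
    def slowed(*args, **kwargs):
        time.sleep(delay_s)
        return fn(*args, **kwargs)
    return slowed


def call_with_timeout(fn, timeout_s, fallback):
    """Return (result, used_fallback), waiting at most about timeout_s for fn."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s), False
    except FutureTimeout:
        # The worker thread is not cancelled; we only stop waiting for it.
        return fallback(), True


def search_flights():
    """Hypothetical healthy provider call."""
    return ["BZ101", "BZ202"]


def cached_flights():
    """Hypothetical fallback response."""
    return ["BZ101 (cached)"]


# Inject 0.5 s of latency, but only tolerate 0.05 s.
slow_search = with_injected_latency(search_flights, delay_s=0.5)

start = time.monotonic()
result, used_fallback = call_with_timeout(slow_search, timeout_s=0.05, fallback=cached_flights)
elapsed_s = time.monotonic() - start
```

With the timeout in place the caller gets the cached answer after roughly the timeout rather than the injected delay. Remove the timeout (resilience off) and every request inherits the full injected latency.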