Interactive Architecture Labs
Don't just hear the concepts — see them move. Click a lab, press the button, and watch monoliths, microservices, queues and events behave in real time.
A hands-on lab for every day of the workshop.
Day 1
Microservices
Monolith vs Microservices — Deploy Simulator
Push a change to Payments. Watch the whole monolith go down to redeploy, while microservices touch only one box.
Who Owns What — Team Ownership Map
Each service is owned by one team. Flip to monolith mode and watch every team pile into a single box.
Scale the Hot Path
A campaign spikes video traffic. Clone the whole monolith vs. add replicas to just the Video service. Watch cost & latency.
Shared DB vs Database-per-Service
Reporting runs a heavy query. See it freeze checkout on a shared DB — and stay invisible with a per-service read model.
The Join Problem — One Query, Many Databases
A simple JOIN that takes 8ms in the monolith becomes impossible across separate databases. Try the three real fixes — API composition, a CQRS read model, and the shared-schema escape hatch.
Service Talk — REST vs gRPC & Chatty Calls
Are calls between microservices really slow? Toggle REST↔gRPC and chatty↔batched calls to feel exactly where the latency goes, and how to get it back.
Event-Driven
Topology vs Communication
Monolith-or-microservices and synchronous-or-asynchronous are NOT the same question. Click each cell of the 2×2 and see why a monolith can be async and microservices can be sync.
Synchronous vs Asynchronous
Place an order. Sync makes the user wait through a 4-second chain. Async confirms in 200ms and drains work in the background.
Queue vs Topic (Pub/Sub)
Emit OrderPlaced. In a queue one worker grabs it; in a topic it fans out to everyone. Add a WhatsApp subscriber live.
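The difference between the two delivery models can be sketched in a few lines. This is a minimal in-memory illustration, not a real broker: the `Queue` and `Topic` classes, the round-robin choice of worker, and the subscriber names are all assumptions made for the example.

```python
class Queue:
    """Point-to-point: each message is delivered to exactly one worker."""

    def __init__(self, workers):
        self.workers = workers  # list of inboxes (plain lists)
        self._next = 0

    def publish(self, message):
        # Round-robin stands in for "whichever worker grabs it first".
        worker = self.workers[self._next % len(self.workers)]
        self._next += 1
        worker.append(message)


class Topic:
    """Pub/sub: every current subscriber gets its own copy."""

    def __init__(self):
        self.subscribers = {}

    def subscribe(self, name):
        inbox = []
        self.subscribers[name] = inbox
        return inbox

    def publish(self, message):
        for inbox in self.subscribers.values():
            inbox.append(message)


worker_a, worker_b = [], []
queue = Queue([worker_a, worker_b])
queue.publish("OrderPlaced#1")           # only one worker receives it

topic = Topic()
email = topic.subscribe("email")
inventory = topic.subscribe("inventory")
topic.publish("OrderPlaced#1")           # both subscribers receive it
whatsapp = topic.subscribe("whatsapp")   # added live
topic.publish("OrderPlaced#2")           # now three subscribers receive it
```

Note that the late WhatsApp subscriber only sees messages published after it joined, which is the usual pub/sub behaviour without a replayable log.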
Life of a Message — Queue vs Pub/Sub vs Stream
Drop the same event into all three and watch what's left afterward: a queue deletes it, pub/sub forgets offline subscribers, and a stream keeps it on a replayable log with offsets and consumer groups.
Failure, Retry & Dead Letter Queue
Kill the Email service. Watch retries with backoff, messages land in the DLQ, the order stays green — then replay the DLQ.
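A rough sketch of the retry, dead-letter and replay flow described above. The function names, the three-attempt limit and the 0.5s base delay are assumptions for illustration; the `sleep` parameter is injectable so the example runs instantly.

```python
def deliver_with_retry(message, handler, max_attempts=3, base_delay=0.5,
                       sleep=lambda seconds: None):
    """Try the handler up to max_attempts times with exponential backoff.

    Returns (succeeded, delays_waited).
    """
    delays = []
    for attempt in range(max_attempts):
        try:
            handler(message)
            return True, delays
        except Exception:
            if attempt < max_attempts - 1:
                delay = base_delay * 2 ** attempt  # 0.5s, 1s, 2s, ...
                delays.append(delay)
                sleep(delay)
    return False, delays


dead_letter_queue = []


def process(message, handler):
    ok, _ = deliver_with_retry(message, handler)
    if not ok:
        dead_letter_queue.append(message)  # park it instead of losing it
    return ok


def replay_dlq(handler):
    """Re-process everything in the DLQ; failures land back in the DLQ."""
    pending = list(dead_letter_queue)
    dead_letter_queue.clear()
    return [m for m in pending if process(m, handler)]


def email_down(message):
    raise ConnectionError("email service is down")


sent = []
first_try = process("order-42 confirmation", email_down)
dlq_after_failure = list(dead_letter_queue)
replayed = replay_dlq(sent.append)  # the email service is back up
```

The order itself never depends on the email succeeding, which is why it can stay green while the message waits in the DLQ.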
Event-Driven System — Watch It Live
Fire an OrderPlaced event and watch it flow producer → broker → many consumers (Inventory, Email, Analytics, Fraud) reacting at once. Add or kill a consumer live, slow one to watch its backlog build, and switch a queue to a topic — the whole system breathing on one canvas.
The Saga Pattern — Orchestration, Choreography & Rollback
Run a multi-step Order saga across services that share no database: Reserve Inventory → Charge Payment → Arrange Shipping → Confirm. Coordinate it with a central orchestrator or with pure choreography — then fail a step and watch compensating transactions roll the whole thing back in reverse.
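The orchestrated variant with reverse-order compensation might look like the sketch below. `run_saga` and the no-op actions are hypothetical stand-ins; in a real system each action and compensation would be a call to a separate service.

```python
def run_saga(steps):
    """Run (name, action, compensate) steps in order.

    On the first failure, run the compensations of all completed steps
    in reverse order. Returns (succeeded, log).
    """
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed: {name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated: {done_name}")
            return False, log
        log.append(f"done: {name}")
        completed.append((name, compensate))
    return True, log


def noop():
    pass


def fail_shipping():
    raise RuntimeError("no courier available")


steps = [
    ("Reserve Inventory", noop, noop),
    ("Charge Payment", noop, noop),
    ("Arrange Shipping", fail_shipping, noop),
    ("Confirm", noop, noop),
]
ok, log = run_saga(steps)
# log ends with the payment being refunded first, then the inventory released
```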
Agentic AI
Agent Architecture — Microservices vs Monolith
Run the same multi-agent system two ways: each agent as its own API orchestrated by LangGraph, vs one LangGraph monolith in a single codebase. Compare latency, deploys, scaling and failure isolation.
Backpressure — Why a Broker Absorbs Spikes
Push traffic past capacity with no broker and watch the waiting room overflow and drop requests. Add a broker and the same spike is buffered safely — nothing lost — then add a consumer to drain it.
Where Does LangGraph Fit? — Orchestration vs Choreography
Run the same agent pipeline three ways. A LangGraph conductor that holds state and decides order; pure event choreography with no conductor; and the hybrid real systems actually ship — events between services, LangGraph orchestrating the reasoning inside one. See why event-driven doesn't delete the orchestrator, it relocates it.
Agent Orchestration — LangGraph vs Message Broker
The same research-assistant workflow — plan, three parallel researchers, synthesize, critique, loop back if rejected — run two ways. As a LangGraph state machine with a shared state object and a loop-back edge, and as event-driven agents coordinated through a broker. Watch both execute and weigh control, state, resilience and scale.
Day 2
Application Scaling
Find the Bottleneck
Crank the load on QuickMove and watch which layer saturates first — app CPU, DB connections, cache or queue. Diagnose before you scale: the #1 architect skill.
Vertical vs Horizontal — Hitting the Ceiling
Scale up to a bigger box and watch cost climb to a hard ceiling; scale out with more instances and keep up — but only if the app is stateless. Two very different curves.
Stateless vs Stateful — Why Sessions Break Scaling
QuickMove stores login sessions in memory. Log in on Instance A, get routed to Instance B, and you're logged out. Move the session to Redis and any instance serves any request.
Autoscaling — Not Magic
Spike the traffic and watch instances scale out on CPU, then scale back in. Feel the two gotchas: cold-start lag, and 50 app instances still choking on one slow database.
Kubernetes — Replicas, Self-Healing & Rolling Updates
Run QuickMove's booking service on Kubernetes. Set a replica count and the Deployment holds it; kill a pod and it self-heals; ship v2 and it rolls out with zero downtime; turn on the HPA and pods scale with load.
Database Scaling
Index vs Full Table Scan
Run a query with no index and watch the scanner sweep every row; add an index and it jumps straight to the answer. Grow the table and the scan explodes while the index stays flat.
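To make the scan-versus-index gap concrete, here is a small sketch that counts rows examined by a linear scan and compares it with a lookup in a sorted index. The sorted list with binary search is a simplified stand-in for the B-tree a real database would use; the table shape and names are assumptions.

```python
import bisect


def full_scan(rows, target_id):
    """Check rows one by one; returns (row, rows_examined)."""
    for examined, row in enumerate(rows, start=1):
        if row["id"] == target_id:
            return row, examined
    return None, len(rows)


class SortedIndex:
    """Sorted keys plus row positions, searched with binary search."""

    def __init__(self, rows):
        pairs = sorted((row["id"], pos) for pos, row in enumerate(rows))
        self.keys = [key for key, _ in pairs]
        self.positions = [pos for _, pos in pairs]

    def lookup(self, rows, target_id):
        i = bisect.bisect_left(self.keys, target_id)
        if i < len(self.keys) and self.keys[i] == target_id:
            return rows[self.positions[i]]
        return None


rows = [{"id": i, "name": f"user{i}"} for i in range(100_000)]
index = SortedIndex(rows)

scanned_row, examined = full_scan(rows, 99_999)  # examines all 100,000 rows
indexed_row = index.lookup(rows, 99_999)         # about 17 comparisons
```

Doubling the table doubles the worst-case scan but adds only one more comparison to the binary search, which is the "flat" curve the lab shows.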
Connection Pooling — When App Scaling Floods the DB
Scale the app to 50 instances and each opens its own database connections — the DB blows past its connection limit and rejects everyone. Add a pool and a few hundred shared connections serve every instance.
Read Replicas & Read/Write Split
ChatSphere's feed is read-heavy. One DB chokes; add replicas and reads fan out while the primary takes writes. Then meet the trade-off: replica lag serving a stale read.
Partitioning vs Sharding — Same Rows, Different Homes
The Day-2 head-scratcher, settled. Split one table into row-ranges inside a single database (partitioning), or distribute those same rows across separate servers with a shard key (sharding). Toggle between them on identical data — and see why it's always rows, never columns, and how you can do both at once.
Sharding & the Shard Key
Split data across DB servers by shard key and watch writes distribute. Then break it: a bad key melts one hot shard while others idle, and a cross-shard join turns ugly.
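The effect of the shard key can be sketched with a simple hash-based router. The `shard_for` function, the four shards and the made-up order data (90% of orders from one country) are assumptions for illustration.

```python
import hashlib
from collections import Counter


def shard_for(key, num_shards):
    """Stable hash routing: the same key always maps to the same shard."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical data set: 9 out of 10 orders come from one country.
orders = [{"order_id": i, "country": "IN" if i % 10 else "US"}
          for i in range(1000)]

# High-cardinality key: writes spread across all shards.
by_order_id = Counter(shard_for(o["order_id"], 4) for o in orders)

# Low-cardinality, skewed key: at most two shards are used, one is hot.
by_country = Counter(shard_for(o["country"], 4) for o in orders)
```

With `country` as the key, at least 900 of the 1000 writes land on a single shard while the others sit idle, which is the hot-shard failure the lab demonstrates.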
Strong vs Eventual Consistency
A payment balance needs the latest truth; a feed like-count can lag a moment. Toggle between strong and eventual consistency and watch freshness trade against speed and availability.
Cache the Hot Path
Every feed request hammers the DB. Put a cache in front and watch the hit-ratio climb and DB load collapse — then feel the catch when stale data is served until the cache expires.
Scaling AI Systems
Vector Database Scaling
Retrieval pressure grows with every conversation. Watch an exact nearest-neighbour search slow as embeddings pile up, then scale it with approximate search, namespaces, shards and replicas.
Agentic AI — Same Pressures, New System
The bridge the AI crowd asked for. Take one agentic system and switch on each architecture idea you learned — stateless agents + Redis memory, an event-driven agent mesh, a semantic cache for repeated LLM calls, a sharded & replicated vector store, and autoscaling under load — and watch every concept land in an AI system.
Day 3
Resilience Patterns
Graceful Degradation — The Resilient Search
BookZilla queries Flight, Train and Bus providers. Set each to healthy, slow or down. A naive page hangs on the slow one and breaks on the down one; a resilient page uses a 2s timeout, fails fast to a fallback, and still shows the providers that work — marking the rest temporarily unavailable.
Retry, Backoff & Idempotency
Retry a flaky call and watch a naive retry storm pile onto a recovering service. Switch on exponential backoff, then add jitter to de-sync the herd — and see why you must NOT retry a declined card. Then an idempotency key stops a retried charge from billing twice.
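The three ideas in this lab (jittered backoff, not retrying permanent errors, and idempotency keys) can be sketched together. Everything here is illustrative: the "full jitter" formula is one common choice, and `PaymentService`, the error kinds and the key format are hypothetical.

```python
import random


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random):
    """Full jitter: delay n is uniform in [0, min(cap, base * 2**n)]."""
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


# Only transient failures are worth retrying; a declined card is final.
RETRYABLE = {"timeout", "unavailable"}


def should_retry(error_kind):
    return error_kind in RETRYABLE


class PaymentService:
    """Remembers each idempotency key so a retried charge bills only once."""

    def __init__(self):
        self.receipts = {}
        self.total_billed = 0

    def charge(self, idempotency_key, amount):
        if idempotency_key in self.receipts:
            return self.receipts[idempotency_key]  # replay the first result
        self.total_billed += amount
        receipt = {"key": idempotency_key, "amount": amount, "status": "charged"}
        self.receipts[idempotency_key] = receipt
        return receipt


delays = backoff_delays(5, rng=random.Random(42))
payments = PaymentService()
first = payments.charge("order-42", 100)
retry = payments.charge("order-42", 100)  # client retried after a timeout
```

Because each client draws a different random delay, retries from many clients stop arriving in synchronized waves.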
Circuit Breaker — Closed, Open & Half-Open
A failing provider, a live request stream, and the breaker's state machine. Without it, every call waits the full timeout then fails. With it, repeated failures trip the breaker OPEN (calls fast-fail to fallback), a cooldown ticks, then HALF-OPEN probes decide whether to close again. Watch the states flip in real time.
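The state machine can be sketched as a small class. The threshold of three failures, the 10-second cooldown and the explicit `now` argument (used instead of a real clock so the example is deterministic) are assumptions.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=10.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = None

    def call(self, func, fallback, now):
        if self.state == "OPEN":
            if now - self.opened_at < self.cooldown:
                return fallback()      # fast-fail, provider is not called
            self.state = "HALF_OPEN"   # cooldown over: allow one probe
        try:
            result = func()
        except Exception:
            self.failures += 1
            if (self.state == "HALF_OPEN"
                    or self.failures >= self.failure_threshold):
                self.state = "OPEN"
                self.opened_at = now
            return fallback()
        self.failures = 0
        self.state = "CLOSED"          # success closes the breaker
        return result


def provider_down():
    raise TimeoutError("provider timed out")


def provider_up():
    return "live result"


def fallback():
    return "cached result"


breaker = CircuitBreaker()
states = []
for t in (0, 1, 2):                    # three failures trip the breaker
    breaker.call(provider_down, fallback, now=t)
    states.append(breaker.state)

during_cooldown = breaker.call(provider_up, fallback, now=5)
after_cooldown = breaker.call(provider_up, fallback, now=12)
```

During the cooldown the healthy provider is not even called; only the half-open probe at `now=12` reaches it and closes the breaker.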
Cascading Failure & Bulkheads
One shared thread pool serves all three providers. Make Flight slow and watch blocked requests fill the pool until Train and Bus starve — a cascading failure. Flip on bulkheads (a pool per provider) and the flight failure stays contained while everything else keeps flowing.
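A deliberately simplified sketch of why the shared pool starves. It assumes the slow Flight requests never release their slot during the burst; the `Pool` class and the pool sizes are hypothetical.

```python
class Pool:
    """A fixed number of worker slots; a stuck request holds its slot."""

    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def try_acquire(self):
        if self.in_use < self.size:
            self.in_use += 1
            return True
        return False  # pool exhausted: the request cannot run


def simulate(pools, requests):
    """pools maps provider -> Pool; returns the requests that got a slot."""
    return [provider for provider in requests
            if pools[provider].try_acquire()]


# Six slow Flight requests arrive first, then one Train and one Bus request.
burst = ["flight"] * 6 + ["train", "bus"]

shared = Pool(6)
shared_result = simulate(
    {"flight": shared, "train": shared, "bus": shared}, burst)

bulkheads = {"flight": Pool(2), "train": Pool(2), "bus": Pool(2)}
bulkhead_result = simulate(bulkheads, burst)
```

Both setups have six slots in total; the only difference is whether Flight is allowed to take all of them.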
Redundancy & Failover
Run primary + backup as active-passive or active-active, with health checks watching the primary. Kill the primary and watch traffic fail over to the backup — with the real brief blip in between — then compare the two redundancy modes on cost and recovery.
Caching Strategies
Cache Write Strategies — Aside, Through, Around, Back
The four write patterns, made visual. Run a read and a write under cache-aside, write-through, write-around and write-back, and watch the data path light up — then weigh write latency, read freshness and the failure risk (write-back loses data if the cache dies; write-around serves a stale first read).
TTL, Invalidation & Consistency
PriceMart drops prices in a flash sale. Watch a cached price go stale and a customer hit the old price at checkout — then fix it with a short TTL, delete-on-write, or event-based invalidation. Live TTL countdown bars make 'caching is easy, invalidation is hard' obvious.
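The stale-price problem and two of the fixes (TTL expiry and delete-on-write) can be sketched with a tiny cache. The `TTLCache` class, the 60-second TTL, the `tv` SKU and the prices are assumptions; time is passed in explicitly so the example is deterministic.

```python
class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, expires_at)

    def get(self, key, now):
        entry = self.entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]
        self.entries.pop(key, None)  # expired or missing
        return None

    def set(self, key, value, now):
        self.entries[key] = (value, now + self.ttl)

    def delete(self, key):
        self.entries.pop(key, None)


db = {"tv": 500}
price_cache = TTLCache(ttl_seconds=60)


def read_price(sku, now):
    price = price_cache.get(sku, now)
    if price is None:                # miss: load from the database
        price = db[sku]
        price_cache.set(sku, price, now)
    return price


def write_price(sku, price, invalidate):
    db[sku] = price
    if invalidate:
        price_cache.delete(sku)      # delete-on-write


first = read_price("tv", now=0)             # 500, now cached
write_price("tv", 400, invalidate=False)    # flash sale, cache untouched
stale = read_price("tv", now=10)            # still 500: the stale read
after_ttl = read_price("tv", now=61)        # TTL expired, reloads 400
write_price("tv", 300, invalidate=True)
fresh = read_price("tv", now=62)            # 300 immediately
```

A shorter TTL only narrows the stale window; delete-on-write closes it for writes that go through this code path.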
Cache Eviction — LRU vs LFU vs FIFO
A tiny fixed-size cache and a stream of requests. Watch LRU, LFU and FIFO each evict a different victim when the cache fills — and compare their hit ratios on the same workload to see why the policy matters.
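A compact way to compare two of the policies on the same workload is sketched below. Only LRU and FIFO are shown (LFU needs frequency counters and is left out); the request stream and the capacity of 2 are made up for the example.

```python
from collections import OrderedDict


def hit_ratio(requests, capacity, policy):
    """Simulate a fixed-size cache; policy is "LRU" or "FIFO"."""
    cache = OrderedDict()  # oldest entry first
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            if policy == "LRU":
                cache.move_to_end(key)     # a hit refreshes recency
            # FIFO ignores hits: order stays by insertion time
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the entry at the front
            cache[key] = True
    return hits / len(requests)


# "A" is hot; the other keys are each requested once.
requests = ["A", "B", "A", "C", "A", "D", "A", "E", "A"]

lru = hit_ratio(requests, capacity=2, policy="LRU")    # 4 hits out of 9
fifo = hit_ratio(requests, capacity=2, policy="FIFO")  # 2 hits out of 9
```

LRU keeps the hot key because every hit moves it to the back of the line, while FIFO evicts it purely by age.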
Cache Stampede & Stale-While-Revalidate
A hot key's TTL expires and thousands of requests miss at once, stampeding the origin — the thundering herd. Watch the origin spike, then tame it with a single-flight lock, jittered TTLs, and stale-while-revalidate that serves slightly old data while one request refreshes.
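The single-flight idea can be sketched with a lock and a double-check. This is a simplified version: one global lock is used for all keys (a per-key lock would avoid blocking unrelated keys), and `slow_origin` with its 50ms delay is a hypothetical stand-in for the origin.

```python
import threading
import time


class SingleFlightCache:
    def __init__(self, loader):
        self.loader = loader
        self.values = {}
        self.lock = threading.Lock()

    def get(self, key):
        if key in self.values:           # fast path: cache hit
            return self.values[key]
        with self.lock:
            # Double-check: another thread may have loaded it while we waited.
            if key not in self.values:
                self.values[key] = self.loader(key)
            return self.values[key]


origin_calls = {"count": 0}


def slow_origin(key):
    origin_calls["count"] += 1
    time.sleep(0.05)                     # pretend the origin is slow
    return f"value-for-{key}"


hot_cache = SingleFlightCache(slow_origin)
results = []
threads = [
    threading.Thread(target=lambda: results.append(hot_cache.get("hot")))
    for _ in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Twenty concurrent misses produce a single origin call; the other nineteen requests wait briefly and reuse the loaded value.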
Edge & Chaos
CDN & the Edge
Users around the world, one origin. Without a CDN every request crosses the globe and hammers the origin. Turn the CDN on and users hit a nearby edge instead — latency drops per region and the origin is shielded. Then kill the origin and watch the edges keep serving cached content.
Chaos Engineering — Resilience Dashboard
The whole BookZilla system with live observability — latency, error rate, fallback rate, circuit state, cache hit ratio, DLQ size. Inject chaos (kill a service, add latency, drop the cache, overload CPU) and watch the resilience patterns absorb it while the metrics react. Prove resilience before real failure does.
Resilient AI Systems
Caching in an AI Agent — Every Layer
Where does caching live in an agent? At every layer: the browser, the API gateway, a semantic cache before the LLM, the tool result, and the database. Toggle a cache at each layer and watch where the request gets served, how cost and latency collapse, and why caching earlier wins. Then change the upstream data and meet caching's main danger: an agent that confidently serves a stale answer, and the TTL that bounds how long it can do so.
Fault-Tolerant Agents — Failover, Retries & Guardrails
Agents fail in AI-specific ways: the model rate-limits, a tool errors, the output hallucinates. Inject each and watch the same resilience patterns respond — retry with backoff, a circuit breaker that fails over to a backup model, a cached tool fallback, and a guardrail that catches bad output and regenerates. Same patterns as the rest of Day 3, applied to an agent.
Day 4
Secure Architecture
Broken Access Control (IDOR)
The exact healthcare breach: logged in as one patient, change the record id in the URL. Without a server-side ownership check you read someone else's report; with it you get 403. Login is not authorization.
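The server-side ownership check amounts to one extra comparison after the lookup. A minimal sketch, with a hypothetical in-memory `REPORTS` store and user names:

```python
REPORTS = {
    101: {"owner": "alice", "body": "Alice's blood test"},
    102: {"owner": "bob", "body": "Bob's MRI report"},
}


def get_report_vulnerable(current_user, report_id):
    """Checks only that the record exists: any logged-in user can read it."""
    report = REPORTS.get(report_id)
    if report is None:
        return 404, None
    return 200, report["body"]


def get_report(current_user, report_id):
    """Also checks that the record belongs to the caller."""
    report = REPORTS.get(report_id)
    if report is None:
        return 404, None
    if report["owner"] != current_user:
        return 403, None  # authenticated, but not authorized for this record
    return 200, report["body"]


# Alice is logged in and changes the id in the URL from 101 to 102.
leaked = get_report_vulnerable("alice", 102)
blocked = get_report("alice", 102)
own = get_report("alice", 101)
```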
Role-Based Access Control — Permission Matrix
Pick a role — patient, doctor, insurer, admin — and try actions like viewing others' reports or managing users. Watch a live allow/deny matrix decide each one. Each role gets only what it needs.
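An allow/deny matrix of this kind can be as simple as a mapping from role to permitted actions. The action names and the specific grants below are hypothetical, chosen only to illustrate the lookup; they are not the lab's actual matrix.

```python
# Hypothetical permission matrix: role -> set of allowed actions.
PERMISSIONS = {
    "patient": {"view_own_report"},
    "doctor": {"view_own_report", "view_patient_reports"},
    "insurer": {"view_claims"},
    "admin": {"manage_users"},
}


def is_allowed(role, action):
    # Unknown roles and unlisted actions are denied by default.
    return action in PERMISSIONS.get(role, set())


patient_reads_others = is_allowed("patient", "view_patient_reports")  # False
doctor_reads_patients = is_allowed("doctor", "view_patient_reports")  # True
doctor_manages_users = is_allowed("doctor", "manage_users")           # False
```

Denying by default means a new action is invisible to every role until someone grants it explicitly.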
JWT — Decoded & Tampered
A real header.payload.signature token, decoded into its claims. Edit role from patient to admin and the signature check fails; let it expire and it's rejected. See exactly why a JWT must be validated, not trusted.
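Why tampering breaks the token can be shown with a hand-rolled HS256 sketch built on the standard library. This is for illustration only: it skips checks a production verifier needs (such as validating the `alg` header), the secret and claims are made up, and a vetted JWT library should be used in real code.

```python
import base64
import hashlib
import hmac
import json


def b64url(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def b64url_decode(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(payload, secret):
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = (b64url(json.dumps(header).encode()) + "."
                     + b64url(json.dumps(payload).encode()))
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


def verify(token, secret, now):
    """Return the claims if signature and expiry are valid, else None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        return None
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig_b64):
        return None  # payload or header was modified after signing
    claims = json.loads(b64url_decode(payload_b64))
    if claims.get("exp", 0) <= now:
        return None  # expired
    return claims


SECRET = b"demo-secret"  # hypothetical key, never hardcode a real one
token = sign({"sub": "p-17", "role": "patient", "exp": 1000}, SECRET)

# Attacker swaps the payload but cannot recompute the signature.
head, _, sig = token.split(".")
forged_payload = b64url(
    json.dumps({"sub": "p-17", "role": "admin", "exp": 1000}).encode())
forged = ".".join([head, forged_payload, sig])

valid = verify(token, SECRET, now=500)
tampered = verify(forged, SECRET, now=500)
expired = verify(token, SECRET, now=2000)
```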
SQL Injection — Live
Type ' OR 1=1 -- into a login box. A string-built query hands the attacker every row; a parameterized query treats it as harmless text. Watch the actual SQL the database receives in both cases.
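The two query styles can be compared directly with Python's built-in `sqlite3` module. The `users` table, its contents and the function names are made up for this sketch (and real systems would store password hashes, not plaintext).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "a1"), ("bob", "b2")])


def login_unsafe(name, password):
    # String-built SQL: the input becomes part of the query itself.
    query = ("SELECT name FROM users "
             f"WHERE name = '{name}' AND password = '{password}'")
    return query, conn.execute(query).fetchall()


def login_safe(name, password):
    # Parameterized SQL: the input is bound as data, never parsed as SQL.
    query = "SELECT name FROM users WHERE name = ? AND password = ?"
    return conn.execute(query, (name, password)).fetchall()


payload = "' OR 1=1 --"
sent_sql, unsafe_rows = login_unsafe(payload, "anything")
safe_rows = login_safe(payload, "anything")
```

In the unsafe case the database receives `WHERE name = '' OR 1=1 --' AND password = 'anything'`: the `--` comments out the password check and `OR 1=1` matches every row.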
STRIDE Threat Model
Walk the appointment system through STRIDE: Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege. Click each threat to see how it attacks the system and the mitigation that stops it.
Least Privilege & Blast Radius
A credential leaks. With an over-permissive role the attacker's reach spreads across every resource; with least privilege the damage is contained to one box. Smaller privilege, smaller blast radius.
Defense in Depth
An attacker tries to reach patient data through layers — WAF, gateway auth, service authorization, network isolation, DB permissions, encryption. Toggle layers on and off and watch how far the attack gets when one fails.
Authentication vs Authorization vs RBAC
Three words people blur together, separated cleanly. Send a request through two gates — Authentication (who are you?) then Authorization (are you allowed?) — and see RBAC as the mechanism that decides the second. Watch a valid login still get denied the wrong action.
OWASP Top 10 (2025) — Explorer
All ten of the 2025 OWASP risks in one place. Click any category to see what it is, a concrete healthcare-platform example, and the mitigation — with badges for what's new, moved or renamed since 2021, and links to the hands-on lab for each.
Cloud Security
The Shared Responsibility Model
Sort each item — physical data centers, IAM permissions, network exposure, data encryption, patching, secrets — into 'cloud provider' or 'you'. The provider secures the cloud; you secure what's in it. The most-confused cloud idea, made clear.
Cloud Misconfiguration Finder
A cloud setup with switches: make the storage bucket public, open a security group to 0.0.0.0/0, grant IAM admin, hardcode a secret. Each flip lights up the path an attacker takes from the internet to your data — then fix them to go green.
Secrets — Exposed vs Managed
Watch a database password leak through frontend JavaScript, a GitHub commit, a Docker image and a log line — then move it into a managed secret store, injected at runtime and rotated. See exactly where secrets escape.
Observability
Three Pillars — One Incident, Three Lenses
A booking request is slow. See the same incident through a metric (latency spiked), then logs (booking errors), then a trace (an 8s payment span). Why you need metrics, logs AND traces — and how they hand off.
Latency Percentiles — p50 / p95 / p99
A live latency distribution. The average says 300ms and looks fine — but the p99 marker sits at 8s because of a slow tail. Add a few slow requests and watch the average barely move while p99 explodes.
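The gap between the average and the tail is easy to reproduce. The sketch below uses the nearest-rank percentile definition (one of several in common use) and a made-up sample of 98 fast requests and 2 slow ones; the numbers differ from the lab's.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # ceil(p * n / 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
latencies = [200] * 98 + [8000] * 2

mean = sum(latencies) / len(latencies)  # 356.0: looks acceptable
p50 = percentile(latencies, 50)         # 200
p95 = percentile(latencies, 95)         # 200
p99 = percentile(latencies, 99)         # 8000: the slow tail shows up here
```

Two slow requests in a hundred move the mean by about 150ms but put p99 at the full 8 seconds, which is why tail percentiles are the better alerting signal.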
Structured Logging & Correlation IDs
Plain logs from five services interleave into noise — impossible to follow one request. Switch on structured logs with a shared trace id, filter to one id, and the request's whole journey across services reconstructs itself.
Distributed Tracing — The Waterfall
A slow request crosses API Gateway → Auth → Appointment → Payment → Notification. Render the span waterfall and spot the culprit instantly. Inject latency into any service and watch the waterfall shift to point at it.
Alerting & Alert Fatigue
Tune threshold, duration and impact. A bare 'CPU > 80%' fires constantly and gets ignored; an SLO-style 'booking errors > 5% for 5 min affecting >100 users' fires only on real incidents. Watch false alarms trade off against missed incidents.
Secure & Observable AI
Security for an AI Agent — Attack Surface
An agent's whole attack surface, made concrete. Pick an attack (a sensitive-data leak through logs and cache, a prompt injection that hijacks the tools, or a poisoned RAG document whose text is treated as instructions), run it, and watch it land. Then layer on defences (input guardrails, redaction, least-privilege tools, RAG sanitisation) and see what stops each, and why least-privilege is the backstop.
Metrics, Logs & Traces for an AI Agent
One agent run through all three observability lenses. Metrics (tokens, LLM latency, failed tools, and a faithfulness eval) tell you something's wrong; correlation-id logs let you filter one request's journey out of the noise; and the trace waterfall shows which span ate the time. The headline: a hallucination returns HTTP 200 with green latency — only the eval catches it. For agents, success ≠ correctness.