Future Proof India · Software Architecture Workshop

Interactive Architecture Labs

Don't just hear the concepts — see them move. Click a lab, press the button, and watch monoliths, microservices, queues and events behave in real time.

A hands-on lab for every day of the workshop.

Day 1

Microservices, Event-Driven & Agentic Architecture

17 labs

Microservices

Monolith vs Microservices — Deploy Simulator

Push a change to Payments. Watch the whole monolith go down to redeploy, while microservices touch only one box.

Independent deployment

Who Owns What — Team Ownership Map

Each service owned by one team. Flip to monolith mode and watch every team pile into a single box.

Business-capability boundaries & team autonomy

Scale the Hot Path

A campaign spikes video traffic. Clone the whole monolith vs. add replicas to just the Video service. Watch cost & latency.

Independent scaling

Shared DB vs Database-per-Service

Reporting runs a heavy query. See it freeze checkout on a shared DB — and stay invisible with a per-service read model.

Data ownership & the shared-DB anti-pattern

The Join Problem — One Query, Many Databases

A simple JOIN that takes 8ms in the monolith becomes impossible across separate databases. Try the three real fixes — API composition, a CQRS read model, and the shared-schema escape hatch.

Querying across database-per-service

Service Talk — REST vs gRPC & Chatty Calls

Are calls between microservices really slow? Toggle REST↔gRPC and chatty↔batched calls to feel exactly where the latency goes, and how to get it back.

Synchronous communication cost

Event-Driven

Topology vs Communication

Monolith-or-microservices and synchronous-or-asynchronous are NOT the same question. Click each cell of the 2×2 and see why a monolith can be async and microservices can be sync.

Topology and communication are orthogonal

Synchronous vs Asynchronous

Place an order. Sync makes the user wait through a 4-second chain. Async confirms in 200ms and drains work in the background.

Sync request/response vs async messaging

Queue vs Topic (Pub/Sub)

Emit OrderPlaced. In a queue one worker grabs it; in a topic it fans out to everyone. Add a WhatsApp subscriber live.

Competing consumers vs fan-out
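
A minimal in-memory sketch of the difference, assuming a round-robin work queue and a topic that copies each message to every subscriber. The class names `WorkQueue` and `Topic` and the worker and subscriber names are illustrative, not taken from the lab or from any broker's API.

```python
from collections import deque


class WorkQueue:
    """Competing consumers: each message goes to exactly one worker."""

    def __init__(self, workers):
        self.workers = workers
        self.messages = deque()
        self._next = 0

    def publish(self, message):
        self.messages.append(message)

    def dispatch(self):
        deliveries = []
        while self.messages:
            message = self.messages.popleft()
            worker = self.workers[self._next % len(self.workers)]
            self._next += 1
            deliveries.append((worker, message))
        return deliveries


class Topic:
    """Fan-out: every current subscriber receives its own copy."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, name):
        self.subscribers.append(name)

    def publish(self, message):
        return [(name, message) for name in self.subscribers]


queue = WorkQueue(["worker-1", "worker-2"])
queue.publish("OrderPlaced#1")
queue_deliveries = queue.dispatch()  # one delivery, to one worker

topic = Topic()
for name in ("email", "inventory"):
    topic.subscribe(name)
topic.subscribe("whatsapp")  # added live, no change to the producer
topic_deliveries = topic.publish("OrderPlaced#1")  # one copy per subscriber
```

Adding the `whatsapp` subscriber changes nothing on the producer side, which is the decoupling the topic buys.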

Life of a Message — Queue vs Pub/Sub vs Stream

Drop the same event into all three and watch what's left afterward: a queue deletes it, pub/sub forgets offline subscribers, and a stream keeps it on a replayable log with offsets and consumer groups.

Message retention, offsets & replay

Failure, Retry & Dead Letter Queue

Kill the Email service. Watch retries back off and messages land in the DLQ while the order stays green, then replay the DLQ.

Retry, backoff, DLQ, idempotency
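
The retry-then-park flow can be sketched roughly as follows. This is an illustration under stated assumptions: `deliver`, `email_service`, the three-attempt limit, and the 0.5 s base delay are all hypothetical, and the delays are recorded rather than slept so the example runs instantly.

```python
def deliver(message, handler, dead_letters, max_attempts=3, base_delay=0.5):
    """Try the handler up to max_attempts times; park the message in the
    dead-letter list if every attempt fails. Returns the backoff delays
    that would be waited between attempts."""
    delays = []
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return delays
        except ConnectionError:
            if attempt < max_attempts:
                delays.append(base_delay * 2 ** (attempt - 1))
    dead_letters.append(message)
    return delays


service = {"up": False}  # hypothetical switch standing in for "kill the service"
sent = []


def email_service(message):
    if not service["up"]:
        raise ConnectionError("email service is down")
    sent.append(message)


dlq = []
delays = deliver("order-42 confirmation", email_service, dlq)
parked = list(dlq)  # the message waits in the DLQ; the order is unaffected

# Replay the DLQ once the service is healthy again.
service["up"] = True
to_replay = list(dlq)
dlq.clear()
for message in to_replay:
    deliver(message, email_service, dlq)
```

Replay can deliver a message the consumer has already partly handled, which is why the consumer should be idempotent.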

Event-Driven System — Watch It Live

Fire an OrderPlaced event and watch it flow producer → broker → many consumers (Inventory, Email, Analytics, Fraud) reacting at once. Add or kill a consumer live, slow one to watch its backlog build, and switch a queue to a topic — the whole system breathing on one canvas.

End-to-end event flow: produce → route → consume

The Saga Pattern — Orchestration, Choreography & Rollback

Run a multi-step Order saga across services that share no database: Reserve Inventory → Charge Payment → Arrange Shipping → Confirm. Coordinate it with a central orchestrator or with pure choreography — then fail a step and watch compensating transactions roll the whole thing back in reverse.

Distributed transactions: saga orchestration vs choreography + compensation
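
A rough sketch of the orchestrated variant, assuming each step is paired with a compensating action and that a failed step raises `RuntimeError`. The function `run_saga` and the no-op step bodies are placeholders for real service calls.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order. On failure, run the
    compensations of the completed steps in reverse order."""
    log = []
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except RuntimeError:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


def ok():
    pass  # placeholder for a real service call


def shipping_fails():
    raise RuntimeError("shipping unavailable")


steps = [
    ("reserve_inventory", ok, ok),
    ("charge_payment", ok, ok),
    ("arrange_shipping", shipping_fails, ok),
    ("confirm", ok, ok),
]
succeeded, log = run_saga(steps)
```

Here `log` records the two completed steps, the shipping failure, and then the payment refund followed by the inventory release, in that order.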

Agentic AI

Agent Architecture — Microservices vs Monolith

Run the same multi-agent system two ways: each agent as its own API orchestrated by LangGraph, vs one LangGraph monolith in a single codebase. Compare latency, deploys, scaling and failure isolation.

Microservices for multi-agent systems (the WHERE axis)

Backpressure — Why a Broker Absorbs Spikes

Push traffic past capacity with no broker and watch the waiting room overflow and drop requests. Add a broker and the same spike is buffered safely — nothing lost — then add a consumer to drain it.

A broker decouples producers from consumers (the HOW axis)

Where Does LangGraph Fit? — Orchestration vs Choreography

Run the same agent pipeline three ways. A LangGraph conductor that holds state and decides order; pure event choreography with no conductor; and the hybrid real systems actually ship — events between services, LangGraph orchestrating the reasoning inside one. See why event-driven doesn't delete the orchestrator, it relocates it.

Control flow: orchestration vs choreography (the WHO axis)

Agent Orchestration — LangGraph vs Message Broker

The same research-assistant workflow — plan, three parallel researchers, synthesize, critique, loop back if rejected — run two ways. As a LangGraph state machine with a shared state object and a loop-back edge, and as event-driven agents coordinated through a broker. Watch both execute and weigh control, state, resilience and scale.

Orchestrating multi-agent workflows: in-process graph vs event broker

Day 2

Scaling Applications & Databases

14 labs

Application Scaling

Find the Bottleneck

Crank the load on QuickMove and watch which layer saturates first — app CPU, DB connections, cache or queue. Diagnose before you scale: the #1 architect skill.

Diagnose the pressure point before choosing a fix

Vertical vs Horizontal — Hitting the Ceiling

Scale up to a bigger box and watch cost climb to a hard ceiling; scale out with more instances and keep up — but only if the app is stateless. Two very different curves.

Scale-up ceiling & cost vs scale-out

Stateless vs Stateful — Why Sessions Break Scaling

QuickMove stores login sessions in memory. Log in on Instance A, get routed to Instance B, and you're logged out. Move the session to Redis and any instance serves any request.

Statelessness is the prerequisite for horizontal scaling
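
The session problem can be reproduced in a few lines. This sketch assumes a plain dict shared between instances as a stand-in for Redis; `AppInstance` and the token format are hypothetical.

```python
class AppInstance:
    def __init__(self, name, session_store=None):
        self.name = name
        # A private dict models in-memory sessions; a shared dict stands in
        # for an external store such as Redis.
        self.sessions = {} if session_store is None else session_store

    def login(self, user):
        token = f"token-{user}"
        self.sessions[token] = user
        return token

    def current_user(self, token):
        return self.sessions.get(token)  # None means "logged out"


# Stateful: the session lives only on the instance that handled the login.
a, b = AppInstance("A"), AppInstance("B")
token = a.login("meera")
stateful_result = b.current_user(token)

# Stateless instances: session state lives in a store every instance can reach.
shared_store = {}
a2, b2 = AppInstance("A", shared_store), AppInstance("B", shared_store)
token = a2.login("meera")
stateless_result = b2.current_user(token)
```

With private session dicts, Instance B has never seen the token; with the shared store, any instance can serve the request.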

Autoscaling — Not Magic

Spike the traffic and watch instances scale out on CPU, then scale back in. Feel the two gotchas: cold-start lag, and 50 app instances still choking on one slow database.

Autoscaling signals, cold starts & downstream limits

Kubernetes — Replicas, Self-Healing & Rolling Updates

Run QuickMove's booking service on Kubernetes. Set a replica count and the Deployment holds it; kill a pod and it self-heals; ship v2 and it rolls out with zero downtime; turn on the HPA and pods scale with load.

Declarative replicas, self-healing & rolling updates

Database Scaling

Index vs Full Table Scan

Run a query with no index and watch the scanner sweep every row; add an index and it jumps straight to the answer. Grow the table and the scan explodes while the index stays flat.

Indexes & composite indexes vs full scans

Connection Pooling — When App Scaling Floods the DB

Scale the app to 50 instances and each opens its own database connections — the DB blows past its connection limit and rejects everyone. Add a pool and a few hundred shared connections serve every instance.

Connection pooling vs connection exhaustion

Read Replicas & Read/Write Split

ChatSphere's feed is read-heavy. One DB chokes; add replicas and reads fan out while the primary takes writes. Then meet the trade-off: replica lag serving a stale read.

Read/write split, replicas & replica lag

Partitioning vs Sharding — Same Rows, Different Homes

The Day-2 head-scratcher, settled. Split one table into row-ranges inside a single database (partitioning), or distribute those same rows across separate servers with a shard key (sharding). Toggle between them on identical data, and see why both split rows (horizontal partitioning) rather than columns, and how you can do both at once.

Partitioning (within a node) vs sharding (across nodes)

Sharding & the Shard Key

Split data across DB servers by shard key and watch writes distribute. Then break it: a bad key melts one hot shard while others idle, and a cross-shard join turns ugly.

Horizontal data partitioning & shard-key choice
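
To see why the key choice matters, here is a small sketch of hash-based routing. It assumes four shards and a SHA-256 hash of the key; `shard_for` and the sample traffic mix are made up for illustration.

```python
import hashlib
from collections import Counter


def shard_for(key, shard_count=4):
    """Map a shard key to a shard index with a stable hash."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# High-cardinality key: writes spread across all shards.
by_user_id = Counter(shard_for(user_id) for user_id in range(10_000))

# Low-cardinality, skewed key: most writes share one value, so one shard
# takes at least 90% of the traffic no matter how many shards exist.
countries = ["IN"] * 9_000 + ["US"] * 1_000
by_country = Counter(shard_for(country) for country in countries)
```

Comparing `by_user_id` with `by_country` shows the hot shard: the hash is fine in both cases, but a skewed key gives it nothing to spread.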

Strong vs Eventual Consistency

A payment balance needs the latest truth; a feed like-count can lag a moment. Toggle between strong and eventual consistency and watch freshness trade against speed and availability.

Strong vs eventual consistency trade-offs

Cache the Hot Path

Every feed request hammers the DB. Put a cache in front and watch the hit-ratio climb and DB load collapse — then feel the catch when stale data is served until the cache expires.

Caching, hit ratio & invalidation/staleness
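
Both halves of that trade-off fit in a short sketch of a read-through cache with a TTL. The names `TTLCache` and `loader` are illustrative, and the clock is passed in as `now` so the behaviour is deterministic.

```python
class TTLCache:
    def __init__(self, ttl, loader):
        self.ttl = ttl
        self.loader = loader  # called on a miss, e.g. a database read
        self.entries = {}     # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key, now):
        entry = self.entries.get(key)
        if entry is not None and now < entry[1]:
            self.hits += 1
            return entry[0]
        self.misses += 1
        value = self.loader(key)
        self.entries[key] = (value, now + self.ttl)
        return value


database = {"feed:1": "v1"}
cache = TTLCache(ttl=30, loader=lambda key: database[key])

first = cache.get("feed:1", now=0)   # miss: loaded from the database
database["feed:1"] = "v2"            # the source of truth changes
stale = cache.get("feed:1", now=10)  # hit: still the old value
fresh = cache.get("feed:1", now=31)  # expired: reloaded, new value
```

The second read is fast but wrong, and it stays wrong until the TTL runs out or something invalidates the entry.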

Scaling AI Systems

Vector Database Scaling

Retrieval pressure grows with every conversation. Watch an exact nearest-neighbour search slow as embeddings pile up, then scale it with approximate search, namespaces, shards and replicas.

Scaling the retrieval layer for agentic AI

Agentic AI — Same Pressures, New System

The bridge the AI crowd asked for. Take one agentic system and switch on each architecture idea you learned — stateless agents + Redis memory, an event-driven agent mesh, a semantic cache for repeated LLM calls, a sharded & replicated vector store, and autoscaling under load — and watch every concept land in an AI system.

Applying scaling, caching, EDA & state to AI agents

Day 3

Fault Tolerance, Resilience, Caching & CDN

13 labs

Resilience Patterns

Graceful Degradation — The Resilient Search

BookZilla queries Flight, Train and Bus providers. Set each to healthy, slow or down. A naive page hangs on the slow one and breaks on the down one; a resilient page uses a 2s timeout, fails fast to a fallback, and still shows the providers that work — marking the rest temporarily unavailable.

Timeout, fail-fast, fallback & partial results

Retry, Backoff & Idempotency

Retry a flaky call and watch a naive retry storm pile onto a recovering service. Switch on exponential backoff, then add jitter to de-sync the herd — and see why you must NOT retry a declined card. Then an idempotency key stops a retried charge from billing twice.

Backoff, jitter, when-not-to-retry & idempotency keys
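
A compact sketch of the three ideas, assuming "full jitter" (a uniform random delay up to the exponential cap) and an in-memory map of idempotency keys. `backoff_delays`, `is_retryable`, `CardDeclined`, and `PaymentService` are hypothetical names, not a real payment API.

```python
import random


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random):
    """Full jitter: delay n is uniform in [0, min(cap, base * 2**n)]."""
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


class CardDeclined(Exception):
    """A business rejection: retrying cannot change the answer."""


def is_retryable(error):
    # Only transient failures are worth retrying.
    return isinstance(error, (TimeoutError, ConnectionError))


class PaymentService:
    def __init__(self):
        self.charges = {}  # idempotency key -> amount charged

    def charge(self, idempotency_key, amount):
        # A retried request with the same key returns the first result
        # instead of charging again.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount
        return self.charges[idempotency_key]


delays = backoff_delays(5, rng=random.Random(1))
payments = PaymentService()
payments.charge("order-42", 500)
payments.charge("order-42", 500)  # client retry after a timeout
```

The jitter spreads clients out so they do not all retry at the same instant, and the key makes the retry safe on the server side.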

Circuit Breaker — Closed, Open & Half-Open

A failing provider, a live request stream, and the breaker's state machine. Without it, every call waits the full timeout then fails. With it, repeated failures trip the breaker OPEN (calls fast-fail to fallback), a cooldown ticks, then HALF-OPEN probes decide whether to close again. Watch the states flip in real time.

The circuit-breaker state machine & fast-fail
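
The state machine is small enough to sketch directly. This version assumes a consecutive-failure threshold and a fixed cooldown, and takes the current time as an argument so it can be stepped by hand; the class and its parameters are illustrative rather than any library's API.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=5.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback, now):
        if self.state == "OPEN":
            if now - self.opened_at < self.cooldown:
                return fallback()     # fast-fail: the provider is not called
            self.state = "HALF_OPEN"  # cooldown elapsed: allow one probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = now
            return fallback()
        self.state = "CLOSED"
        self.failures = 0
        return result


provider_health = {"up": False}
provider_calls = []


def provider():
    provider_calls.append(1)
    if not provider_health["up"]:
        raise TimeoutError("provider timed out")
    return "live result"


def fallback():
    return "fallback result"


breaker = CircuitBreaker(failure_threshold=3, cooldown=5.0)
states = []
for t in (0, 1, 2):  # three failures trip the breaker
    breaker.call(provider, fallback, now=t)
    states.append(breaker.state)

breaker.call(provider, fallback, now=3)  # OPEN: fast-fails to the fallback
calls_while_open = len(provider_calls)

provider_health["up"] = True
result = breaker.call(provider, fallback, now=8)  # probe succeeds, breaker closes
```

While the breaker is open the provider is never touched, so `calls_while_open` stays at three even though four requests were made.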

Cascading Failure & Bulkheads

One shared thread pool serves all three providers. Make Flight slow and watch blocked requests fill the pool until Train and Bus starve — a cascading failure. Flip on bulkheads (a pool per provider) and the flight failure stays contained while everything else keeps flowing.

Resource isolation stops one failure sinking the system

Redundancy & Failover

Run primary + backup as active-passive or active-active, with health checks watching the primary. Kill the primary and watch traffic fail over to the backup — with the real brief blip in between — then compare the two redundancy modes on cost and recovery.

Redundancy, health checks & automatic failover

Caching Strategies

Cache Write Strategies — Aside, Through, Around, Back

The four write patterns, made visual. Run a read and a write under cache-aside, write-through, write-around and write-back, and watch the data path light up — then weigh write latency, read freshness and the failure risk (write-back loses data if the cache dies; write-around serves a stale first read).

Cache-aside vs write-through/around/back trade-offs

TTL, Invalidation & Consistency

PriceMart drops prices in a flash sale. Watch a cached price go stale and a customer hit the old price at checkout — then fix it with a short TTL, delete-on-write, or event-based invalidation. Live TTL countdown bars make 'caching is easy, invalidation is hard' obvious.

TTL, invalidation strategies & cache consistency

Cache Eviction — LRU vs LFU vs FIFO

A tiny fixed-size cache and a stream of requests. Watch LRU, LFU and FIFO each evict a different victim when the cache fills — and compare their hit ratios on the same workload to see why the policy matters.

Eviction policies & their effect on hit ratio
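
A small simulation makes the policy difference measurable. This sketch covers only LRU and FIFO (LFU is left out to keep it short), and the two-slot cache and request stream are invented for illustration.

```python
from collections import OrderedDict


def hit_ratio(requests, capacity, policy):
    """Replay requests against a fixed-size cache and return the hit ratio.
    policy is "lru" or "fifo"."""
    cache = OrderedDict()  # oldest entry first
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            if policy == "lru":
                cache.move_to_end(key)  # a hit refreshes recency under LRU only
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the entry at the front
            cache[key] = True
    return hits / len(requests)


# "A" is hot; the other keys are one-off requests.
requests = ["A", "B", "A", "C", "A", "D", "A", "E", "A"]
lru_ratio = hit_ratio(requests, capacity=2, policy="lru")
fifo_ratio = hit_ratio(requests, capacity=2, policy="fifo")
```

On this workload LRU keeps the hot key resident and scores 4 hits out of 9, while FIFO keeps evicting it by age and scores 2 out of 9.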

Cache Stampede & Stale-While-Revalidate

A hot key's TTL expires and thousands of requests miss at once, stampeding the origin — the thundering herd. Watch the origin spike, then tame it with a single-flight lock, jittered TTLs, and stale-while-revalidate that serves slightly old data while one request refreshes.

Thundering herd: single-flight, jitter & stale-while-revalidate

Edge & Chaos

CDN & the Edge

Users around the world, one origin. Without a CDN every request crosses the globe and hammers the origin. Turn the CDN on and users hit a nearby edge instead — latency drops per region and the origin is shielded. Then kill the origin and watch the edges keep serving cached content.

Edge caching for latency & as a resilience layer

Chaos Engineering — Resilience Dashboard

The whole BookZilla system with live observability — latency, error rate, fallback rate, circuit state, cache hit ratio, DLQ size. Inject chaos (kill a service, add latency, drop the cache, overload CPU) and watch the resilience patterns absorb it while the metrics react. Prove resilience before real failure does.

Chaos experiments & the signals that prove resilience

Resilient AI Systems

Caching in an AI Agent — Every Layer

Where does caching live in an agent? At every layer — the browser, the API gateway, a semantic cache before the LLM, the tool result, and the database. Toggle a cache at each layer and watch where the request gets served, the cost and latency collapse, and why caching earlier wins. Then change the upstream data and meet caching's one danger: an agent that confidently serves a stale answer, and the TTL that bounds it.

Where caching lives in an agent, layer by layer

Fault-Tolerant Agents — Failover, Retries & Guardrails

Agents fail in AI-specific ways: the model rate-limits, a tool errors, the output hallucinates. Inject each and watch the same resilience patterns respond — retry with backoff, a circuit breaker that fails over to a backup model, a cached tool fallback, and a guardrail that catches bad output and regenerates. Same patterns as the rest of Day 3, applied to an agent.

Timeouts, retries, model failover, fallbacks & guardrails for agents

Day 4

Security, Cloud Security & Observability

19 labs

Secure Architecture

Broken Access Control (IDOR)

The exact healthcare breach: logged in as one patient, change the record id in the URL. Without a server-side ownership check you read someone else's report; with it you get 403. Login is not authorization.

Broken access control & ownership checks
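
The missing check is a single comparison. In this sketch the `REPORTS` table, the patient ids, and both handler functions are hypothetical; the handlers return an HTTP-style status code and a body.

```python
REPORTS = {
    101: {"owner": "patient-a", "body": "Report for patient A"},
    102: {"owner": "patient-b", "body": "Report for patient B"},
}


def get_report_vulnerable(session_user, report_id):
    # Authenticated, but never checks who owns the record.
    return 200, REPORTS[report_id]["body"]


def get_report(session_user, report_id):
    report = REPORTS.get(report_id)
    if report is None:
        return 404, None
    if report["owner"] != session_user:
        return 403, None  # logged in, but not allowed to read this record
    return 200, report["body"]


# patient-a changes the id in the URL from 101 to 102.
leaked = get_report_vulnerable("patient-a", 102)
blocked = get_report("patient-a", 102)
own = get_report("patient-a", 101)
```

The check has to run on the server for every request, because the client fully controls the id it sends.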

Role-Based Access Control — Permission Matrix

Pick a role — patient, doctor, insurer, admin — and try actions like viewing others' reports or managing users. Watch a live allow/deny matrix decide each one. Each role gets only what it needs.

RBAC roles, permissions & least privilege

JWT — Decoded & Tampered

A real header.payload.signature token, decoded into its claims. Edit role from patient to admin and the signature check fails; let it expire and it's rejected. See exactly why a JWT must be validated, not trusted.

Token claims, signatures & validation
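
To show why editing a claim breaks the token, here is a standard-library sketch of HS256 signing and verification. It is a teaching simplification: it assumes a shared secret, checks only the signature and `exp`, and skips header validation that a real JWT library would perform. The secret, claims, and helper names are made up.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(claims, secret):
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{b64url(sig)}"


def verify(token, secret, now=None):
    header, body, sig = token.split(".")
    expected = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        raise ValueError("bad signature")
    claims = json.loads(b64url_decode(body))
    if claims["exp"] <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims


secret = b"demo-secret"
token = sign({"sub": "patient-17", "role": "patient", "exp": 2_000}, secret)

# Tamper: swap in an edited payload but keep the original signature.
header, _, sig = token.split(".")
forged_body = b64url(json.dumps({"sub": "patient-17", "role": "admin", "exp": 2_000}).encode())
forged = f"{header}.{forged_body}.{sig}"


def check(candidate, now):
    try:
        return verify(candidate, secret, now)["role"]
    except ValueError as error:
        return f"rejected: {error}"


valid_result = check(token, now=1_000)
forged_result = check(forged, now=1_000)
expired_result = check(token, now=3_000)
```

The payload is only base64url-encoded, so anyone can read or edit it; the signature is what makes an edit detectable.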

SQL Injection — Live

Type ' OR 1=1 -- into a login box. A string-built query hands the attacker every row; a parameterized query treats it as harmless text. Watch the actual SQL the database receives in both cases.

Injection & parameterized queries
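
The same comparison can be run locally with Python's built-in `sqlite3` module. The `users` table and its rows are invented for the demo; the payload is the one from the lab description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("asha", "s3cret"), ("ravi", "hunter2")],
)

attacker_input = "' OR 1=1 --"

# String-built query: the input becomes part of the SQL itself. The quote
# closes the string, OR 1=1 matches every row, and -- comments out the
# password check.
unsafe_sql = (
    f"SELECT name FROM users WHERE name = '{attacker_input}' AND password = 'x'"
)
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Parameterized query: the input is bound as a value and never parsed as SQL.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ? AND password = ?",
    (attacker_input, "x"),
).fetchall()
```

Printing `unsafe_sql` shows the statement the database actually receives, which is the quickest way to see why the first query returns every user and the second returns none.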

STRIDE Threat Model

Walk the appointment system through STRIDE: Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege. Click each threat to see how it attacks the system and the mitigation that stops it.

Structured threat modeling with STRIDE

Least Privilege & Blast Radius

A credential leaks. With an over-permissive role the attacker's reach spreads across every resource; with least privilege the damage is contained to one box. Smaller privilege, smaller blast radius.

Least privilege limits the blast radius

Defense in Depth

An attacker tries to reach patient data through layers — WAF, gateway auth, service authorization, network isolation, DB permissions, encryption. Toggle layers on and off and watch how far the attack gets when one fails.

Layered defense & fail-safe defaults

Authentication vs Authorization vs RBAC

Three words people blur together, separated cleanly. Send a request through two gates — Authentication (who are you?) then Authorization (are you allowed?) — and see RBAC as the mechanism that decides the second. Watch a valid login still get denied the wrong action.

AuthN (who) vs AuthZ (what) vs RBAC (how it's decided)

OWASP Top 10 (2025) — Explorer

All ten of the 2025 OWASP risks in one place. Click any category to see what it is, a concrete healthcare-platform example, and the mitigation — with badges for what's new, moved or renamed since 2021, and links to the hands-on lab for each.

The 2025 OWASP Top 10, mapped to real fixes

Cloud Security

The Shared Responsibility Model

Sort each item — physical data centers, IAM permissions, network exposure, data encryption, patching, secrets — into 'cloud provider' or 'you'. The provider secures the cloud; you secure what's in it. The most-confused cloud idea, made clear.

Who secures what in the cloud

Cloud Misconfiguration Finder

A cloud setup with switches: make the storage bucket public, open a security group to 0.0.0.0/0, grant IAM admin, hardcode a secret. Each flip lights up the path an attacker takes from the internet to your data — then fix them to go green.

Common cloud misconfigurations & safe defaults

Secrets — Exposed vs Managed

Watch a database password leak through frontend JavaScript, a GitHub commit, a Docker image and a log line — then move it into a managed secret store, injected at runtime and rotated. See exactly where secrets escape.

Secret storage, injection & rotation

Observability

Three Pillars — One Incident, Three Lenses

A booking request is slow. See the same incident through a metric (latency spiked), then logs (booking errors), then a trace (an 8s payment span). Why you need metrics, logs AND traces — and how they hand off.

Metrics, logs & traces together

Latency Percentiles — p50 / p95 / p99

A live latency distribution. The average says 300ms and looks fine — but the p99 marker sits at 8s because of a slow tail. Add a few slow requests and watch the average barely move while p99 explodes.

Why averages hide tail latency
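
The arithmetic behind that effect, as a sketch. It assumes the nearest-rank definition of a percentile and a synthetic sample of 100 requests (98 fast, 2 slow) chosen to land near the lab's 300 ms average; the numbers are not taken from the lab itself.

```python
import statistics


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100)
    return ordered[rank - 1]


# Synthetic sample: 98 fast requests and 2 very slow ones, in milliseconds.
latencies_ms = [140] * 98 + [8000] * 2

average = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

The average comes out at 297.2 ms and p50 and p95 both sit at 140 ms, yet p99 is 8000 ms: two slow requests in a hundred are invisible in the mean and dominate the tail.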

Structured Logging & Correlation IDs

Plain logs from five services interleave into noise — impossible to follow one request. Switch on structured logs with a shared trace id, filter to one id, and the request's whole journey across services reconstructs itself.

Structured logs & correlation across services

Distributed Tracing — The Waterfall

A slow request crosses API Gateway → Auth → Appointment → Payment → Notification. Render the span waterfall and spot the culprit instantly. Inject latency into any service and watch the waterfall shift to point at it.

Spans, trace waterfalls & finding the slow service

Alerting & Alert Fatigue

Tune threshold, duration and impact. A bare 'CPU > 80%' fires constantly and gets ignored; an SLO-style 'booking errors > 5% for 5 min affecting >100 users' fires only on real incidents. Watch false alarms trade off against missed incidents.

Actionable alerts vs alert fatigue

Secure & Observable AI

Security for an AI Agent — Attack Surface

An agent's whole attack surface, made concrete. Pick an attack — a sensitive-data leak through logs and cache, a prompt injection that hijacks the tools, or a poisoned RAG document whose text is treated as code — run it, and watch it land. Then layer on defences (input guardrails, redaction, least-privilege tools, RAG sanitisation) and see what stops each, and why least-privilege is the backstop.

Attack surface & layered defence for agents

Metrics, Logs & Traces for an AI Agent

One agent run through all three observability lenses. Metrics (tokens, LLM latency, failed tools, and a faithfulness eval) tell you something's wrong; correlation-id logs let you filter one request's journey out of the noise; and the trace waterfall shows which span ate the time. The headline: a hallucination returns HTTP 200 with green latency — only the eval catches it. For agents, success ≠ correctness.

Metrics, logs & traces (and eval-as-a-metric) for agents

Day 5

Capstone — Architecture on an AI Agent

1 lab

Capstone

Contract-Risk Agent — Architecture on an Agent

The finale. Take one enterprise AI agent — a strategic-sourcing & contract-risk agent — and evolve it live through every concept in the workshop, in the order you learned them. Start at a naive monolith and switch on microservices, async/event-driven, scale-out, fault tolerance, caching, security and observability — watching the cost, latency, throughput and CPU meters move, the architecture reshape, and an enterprise-readiness scorecard fill in. Flip the green/red lens to see the difference between the agent knobs every course stops at and the architecture layer that makes it production-grade. Each stage links to its deep-dive lab.

All four days, applied to one agent