Why This Matters
Your system works. Chapter 15 proved it — all components wired together, requests flowing through the pipeline, results coming back. Ship it.
Then what? A user reports that tasks take 10x longer than yesterday. Another says their patches silently vanish. A third gets an error message that says "undefined." You look at the logs and find 50,000 lines of unstructured text. Somewhere in there is the answer. Good luck.
Observability is not logging. Logging tells you what happened. Observability tells you why. A system with good observability lets you ask arbitrary questions about its behavior after the fact, without having anticipated those specific questions. It does this through three pillars: structured traces (what happened, in what order), metrics (how fast, how often, how much), and alerts (when something crosses a threshold that matters).
Throughout this course, you have been emitting structured trace events — supervisor_decompose, worker_completed, checkpoint_created. This chapter elevates those events from debugging aids into a proper observability layer. And it addresses the question you have been building toward since Chapter 1: what happens when things go wrong? Not one thing — everything. Every failure mode the system can encounter, handled systematically.
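To make the three pillars concrete, here is a minimal sketch of how one stream of events can feed all three. The event names `worker_completed` and `checkpoint_created` come from this course. Everything else (the `TraceEvent` shape, the `duration_ms` attribute, the helper names, the threshold) is an assumption made for illustration, not the course's actual implementation.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TraceEvent:
    """Pillar 1, traces: one structured record of what happened and when.

    The field layout is hypothetical; adapt it to your own events.
    """
    name: str                      # e.g. "worker_completed"
    task_id: str
    timestamp: float = field(default_factory=time.time)
    attrs: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # One JSON object per line keeps the log machine-queryable.
        return json.dumps(asdict(self), sort_keys=True)


def mean_duration_ms(events: List[TraceEvent], name: str) -> Optional[float]:
    """Pillar 2, metrics: average `duration_ms` over events with this name."""
    durations = [
        e.attrs["duration_ms"]
        for e in events
        if e.name == name and "duration_ms" in e.attrs
    ]
    return sum(durations) / len(durations) if durations else None


def check_latency_alert(
    events: List[TraceEvent], name: str, threshold_ms: float
) -> Optional[str]:
    """Pillar 3, alerts: return a message only when the metric crosses the threshold."""
    mean = mean_duration_ms(events, name)
    if mean is not None and mean > threshold_ms:
        return f"ALERT {name}: mean {mean:.0f} ms exceeds {threshold_ms:.0f} ms"
    return None
```

The point of the design is that metrics and alerts are derived from the trace events rather than logged separately, so a question you did not anticipate can still be answered from the same data.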
What You Will Build
A structured error handling system that covers all 7 failure modes, a retry engine with circuit breakers, and an observability pipeline that collects trace events into queryable, alertable data.
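As a preview of the retry engine, the following is a rough sketch of retries with exponential backoff guarded by a circuit breaker. All names here (`CircuitBreaker`, `CircuitOpenError`, `call_with_retry`) and all default values are hypothetical, and the version you build in this chapter may be structured differently.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are being rejected."""


class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down has passed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so tests need not wait in real time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: calls flow normally
        # Open: reject until the cool-down elapses, then let calls through
        # again. A failure at that point re-opens the breaker.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_retry(
    fn: Callable[[], T],
    breaker: CircuitBreaker,
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff while the breaker allows it."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; not calling dependency")
        try:
            result = fn()
        except Exception as exc:
            breaker.record_failure()
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay_s * 2 ** attempt)  # 0.1 s, 0.2 s, 0.4 s, ...
        else:
            breaker.record_success()
            return result
    assert last_error is not None
    raise last_error
```

Retries handle transient failures. The breaker handles persistent ones: once a dependency has failed repeatedly, it stops the retry loop from hammering it.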
What's Next
There is no next chapter. This is it.
Over 16 chapters, you built a multi-agent coding system from scratch. You started with a loop that calls an LLM and executes the tool calls it requests. From there, the recipe went: Add a registry so tools are discoverable. Plan with task graphs. Remember with context. Edit files safely. Gate changes with approvals. Isolate execution in worktrees. Parallelize with lanes. Delegate with supervisors and workers. Schedule with DAGs. Review with human-in-the-loop queues. Hand off with checkpoints. Inject capabilities with skills. Automate with events. Assemble into a complete system. Observe, debug, and operate it in production.
The same patterns show up in production agentic coding systems such as OpenAI Codex, Anthropic Claude Code, Devin, and Factory. The implementations differ. The scale differs. But the architecture (the loop, the tools, the planner, the memory, the safety layers, the orchestration, the observability) is recognizably the same.
You are no longer a user of these systems. You understand how they work. You can build them, debug them, and extend them. Go build something.