Calm Systems Scale Better — Subhakanta Kar

There is a category of engineer who is admired for their ability to fix things under pressure. Systems go down in the middle of the night, and they restore them. Complex failures cascade, and they unravel them. The organisation comes to depend on this person — and that dependence is, in most cases, a sign that something has gone wrong architecturally.

Systems that require heroics to operate are not good systems. They are systems that have been designed, consciously or not, in a way that makes routine operation difficult and failure recovery dependent on individual expertise. Calm systems do not need heroes. They need process.

[Figure: Two paths, how calm and agitated systems diverge at scale. From a load increase, calm disciplines lead through observability and bounded failures to scale as engineering; accumulated drift leads through hero reliance to scale as emergency.]

What calmness means in a technical system

A calm system is one that behaves predictably under normal conditions and degrades gracefully under abnormal ones. It does not have components that fail silently. It does not have failure modes that are undiscoverable except by the person who built the system. And it does not require the person diagnosing a failure to hold the entire system model in their head in order to understand what went wrong.

Practically, this means observable systems — ones that emit useful signals about their own state, that log meaningfully, that have dashboards someone other than the original engineer can read. It means bounded failure — where one component failing does not cascade into multiple failures before anyone notices. And it means documented runbooks — not because the engineers who are on call are incompetent, but because at 2am, even competent engineers benefit from not having to rediscover what they already know.
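One way to picture bounded failure is a circuit breaker around a dependency: after a few consecutive failures, callers stop hitting the dependency for a cool-down period and receive a fallback instead, and the transition is logged so it is not silent. The following is a minimal sketch, not a production implementation. The class name, the thresholds, and the injected clock are illustrative assumptions and do not come from any particular library.

```python
import logging
import time

log = logging.getLogger("breaker")


class CircuitBreaker:
    """Hypothetical helper: stop calling a failing dependency for a while."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # assumed threshold
        self.reset_after_s = reset_after_s  # assumed cool-down in seconds
        self.clock = clock                  # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None               # None means closed (healthy)

    def call(self, fn, fallback):
        # While open, skip the dependency until the cool-down has passed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()
            # Cool-down over: allow one trial call. The failure count is
            # kept, so a failed trial re-opens the circuit immediately.
            self.opened_at = None
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
                # Emit a signal so the failure is visible, not silent.
                log.warning("circuit opened after %d failures", self.failures)
            return fallback()
        self.failures = 0  # a success closes the circuit fully
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, lambda: None)`, where `fetch_profile` is a hypothetical function. The point of the design is that the dependency's failure costs its callers a fallback value rather than a queue of hanging requests.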

Why agitated systems get built

Most systems do not start out agitated. They become that way through accumulation. A workaround that was supposed to be temporary. A coupling introduced under time pressure that was never refactored. An alert threshold set so low that the team learned to ignore it. A manual step in a process that was never automated because the person who understood it had not left yet.

Each of these is a small addition to the system's operational complexity. None of them alone creates a crisis. But together, over time, they produce a system where routine operation requires constant attention, and failures require heroics to resolve. The team adapts by developing expertise in the system's quirks — which makes them indispensable, and makes the system progressively harder for anyone else to operate.

This pattern is self-reinforcing. Agitated systems attract engineers who thrive in agitated environments. Those engineers often make further decisions that add complexity, because complexity is where their expertise creates value. The system becomes harder to calm down with every passing quarter.

Building for calm from the start

The decisions that produce calm systems are mostly made early. Choosing observable components over clever ones. Designing for graceful degradation before designing for peak performance. Automating the routine operational steps before adding new features. Writing the runbook before the system is in production, while the design is still clear in everyone's mind.
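Graceful degradation can be as simple as serving the last known good value when a source is unavailable, while saying clearly that the answer is stale. The sketch below assumes a hypothetical `fetch` callable supplied by the caller. The class name and the shape of the return value are illustrative choices, not an established API.

```python
import time


class StaleOnErrorCache:
    """Hypothetical helper: serve the last good value when the source fails."""

    def __init__(self, fetch, clock=time.monotonic):
        self.fetch = fetch        # caller-supplied function that may raise
        self.clock = clock        # injectable so tests can fake time
        self.value = None
        self.fetched_at = None    # None until the first successful fetch

    def get(self):
        """Return (value, age_seconds, degraded)."""
        try:
            self.value = self.fetch()
            self.fetched_at = self.clock()
            return self.value, 0.0, False
        except Exception:
            if self.fetched_at is None:
                # Nothing to fall back to: fail loudly rather than silently.
                raise
            age = self.clock() - self.fetched_at
            # Degraded but still useful: the flag and age let callers and
            # dashboards see that the data is stale.
            return self.value, age, True
```

Returning the `degraded` flag and the age alongside the value is the design choice that matters here: the system keeps working, but it never pretends that stale data is fresh.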

None of these is expensive. They are disciplines: choices about where to invest attention. The difficulty is that they produce value slowly and invisibly. A system that never has a crisis does not demonstrate its architecture to anyone. The engineers who built it calmly do not receive the recognition that goes to engineers who heroically rescue a failing one.

This is an incentive problem, not an engineering problem. Organisations that want calm systems need to reward the absence of incidents at least as much as they reward their resolution.

Why calm has to be designed in

Calm systems do not happen by accident. They are the result of people deciding, early, that operational predictability matters — not as an afterthought once the features are done, but as one of the features. And they are maintained by organisations that understand that the absence of drama is not luck. It is the result of deliberate decisions made, consistently, over time.

Scale makes this more important, not less. A system that is slightly agitated at small scale is in crisis at large scale, because every manual step and every noisy alert is multiplied by the number of components, and that arithmetic does not soften as the system grows. Build calm in from the beginning, and scale becomes an engineering problem. Inherit agitation, and scale becomes an operational emergency.
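The arithmetic of scale can be made concrete with a simplified model. Assume, purely for illustration, that each component independently has a 1% chance of needing manual intervention in a given week. The probability, the component counts, and the independence assumption are all hypothetical; real failures are often correlated, which usually makes matters worse.

```python
def p_any_intervention(p_per_component: float, n_components: int) -> float:
    """Chance that at least one component needs a human this week,
    assuming components fail independently (a simplifying assumption)."""
    return 1.0 - (1.0 - p_per_component) ** n_components


def expected_interventions(p_per_component: float, n_components: int) -> float:
    """Average number of manual interventions per week under the same model."""
    return p_per_component * n_components


if __name__ == "__main__":
    # Illustrative fleet sizes only.
    for n in (10, 100, 500):
        print(
            n,
            round(p_any_intervention(0.01, n), 3),
            round(expected_interventions(0.01, n), 1),
        )
```

Under these assumptions, a fleet of 10 components needs a human in roughly one week out of ten, while a fleet of 500 needs one in more than 99% of weeks, with about five interventions a week on average. Nothing about the components changed; only the count did.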

Where to start

Systems that require heroic individuals to operate are architecturally fragile, not just operationally inconvenient. Dependence on individuals is a design warning sign.
Observability, bounded failure, and documented runbooks are calm-system disciplines. They cost less to build in early than to retrofit under pressure.
Reward the absence of incidents. Teams that build calm systems often go unrecognised because nothing goes wrong — which is exactly the outcome worth celebrating.