I recently finished reading the excellent book “Release It!” by Michael Nygard. One of the key things I wanted to remember was its list of stability anti-patterns, so this post serves as a reminder of architectural smells to look out for when designing production systems.

These anti-patterns are common forces that create or accelerate failures in production systems. Given the nature of distributed systems, you cannot avoid them entirely. You must accept that they will happen and design your application to be resilient to the failures they cause.

Integration Points

Integration points are the number-one killer of systems. Every single one … presents a stability risk.

Every single socket, process, RPC or REST API call can hang. To avoid failure, your program must treat each system it integrates with as a potential failure point and react accordingly.
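As a minimal sketch of treating a remote call as something that can hang, the Python snippet below puts a timeout on a raw socket read. The function name `read_with_timeout` and its parameters are made up for illustration; the only library call it relies on is the standard `socket.create_connection`.

```python
import socket
from typing import Optional


def read_with_timeout(host: str, port: int, timeout_s: float) -> Optional[bytes]:
    """Connect and read one chunk, giving up after timeout_s seconds.

    Returns None if the remote side hangs or cannot be reached.
    (Hypothetical helper, not from the book.)
    """
    try:
        # The timeout is set on the socket, so it covers the connect
        # and each later recv call.
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            return conn.recv(1024)
    except OSError:
        # socket.timeout and ConnectionRefusedError are both OSError subclasses.
        return None
```

Without the timeout, a peer that accepts the connection and then goes silent would block the calling thread indefinitely.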

Chain Reactions

One server down jeopardizes the rest.

Whenever one server in a pool goes down, the rest of the servers pick up the slack. The increased load makes each of them more likely to fail, often from the same defect.

In addition, when running a horizontally scalable service, a defect in the application exists in every one of the servers running the service. If one server goes down from a software defect, it is likely that the rest of the servers will fail from the same defect.
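To put rough numbers on how the load shifts, here is a small Python sketch. It assumes an evenly balanced pool and a constant total load, and the figures (8 servers at 70% utilization) are invented for illustration.

```python
def utilization_after_failures(servers: int, utilization: float, failed: int) -> float:
    """Per-server utilization once `failed` servers drop out of an evenly balanced pool."""
    survivors = servers - failed
    if survivors <= 0:
        raise ValueError("no servers left to take the load")
    # Total work stays the same; it is spread over fewer machines.
    return utilization * servers / survivors


# Hypothetical pool: 8 servers, each running at 70% before anything fails.
for failed in range(3):
    print(failed, round(utilization_after_failures(8, 0.70, failed), 3))
```

Under these assumptions, losing one server pushes the survivors to about 80% and losing two pushes them to about 93%, so each failure leaves less headroom for the next one.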

Cascading Failures

Preventing cascading failures is the very key to resilience.

In micro-service architectures, dependencies between services resemble directed acyclic graphs. A cascading failure occurs when problems in one service cause problems in the services that call it.

Make sure that your system can stay up when other systems go down.
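One common way to stay up while a dependency is down is a circuit breaker. This post does not describe one, so treat the following Python as a sketch under my own assumptions: the class name, the thresholds, and the fallback-value approach are all illustrative.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period (illustrative)."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: T) -> T:
        # While open, skip the dependency entirely until the cool-down ends.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback
            self.opened_at = None  # cool-down over: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            return fallback
        self.failures = 0  # a success closes the breaker again
        return result
```

A caller might wrap a remote lookup as `breaker.call(lambda: fetch_profile(user_id), fallback=None)`, where `fetch_profile` is a hypothetical function. Once the breaker opens, the caller stops spending threads on a service that is already in trouble.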

Users

Users are a terrible thing.

Every user of your system consumes resources: memory, database connections, processing threads. Make sure you understand the worst things users can do to your system, and test aggressively for them.

Blocked Threads

Application failures nearly always relate to blocked threads in one way or another.

Database connections, server connections, and third-party library code typically contain thread-based logic. These are usually found near integration points and, if one of them blocks, it can quickly lead to chain reactions and cascading failures. Blocked threads end up slowing responses, creating a feedback loop that amplifies a minor problem into total system failure.
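A small Python sketch of one defence: never wait forever for a pooled resource. `BoundedPool` and `PoolExhausted` are hypothetical names, and a real connection pool would do much more. The point is only the `timeout` argument on the standard `queue.Queue.get`.

```python
import queue


class PoolExhausted(Exception):
    """Raised when no pooled resource frees up within the wait limit."""


class BoundedPool:
    """Toy resource pool whose acquire() never blocks indefinitely."""

    def __init__(self, resources, wait_s: float) -> None:
        self._free = queue.Queue()
        for resource in resources:
            self._free.put(resource)
        self._wait_s = wait_s

    def acquire(self):
        try:
            # Block for at most wait_s seconds instead of forever.
            return self._free.get(timeout=self._wait_s)
        except queue.Empty:
            raise PoolExhausted(
                "no resource free after %.1fs" % self._wait_s) from None

    def release(self, resource) -> None:
        self._free.put(resource)
```

A thread that gets `PoolExhausted` can return an error straight away, rather than joining a growing line of threads stuck on the same resource.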

Attacks of Self-Denial

Good marketing can kill you at any time.

Attacks of self-denial occur when an organization triggers flash mobs and traffic spikes on its own site. Make sure nobody sends mass emails with deep links. Create static landing zone pages for advertising campaigns. Communication within the organization is key.

Scaling Effects

Patterns that work fine in small environments might slow down or fail completely when you move to production sizes.

Whenever possible, build out a shared-nothing architecture. Each server operates independently, without the need for coordination or calls to a central service. With a shared-nothing architecture, capacity scales with the number of servers.

Unbalanced Capacities

Over short periods of time, your hardware capacity is fixed.

Imagine a traffic spike. Your front-end service experiences an increase in load that’s visible on a dashboard. Adding more servers to the front-end appears to solve the problem. However, each of these front-end servers requires connections to a back-end service that is now vastly under-provisioned.

In development and QA you probably test with one or two servers per service. In production, the ratio between services may be drastically different from what you tested.
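The mismatch is easy to see with a back-of-the-envelope calculation. In this Python sketch, the server counts and connection limits are invented numbers, not measurements from any real system.

```python
def backend_headroom(front_ends: int, conns_per_front_end: int,
                     back_ends: int, conns_per_back_end: int) -> int:
    """Spare back-end connections; a negative value means the back end is short."""
    demand = front_ends * conns_per_front_end
    capacity = back_ends * conns_per_back_end
    return capacity - demand


# Hypothetical QA setup: 2 front-ends, 1 back-end.
qa = backend_headroom(2, 10, 1, 50)
# Hypothetical production setup after scaling out the front-end for a spike.
prod = backend_headroom(40, 10, 4, 50)
print(qa, prod)
```

With these made-up figures QA has 30 spare connections, while production is 200 short. Nothing in the QA environment would have revealed that.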

Slow Responses

Generating a slow response is worse than returning an error.

A quick failure allows calling systems to finish processing their transaction, and retry or fail as necessary. Slow responses, on the other hand, tie up resources on each calling system. Slow responses can easily trigger cascading failures as request-handling threads are tied up. It is better to fail fast.
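One way to fail fast on the serving side is to refuse work immediately once a concurrency limit is reached, rather than queueing the caller. The Python below is a sketch of that idea. `LoadShedder`, the limit, and the status-code tuples are assumptions for illustration.

```python
import threading
from typing import Any, Callable, Tuple


class LoadShedder:
    """Rejects requests at once when too many are already in flight."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work: Callable[[], Any]) -> Tuple[int, Any]:
        # Non-blocking acquire: refuse immediately rather than make the caller wait.
        if not self._slots.acquire(blocking=False):
            return 503, "busy, retry later"
        try:
            return 200, work()
        finally:
            self._slots.release()
```

The caller gets a prompt 503 and can retry or degrade on its own terms, instead of holding a thread open for a response that may never come.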

SLA Inversion

When calling third parties, service levels can only decrease.

Your system is only as reliable as the systems it depends on. If you commit to a high SLA, are you sure that each of the services you depend on lets you meet that SLA? Don’t make empty promises.
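A quick worked example in Python shows why service levels can only decrease. It assumes that every dependency is required for each request and that failures are independent, which is a simplification. The availability figures are invented.

```python
import math
from typing import Iterable


def best_case_availability(own: float, dependencies: Iterable[float]) -> float:
    """Upper bound on availability when every dependency must be up.

    Assumes independent failures; correlated failures change the picture.
    """
    return own * math.prod(dependencies)


# Hypothetical service: 99.99% on its own, calling three 99.9% dependencies.
combined = best_case_availability(0.9999, [0.999, 0.999, 0.999])
print(f"{combined:.4%}")
```

Under these assumptions the combined figure comes out at roughly 99.69%, below every individual component, so promising 99.9% to your own callers would already be an empty promise.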

Unbounded Result Sets

Design with skepticism, and you will achieve resilience.

Be prepared for the possibility that a query returns an unbounded number of results. This typically does not happen with test data, only after the code hits production. Be sure to test with realistic data volumes.
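As a minimal sketch of being skeptical about result sizes, the Python below caps how many rows a caller will accept, using the standard `sqlite3` module. The helper name `fetch_capped` and the default cap are assumptions. Adding a `LIMIT` clause to the SQL itself is another option.

```python
import sqlite3
from typing import Any, List, Sequence, Tuple


def fetch_capped(conn: sqlite3.Connection, sql: str,
                 params: Sequence[Any] = (),
                 max_rows: int = 1000) -> Tuple[List[tuple], bool]:
    """Return at most max_rows rows plus a flag saying whether more existed."""
    cur = conn.execute(sql, params)
    # Ask for one extra row so we can tell "exactly max_rows" from "too many".
    rows = cur.fetchmany(max_rows + 1)
    truncated = len(rows) > max_rows
    return rows[:max_rows], truncated
```

The `truncated` flag lets the caller decide whether to page, warn, or fail, instead of silently loading whatever the database hands back.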

Colophon

All quotes are from the book “Release It!” by Michael Nygard.