Testing in Production — Building Observable Distributed Systems

Complex systems exhibit unexpected behavior. — John Gall, The Systems Bible One reaction to the myriad failure cases we encounter with distributed systems is to add more testing. Unfortunately, testing is a best-effort verification of system correctness — we simply cannot predict the failure cases that will happen in production. What’s more, any environment that we use to verify system behaviour is — at best — a pale imitation of our production environment. An alternative to adding more tests is to add more monitoring and alerting so we know as soon as possible that a system shows signs of degradation. Unfortunately, monitoring and alerting suffers from the same faults as testing — we monitor for behaviour that is predictable in nature or that we have experienced in the past. Together, monitoring and testing try to enumerate all possible permutations of partial and total system failure. Unfortunately, they only provide a simulation of how you expect a system to actually function in production. ...

September 25, 2018 · 8 min · Kevin Sookocheff