Tuesday, 7 April 2020

John Allspaw - "Fault Injection in Production" [ACM Queue]

Why testing failure in production is important
https://queue.acm.org/detail.cfm?id=2353017

Building resilient systems requires experience with failure, and we want to anticipate and confirm our expectations surrounding failure more often, not less often. Shying away from the effects of failure in a misguided attempt to reduce risk will result in poor designs, stale recovery skills, and a false sense of safety.

Why production? Why not simulate this in a QA or staging environment? First, the existence of any differences in those environments brings uncertainty to the exercise. Second, the risk of not recovering has no consequences during testing, which can bring hidden assumptions into the fault-tolerance design and into recovery. The goal is to reduce uncertainty, not increase it.