An Analysis of Production Failures in Distributed Data-intensive Systems
Failure set
Aspirator: A simple static checker
One of our findings is that some of the most catastrophic failures, i.e., failures that affect all or a majority of users, are caused by simple bugs in error handling code. We extracted a few rules from these bugs and built a static checker, Aspirator, to detect them automatically. The source code of Aspirator is available here:
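To give a feel for the kind of rule such a checker might apply, here is a minimal sketch in Python. It is not Aspirator's code. It assumes, for illustration only, that two of the rules are "the exception handler is empty" and "the exception handler contains only a TODO/FIXME comment", and it scans Java source text with regular expressions rather than doing real program analysis. The names `find_suspicious_handlers` and `SAMPLE` are hypothetical.

```python
import re

# Assumed rule for illustration: a catch block whose body is empty or
# holds nothing but comments. Nested braces inside the handler are not
# supported by this simple pattern.
CATCH_RE = re.compile(r"catch\s*\(([^)]*)\)\s*\{([^{}]*)\}")
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def find_suspicious_handlers(java_source):
    """Return (line_number, reason) pairs for suspicious catch blocks."""
    findings = []
    for match in CATCH_RE.finditer(java_source):
        body = match.group(2)
        # 1-based line on which the `catch` keyword starts.
        line = java_source.count("\n", 0, match.start()) + 1
        code_only = COMMENT_RE.sub("", body).strip()
        if code_only:
            continue  # handler has real statements; not flagged
        if re.search(r"TODO|FIXME", body):
            findings.append((line, "handler contains only a TODO/FIXME"))
        else:
            findings.append((line, "empty handler"))
    return findings


# Hypothetical Java snippet with one empty handler, one TODO-only
# handler, and one handler that actually logs the error.
SAMPLE = """\
try {
    flush();
} catch (IOException e) {
}
try {
    sync();
} catch (InterruptedException e) {
    // TODO: handle this
}
try {
    close();
} catch (IOException e) {
    log.error("close failed", e);
}
"""

if __name__ == "__main__":
    for line, reason in find_suspicious_handlers(SAMPLE):
        print(f"line {line}: {reason}")
```

On the sample above, the sketch flags the first two handlers and leaves the third alone. A text-based scan like this misses many cases (nested blocks, string literals containing braces), so it only illustrates the idea of rule-based checking of error handling code.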
Our paper stirred quite a few online discussions in the news, blogs, developer mailing lists, and hundreds of tweets (see this and this). Here are a few of them (and if you wrote something about it, let us know!):