An Analysis of Production Failures in Distributed Data-intensive Systems
This page contains the detailed analysis of a large collection of failure reports. This is the dataset we used in our paper titled:
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-intensive Systems
Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. In the Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI'14), October 2014.
The failures are from five widely used open source software projects: Cassandra, HDFS, Hadoop MapReduce, HBase, and Redis. In each diagnosis report, we document in detail the root causes and symptoms of the failure, as well as how it manifested. We also discuss whether each failure produced sufficient log messages for diagnosis, and how it was fixed. If we reproduced a failure, we also document the detailed procedure for reproducing it.
Aspirator: A simple static checker
One of our findings is that some of the most catastrophic failures, i.e., failures that affect all or a majority of users, are caused by simple bugs in error handling code. We extracted a few rules from these bugs and built a static checker, Aspirator, to detect such bugs automatically. The source code of Aspirator is available here:
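To illustrate the kind of rule mentioned above, the following is a minimal sketch and not Aspirator's source code. It assumes two of the handler patterns discussed in the paper, empty catch blocks and catch blocks containing TODO or FIXME comments, and approximates them with plain text matching over Java source. The function names, the sample class, and the naive brace matching (which ignores braces inside string literals and comments) are all illustrative assumptions.

```python
import re

# Hypothetical, text-based approximation; a real checker would analyze
# the program's structure rather than raw source text.
CATCH_RE = re.compile(r"catch\s*\([^)]*\)\s*\{")
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def catch_bodies(java_source):
    """Yield (line_number, body) for each catch block via naive brace matching."""
    for match in CATCH_RE.finditer(java_source):
        depth = 1
        pos = match.end()
        while pos < len(java_source) and depth > 0:
            if java_source[pos] == "{":
                depth += 1
            elif java_source[pos] == "}":
                depth -= 1
            pos += 1
        # Exclude the closing brace when the block was properly terminated.
        body_end = pos - 1 if depth == 0 else pos
        line_number = java_source.count("\n", 0, match.start()) + 1
        yield line_number, java_source[match.end():body_end]


def check_handlers(java_source):
    """Return (line_number, message) pairs for suspicious catch blocks."""
    findings = []
    for line_number, body in catch_bodies(java_source):
        if re.search(r"\b(TODO|FIXME)\b", body):
            findings.append((line_number, "handler contains TODO/FIXME"))
        elif not COMMENT_RE.sub("", body).strip():
            findings.append((line_number, "empty handler"))
    return findings


# Made-up Java snippet used only to exercise the sketch.
SAMPLE = """\
class Example {
    void run() {
        try {
            flush();
        } catch (IOException e) {
        }
        try {
            sync();
        } catch (Exception e) {
            // TODO: handle this
        }
        try {
            close();
        } catch (IOException e) {
            LOG.error("close failed", e);
            throw new RuntimeException(e);
        }
    }
}
"""

if __name__ == "__main__":
    for line_number, message in check_handlers(SAMPLE):
        print(f"line {line_number}: {message}")
```

On the sample, the first two handlers are flagged and the third, which logs and rethrows, is not. A text-based scan like this is only a rough stand-in; it is meant to show how simple such rules can be, not how reliably they can be enforced.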
Impact
Our paper stirred quite a few online discussions in the news, on blogs, on developer mailing lists, and in hundreds of tweets (see this and this). Here are a few of them (and if you wrote something about it, let us know!):
- Hacker News
- Discussions from HBase developers, which prompted a series of reactions to address the problems we mentioned in the paper.
- Twitter discussions: see this, this, and this (if you're looking for a screenshot that summarizes our paper, see this or this).
- Another word for it.
- Fifty Quick Ideas to Improve Your Tests.
- Postmortem lessons.
- Some discussions on Google+.
- And quite a few emails sent to us from developers...
- Ding Yuan, yuan at eecg dot toronto dot edu