Instructor: Ashvin Goel
Course Number: ECE1781H
Course Time: Fri, 1-3 pm
Course Room: BA4164
Start Date: Jan 11, 2019

Dependable Software Systems

ECE1781, Winter 2019
University of Toronto


Course Description

Modern computer systems have become tightly intertwined with our daily lives. However, they are failure-prone and difficult to manage, and thus hardly dependable. Today, these problems dominate the total cost of ownership of computer systems, and unfortunately they have no simple solutions. There is a realization that these problems cannot be decisively solved but are ongoing facts of life that must be dealt with regularly. To do so, systems should be designed to detect, isolate, and recover from these problems.

This advanced graduate-level course focuses on dependability in software systems and examines current research that aims to address challenges caused by software and hardware bugs and software misconfiguration. Students are expected to read and critique recent research papers in operating systems that cover these areas. They are also expected to work on a research project and make class presentations. While there are no specific prerequisites for this course, students who have taken undergraduate or graduate courses in operating systems, networks and distributed systems will have an edge.

Textbooks

There are no required textbooks for this course. The optional textbooks are:

  • Modern Operating Systems (Third Edition), by Andrew S. Tanenbaum. Published by Prentice Hall, 2008.
  • Distributed Systems: Concepts and Design (Fourth Edition), by George Coulouris, Jean Dollimore and Tim Kindberg. Published by Addison Wesley, 2005.

Mailing List

Please subscribe to the class mailing list by joining the UofT ECE1781 Google Group. Subscribing to the group may require the instructor's approval.

The instructor will use this group to send instructions and reminders. You can email the whole class by writing to this list. If you have a specific question for the instructor, please email the instructor directly.

Grading Policy

Grades will be based on class presentations, a class project, and class participation. There will be no final exam in this course. The grading breakdown is as follows:

  • Class presentation: 30%
  • Class project: 50%
  • Class participation: 20%

Note: If a student is unable to attend a class, he or she will lose 2% for non-participation.

Class Presentation

Each week, this class will cover a group of papers that focuses on a specific aspect of the course. Students are expected to read all the papers in the group that will be presented. At the beginning of the term, each paper will be assigned to a student, who will present it. Presentations will be limited to roughly 20 minutes.

See the Presentation Format page for more details. Please read it very carefully.

Assignments

There will be no assignments in this course.

Class Project

A major component of this course is devoted to a term-long project. The topic of the project is largely up to you, but to help you choose a project, a sample list of projects is provided below. This list should help students determine whether their own projects are of reasonable size and scope.

See the Project Format page for more details. Please read it very carefully.

Project Ideas

Here is a list of project ideas.

Readings

This is a tentative list. Most of these papers can be accessed from the ACM web site. If you cannot access ACM articles directly, please read the following instructions for accessing the papers.

Week 1: Introduction (Jan 11)

  1. Why Do Computers Stop and What Can Be Done About It? SRDS 1986.
  2. Broad New OS Research: Challenges and Opportunities. HotOS 2005.
  3. Introduction to Dependable Software Systems by Instructor.
  4. Efficient Reading of Papers in Science and Technology.
  5. How (and How Not) to Write a Good Systems Paper. Operating Systems Review 1983.

Week 2: Bug Finding and Testing (Jan 18)

  1. Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code. SOSP 2001. Nikhil
  2. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. OSDI 2008. Charles

     Optional reading:

  1. Using Model Checking to Find Serious File System Errors. OSDI 2004.
  2. eXplode: A lightweight, general system for finding serious storage system errors. OSDI 2006.
  3. Hang Analysis: Fighting Responsiveness Bugs. EuroSys 2008.
  4. Cross-checking Semantic Correctness: The Case of Finding File System Bugs. SOSP 2015.
  5. How to Build Static Checking Systems Using Orders of Magnitude Less Code. ASPLOS 2016.

Week 3: Races (Jan 25)

  1. Eraser: A Dynamic Data Race Detector for Multi-Threaded Programs. SOSP 1997. Shihan
  2. Effective Data-Race Detection for the Kernel. OSDI 2010. Elton

     Optional reading:

  1. RacerX: Effective, Static Detection of Race Conditions and Deadlocks. SOSP 2003.
  2. Finding and Reproducing Heisenbugs in Concurrent Programs. OSDI 2008.
  3. Deadlock Immunity: Enabling Systems to Defend Against Deadlocks. OSDI 2008.
  4. CTrigger: Exposing Atomicity Violation Bugs from Their Hiding Places. ASPLOS 2009.
  5. Operating System Transactions. SOSP 2009.
  6. Bypassing Races in Live Applications with Execution Filters. OSDI 2010.
  7. Ad Hoc Synchronization Considered Harmful. OSDI 2010.
  8. A Randomized Scheduler with Probabilistic Guarantees of Finding Bugs. ASPLOS 2010.
  9. Detecting and Surviving Data Races using Complementary Schedules. SOSP 2011.
  10. Pervasive Detection of Process Races in Deployed Systems. SOSP 2011.
  11. Applying Transactional Memory to Concurrency Bugs. ASPLOS 2012.
  12. Data Races vs. Data Race Bugs: Telling the Difference with Portend. ASPLOS 2012.
  13. Automated Concurrency-Bug Fixing. OSDI 2012.
  14. SKI: Exposing Kernel Concurrency Bugs through Systematic Schedule Exploration. OSDI 2014.
  15. Lazy Diagnosis of In-Production Concurrency Bugs. SOSP 2017.

Week 4: Debugging and Failure Diagnosis (Feb 1)

  1. REPT: Reverse Debugging of Failures in Deployed Software. OSDI 2018. Andrew
  2. Orca: Differential Bug Localization in Large-Scale Services. OSDI 2018. Andrew

     Optional reading:

  1. Triage: Diagnosing Production Run Failures at the User's Site. SOSP 2007.
  2. R2: An Application-Level Kernel for Record and Replay. OSDI 2008.
  3. Execution Synthesis: A Technique for Automated Software Debugging. EuroSys 2010.
  4. Anomaly-Based Bug Prediction, Isolation, and Validation: An Automated Approach for Software Debugging. ASPLOS 2009.
  5. ODR: Output-Deterministic Replay for Multicore Debugging. SOSP 2009.
  6. Be Conservative: enhancing failure diagnosis with proactive logging. OSDI 2012.
  7. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. OSDI 2014.
  8. Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures. SOSP 2015.
  9. Towards Practical Default-On Multi-Core Record/Replay. ASPLOS 2017.
  10. Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach. SOSP 2017.
  11. Capturing and Enhancing In Situ System Observability for Failure Detection. OSDI 2018.

Week 5: Generic Failure Recovery (Feb 8)

  1. Exploring Failure Transparency and the Limits of Generic Recovery. OSDI 2000. Junbang
  2. Rx: Treating Bugs As Allergies - A Safe Method to Survive Software Failures. SOSP 2005. Jeffrey

     Optional reading:

  1. Enhancing Server Availability and Security Through Failure-Oblivious Computing. OSDI 2004.
  2. ASSURE: Automatic Software Self-healing Using REscue points. ASPLOS 2009.

Week 6: Application-Specific Recovery (Feb 15 - first report due)

  1. Undo for Operators: Building an Undoable E-mail Store. USENIX 2003. Chenming
  2. Microreboot - A Technique for Cheap Recovery. OSDI 2004. Gengrui

Reading Week: No Class (Feb 22)

Week 7: OS Recovery (Mar 1)

  1. CuriOS: Improving Reliability through Operating System Structure. OSDI 2008. Elton
  2. Recovery Domains: An Organizing Principle for Recoverable Operating Systems. ASPLOS 2009. Zhiqi

Week 8: OS Extension Reliability (Mar 8)

  1. Tolerating Hardware Device Failures in Software. SOSP 2009. Jeffrey
  2. Guardrail: A High Fidelity Approach To Protecting Hardware Devices From Buggy Drivers. ASPLOS 2014. Seung-Hun

     Optional reading:

  1. Dealing With Disaster: Surviving Misbehaved Kernel Extensions. OSDI 1996.
  2. Improving the Reliability of Commodity Operating Systems. SOSP 2003.
  3. Composing OS extensions safely and efficiently with Bascule. EuroSys 2013.
  4. SymDrive: Testing Drivers without Devices. OSDI 2012.

Week 9: Storage Reliability (Mar 15 - second report due)

  1. The FuzzyLog: A Partially Ordered Shared Log. OSDI 2018. Seung-Hun
  2. Fault-Tolerance, Fast and Slow: Exploiting Failure Asynchrony in Distributed Systems. OSDI 2018. Charles

     Optional reading:

  1. Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions. FAST 2017.
  2. Correlated Crash Vulnerabilities. OSDI 2016.

Week 10: File System Reliability (Mar 22)

  1. Finding Crash-Consistency Bugs with Bounded Black-Box Crash Testing. OSDI 2018. Zhiqi
  2. Membrane: Operating System Support for Restartable File Systems. FAST 2010. Albert

     Optional reading:

  1. IRON File Systems. SOSP 2005.
  2. Improving File System Reliability with I/O Shepherding. SOSP 2007.
  3. Analyzing the effects of disk-pointer corruption. DSN 2008.
  4. Recon: Verifying File System Consistency at Runtime. FAST 2012.
  5. HARDFS: Hardening HDFS with Selective and Lightweight Versioning. FAST 2013.

Week 11: Formal Techniques (Mar 29)

  1. SibylFS: formal specification and oracle-based testing for POSIX and real-world file systems. SOSP 2015. Chenming
  2. Push-Button Verification of File Systems via Crash Refinement. OSDI 2016. Jayavanta

     Optional reading:

  1. Using Crash Hoare Logic for Certifying the FSCQ File System. SOSP 2015.
  2. SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems. OSDI 2014.
  3. Specifying and Checking File System Crash-Consistency Models. ASPLOS 2016.

Week 12: System Misconfiguration (Apr 5)

  1. Do Not Blame Users for Misconfigurations. SOSP 2013. Yilun
  2. Early Detection of Configuration Errors to Reduce Failure Damage. OSDI 2016. Zihan

     Optional reading:

  1. Understanding and Dealing with Operator Mistakes in Internet Services. OSDI 2004.
  2. Configuration Debugging as Search: Finding the Needle in the Haystack. OSDI 2004.
  3. Automatic Misconfiguration Troubleshooting with PeerPressure. OSDI 2004.
  4. AutoBash: Improving Configuration Management with Operating System Causality Analysis. SOSP 2007.
  5. Enabling Configuration-Independent Automation by Non-Expert Users. OSDI 2010.
  6. Barricade: Defending Systems Against Operator Mistakes. EuroSys 2010.
  7. Fingerprinting the Datacenter: Automated Classification of Performance Crises. EuroSys 2010.
  8. Automating Configuration Troubleshooting with Dynamic Information Flow Analysis. OSDI 2010.
  9. An Empirical Study on Configuration Errors in Commercial and Open Source Systems. SOSP 2011.
  10. Automatic Root-Cause Diagnosis of Performance Anomalies in Production Software. OSDI 2012.
  11. EnCore: Exploiting System Environment And Correlation Information For Misconfiguration Detection. ASPLOS 2014.
  12. ConfValley: A Systematic Configuration Validation Framework for Cloud Services. EuroSys 2015.

Week 13: No Class (Apr 12)

Instructor is away.

Week 14: Project Presentations (Apr 19 - final report due)