Software Modulated Fault Tolerance
George A. Reis
Ph.D. Thesis, Department of Electrical Engineering, Princeton University, June 2008.

In recent decades, microprocessor performance has been increasing exponentially, due in large part to smaller and faster transistors enabled by improved fabrication technology. While such transistors yield performance enhancements, their smaller size and sheer number make chips more susceptible to transient faults. Designers frequently introduce redundant hardware or software to detect these faults because process and material advances are often insufficient to mitigate their effect. Regardless of the methods used for adding reliability, these techniques incur significant performance penalties because they uniformly protect the entire application. They do not consider the varying resilience to transient faults of different program regions. This uniform protection leads to wasted resources that reduce performance and/or increase cost.

To maximize fault coverage while minimizing the performance impact, this dissertation takes advantage of two key insights: not all faults in an unprotected application will cause an incorrect answer, and not all parts of the program respond the same way to reliability techniques. First, this dissertation demonstrates the varying vulnerability and performance responses of an application and identifies the regions that are most susceptible to faults as well as those that are inexpensive to protect. Second, this dissertation advocates the use of software and hybrid approaches to fault tolerance to enable the synergy of high-level information with specific redundancy techniques. Third, this dissertation demonstrates how to exploit this non-uniformity via Software Modulated Fault Tolerance. Software Modulated Fault Tolerance leverages reliability and performance information at a high level and directs the reliability choices at fine granularities to provide the most efficient use of processor resources for an application.
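
To make the idea of non-uniform protection concrete, the following is a minimal sketch of how per-region vulnerability and cost estimates might drive a protection decision. It is not the dissertation's algorithm: the `Region` type, its fields, the `choose_protected_regions` function, and the greedy benefit-per-cost heuristic are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical description of one program region (illustrative only)."""
    name: str
    vulnerability: float  # assumed share of harmful faults that land in this region
    cost_pct: int         # assumed slowdown, in percent, if this region is protected


def choose_protected_regions(regions, budget_pct):
    """Greedily pick regions to protect within a slowdown budget.

    Regions are ranked by vulnerability removed per percent of slowdown,
    and each one is taken if it still fits in the remaining budget.
    This is a simple heuristic, not an optimal selection.
    """
    def benefit_per_cost(region):
        if region.cost_pct == 0:
            return float("inf")  # free protection is always worth taking
        return region.vulnerability / region.cost_pct

    chosen = []
    spent = 0
    for region in sorted(regions, key=benefit_per_cost, reverse=True):
        if spent + region.cost_pct <= budget_pct:
            chosen.append(region.name)
            spent += region.cost_pct
    return chosen


# Made-up numbers, purely for demonstration.
example_regions = [
    Region("hot_loop", vulnerability=0.5, cost_pct=10),
    Region("init", vulnerability=0.05, cost_pct=20),
    Region("checksum", vulnerability=0.3, cost_pct=5),
]
protected = choose_protected_regions(example_regions, budget_pct=15)
```

With these made-up inputs, the sketch spends the whole 15 percent budget on the two regions with the best vulnerability-to-cost ratio and leaves the expensive, low-vulnerability region unprotected, which is the kind of trade-off uniform protection cannot express.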