Software Modulated Fault Tolerance
George A. Reis
Ph.D. Thesis, Department of Electrical Engineering,
Princeton University, June 2008.
In recent decades, microprocessor performance has been increasing
exponentially, due in large part to smaller and faster transistors
enabled by improved fabrication technology. While such transistors
yield performance enhancements, their smaller size and sheer number
make chips more susceptible to transient faults. Designers frequently
introduce redundant hardware or software to detect
these faults because process and material advances are often
insufficient to mitigate their effect.
Regardless of the method used to add reliability, these
techniques incur significant performance penalties because they
protect the entire application uniformly. They do not consider
that different program regions vary in their resilience to
transient faults. This uniform protection wastes resources, which
reduces performance and/or increases cost.
To maximize fault coverage while minimizing the performance impact,
this dissertation takes advantage of two key insights: not all faults in
an unprotected application will cause an incorrect answer, and not all
parts of the program respond the same way to reliability techniques.
First, this dissertation demonstrates that vulnerability and
performance responses vary across an application, and it identifies
the regions that are most susceptible to faults as well as those that
are inexpensive to protect. Second, this dissertation advocates the use of
software and hybrid approaches to fault tolerance to enable the
synergy of high-level information with specific redundancy techniques.
Third, this dissertation demonstrates how to exploit this non-uniformity via
Software Modulated Fault Tolerance, which leverages high-level
reliability and performance information to direct reliability choices
at fine granularity, making the most efficient use of processor
resources for an application.