Configurable Transient Fault Detection via Dynamic Binary Translation
George A. Reis, Jonathan Chang, David I. August, Robert Cohn, and Shubhendu S. Mukherjee
Proceedings of the 2nd Workshop on Architectural Reliability (WAR), December 2006.
Smaller feature sizes, lower voltage levels, and reduced noise margins
have helped improve the performance and lower the power consumption
of modern microprocessors. These same advances have
made processors more susceptible to transient faults that can corrupt
data and make systems unavailable. Designers often compensate for
transient faults by adding hardware redundancy and making circuit- and
process-level adjustments. However, applications have different
data integrity and availability demands, which make hardware
approaches such as these too costly for many markets.

Software techniques can provide fault tolerance at a lower cost and
with greater flexibility since they can be selectively deployed in the
field even after the hardware has been manufactured. Most existing
software-only techniques use recompilation, requiring access to program
source code. Regardless of the code transformation method,
previous techniques also incur significant, unnecessary performance
penalties by uniformly protecting the entire program without taking
into account the varying vulnerability of different program regions
and state elements to transient faults.

This paper presents Spot, a software-only fault-detection technique
which uses dynamic binary translation to provide software-modulated
fault tolerance with fine-grained control of redundancy.
By using dynamic binary translation, users can improve the reliability
of their applications without any assistance from hardware or
software vendors. By using software-modulated fault tolerance, Spot
can vary the level of protection independently for each register and
region of code to provide users with more, and often superior, fault-detection
options. This feature of Spot increases the mean work to
failure from 1.90x to 17.79x.
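
The idea of varying protection per register can be illustrated with a toy model. The sketch below is not Spot's implementation and does not perform binary translation; it is a minimal simulation, assuming a hypothetical three-address instruction format, in which only registers in a chosen `protected` set are recomputed into shadow copies and compared against the primary copies at the end of the run. The names `run`, `FaultDetected`, `PROGRAM`, and `REGS` are all invented for this illustration.

```python
import operator
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical instruction format: (dest, op, src_a, src_b).
Instr = Tuple[str, Callable[[int, int], int], str, str]


class FaultDetected(Exception):
    """Raised when a protected register disagrees with its shadow copy."""


def run(
    program: List[Instr],
    regs: Dict[str, int],
    protected: Set[str],
    fault: Optional[Tuple[int, str, int]] = None,
) -> Dict[str, int]:
    """Execute `program`, duplicating work only for protected registers.

    `fault` is an optional (instruction_index, register, bit) triple that
    flips one bit in the primary copy of `register` right after the
    instruction at `instruction_index` executes.
    """
    primary = dict(regs)
    # Shadow copies exist only for registers the user chose to protect.
    shadow = {r: v for r, v in regs.items() if r in protected}

    for i, (dest, op, a, b) in enumerate(program):
        primary[dest] = op(primary[a], primary[b])
        if dest in protected:
            # Redundant computation reads shadow operands when they exist
            # and falls back to the (unchecked) primary value otherwise.
            shadow[dest] = op(shadow.get(a, primary[a]),
                              shadow.get(b, primary[b]))
        if fault is not None and fault[0] == i:
            _, reg, bit = fault
            primary[reg] ^= 1 << bit  # simulated transient bit flip

    # Single check at the end; a real scheme would place checks at
    # chosen points in the code instead.
    for r in sorted(shadow):
        if primary[r] != shadow[r]:
            raise FaultDetected(r)
    return primary


# r1 = r0 + r0 ; r2 = r1 * r0
PROGRAM: List[Instr] = [
    ("r1", operator.add, "r0", "r0"),
    ("r2", operator.mul, "r1", "r0"),
]
REGS = {"r0": 3, "r1": 0, "r2": 0}
```

In this model, flipping a bit in `r1` is caught when `r1` is in the protected set, but goes unnoticed when only `r2` is protected, because the redundant computation of `r2` then reads the already corrupted `r1`. Protecting fewer registers means fewer duplicated operations, which is the cost and coverage trade-off that the abstract describes at a high level.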