Software-Controlled Fault Tolerance
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August, and Shubhendu S. Mukherjee
ACM Transactions on Architecture and Code Optimization (TACO), December 2005.
Traditional fault tolerance techniques typically utilize
resources ineffectively because they cannot adapt to the changing
reliability and performance demands of a system. This paper proposes
software-controlled fault tolerance, a concept allowing designers and
users to tailor their performance and reliability for each
situation. Several software-controllable fault detection techniques
are then presented: SWIFT, a software-only technique, and CRAFT, a
suite of hybrid hardware/software techniques. Finally, the paper
introduces PROFiT, a technique which adjusts the level of protection
and performance at fine granularities through software control. When
coupled with software-controllable techniques like SWIFT and CRAFT,
PROFiT offers attractive and novel reliability options.