Front Page

The Liberty Research Group

Parallelization Project

SWIFT: Software Implemented Fault Tolerance [abstract] (ACM DL, PDF)
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, and David I. August
Proceedings of the Third International Symposium on Code Generation and Optimization (CGO), March 2005.
Accept Rate: 33% (25/75).
Winner Best Paper Award.
Winner of the 2015 International Symposium on Code Generation and Optimization Test of Time Award.

To improve performance and reduce power consumption, processor designers employ advances that shrink feature sizes, lower voltage levels, reduce noise margins, and increase clock rates. These advances, however, also make processors more susceptible to transient faults that can affect program correctness. To mitigate this increasing problem, designers build redundancy into systems to the degree that the soft-error budget will allow.

While reliable systems typically employ hardware techniques to address soft-errors, software techniques can provide a lower cost and more flexible alternative. To make this alternative more attractive, this paper presents a new software fault tolerance technique, called SWIFT, for detecting transient errors. Like other single-threaded software fault tolerance techniques, SWIFT efficiently manages redundancy by reclaiming unused instruction-level resources present during the execution of most programs. SWIFT, however, eliminates the need to double the memory requirement by acknowledging the use of ECC in caches and memory. SWIFT also provides a higher level of protection with enhanced checking of the program counter (PC) at no performance cost. In addition, this enhanced PC checking makes most code inserted to detect faults in prior methods unnecessary, significantly enhancing performance. While SWIFT can be implemented on any architecture and can protect individual code segments to varying degrees, we evaluate a fully-redundant implementation running on Itanium 2. In these experiments, SWIFT demonstrates exceptional fault-coverage with a reasonable performance cost. Compared to the best known single-threaded approach utilizing an ECC memory system, SWIFT demonstrates a 51% average speedup.