Runtime Asynchronous Fault Tolerance via Speculation
Yun Zhang, Soumyadeep Ghosh, Jialu Huang, Jae W. Lee, Scott A. Mahlke, and David I. August
Proceedings of the 2012 International Symposium on Code Generation
and Optimization (CGO), April 2012.
Transient faults are emerging as a critical reliability concern
in modern microprocessors. Redundant hardware solutions are commonly deployed
to detect transient faults, but they are less flexible and less cost-effective
than software solutions. Software solutions, however, are rendered impractical
by their high performance overheads. To address this problem, this paper
presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the
fastest transient fault detection technique known to date. Serving as a virtual
layer between the application and the underlying platform, RAFT automatically
generates two symmetric program instances from a program binary. It detects
transient faults in a noninvasive way, and exploits high-confidence value
speculation to achieve low runtime overhead. Evaluation on a commodity
multicore system demonstrates that RAFT delivers a geomean performance overhead
of 2.03% on a set of 23 SPEC CPU benchmarks. Compared with existing transient
fault detection techniques, RAFT exhibits the best performance and fault
coverage, without requiring any change to the hardware or the software
applications.
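
The redundant-execution idea in the abstract can be illustrated with a toy sketch. This is not RAFT's implementation: the names `running_sum`, `flip_bit`, and `detect_fault` are hypothetical, the "program" is a simple running sum, and the speculation is modeled under the assumption that the leading instance queues its outputs and continues without waiting, while the trailing instance re-executes and checks them afterwards.

```python
from collections import deque


def running_sum(values):
    """Toy 'program': yields an externally visible output at every step."""
    total = 0
    for v in values:
        total += v
        yield total


def flip_bit(value, bit):
    """Model a transient fault as a single bit flip (illustrative only)."""
    return value ^ (1 << bit)


def detect_fault(values, fault_at=None, bit=0):
    """Return the step at which the two instances disagree, or None."""
    pending = deque()

    # Leading instance: runs ahead, assuming (speculating) that its outputs
    # will match the trailing instance, and queues them for later checking.
    for step, out in enumerate(running_sum(values)):
        if step == fault_at:
            out = flip_bit(out, bit)  # inject a fault into this one output
        pending.append((step, out))

    # Trailing instance: re-executes the same program and verifies each
    # queued output asynchronously with respect to the leading instance.
    for out in running_sum(values):
        lead_step, lead_out = pending.popleft()
        if lead_out != out:
            return lead_step  # mismatch: transient fault detected
    return None
```

Calling `detect_fault([1, 2, 3, 4])` reports no fault, while `detect_fault([1, 2, 3, 4], fault_at=2)` reports a mismatch at step 2. The two instances run one after the other here only to keep the sketch deterministic; the abstract describes two program instances on a multicore system.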