DAFT: Decoupled Acyclic Fault Tolerance
Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August
The International Journal of Parallel Programming (IJPP), February 2012.
Invited.
Special issue composed of "top papers" selected by the
Program Committee of the 19th International Conference on Parallel
Architectures and Compilation Techniques.
Higher transistor counts, lower voltage levels, and reduced noise
margin increase the susceptibility of multicore processors to
transient faults. Redundant hardware modules can detect such errors,
but software transient fault detection techniques are more appealing
for their low cost and flexibility. Recent software proposals double
register pressure or memory usage, or are too slow in the absence of
hardware extensions, preventing widespread acceptance. This paper
presents DAFT, a fast, safe, and memory-efficient transient fault
detection framework for commodity multicore systems. DAFT replicates
computation across multiple cores and schedules fault detection off
the critical path. Where possible, values are speculated to be correct
and only communicated to the redundant thread at essential program
points. DAFT is implemented in the LLVM compiler framework and
evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a
commodity multicore system. Results demonstrate DAFT's high
performance and broad fault coverage. Speculation allows DAFT to
reduce the performance overhead of software redundant multithreading
from an average of 200% to 38% with no degradation of fault coverage.
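
As a rough illustration of the general idea of redundant multithreading with checking kept off the critical path, the following Python sketch runs a leading thread that computes results and forwards them over a one-way queue, while a trailing thread recomputes each value and records any mismatch. It is not DAFT's implementation: DAFT is a compiler transformation in LLVM, whereas this is a hand-written toy, and it does not model DAFT's speculation. The names `run_redundant`, `compute`, and `fault_at` are hypothetical, and `fault_at` is an assumed knob that flips one bit to mimic a transient fault. Python threads also do not run bytecode in parallel in the standard interpreter, so the sketch shows only the communication pattern, not the performance behavior.

```python
import queue
import threading


def run_redundant(inputs, compute, fault_at=None):
    """Compute in a leading thread; re-check in a trailing thread.

    fault_at (hypothetical): index at which one bit of the leading
    thread's integer result is flipped to mimic a transient fault.
    Returns (results, mismatches), where mismatches lists the indices
    at which the trailing thread disagreed with the leading thread.
    """
    channel = queue.Queue()  # one-way channel: leading -> trailing
    results = []
    mismatches = []

    def leading():
        for i, x in enumerate(inputs):
            value = compute(x)
            if i == fault_at:
                value ^= 1  # simulated single-bit flip
            results.append(value)
            # Enqueue and keep going; the leading thread never waits
            # for the outcome of the check.
            channel.put((i, x, value))
        channel.put(None)  # sentinel: no more work

    def trailing():
        while True:
            item = channel.get()
            if item is None:
                break
            i, x, value = item
            # Redundant recomputation and comparison.
            if compute(x) != value:
                mismatches.append(i)

    threads = [
        threading.Thread(target=leading),
        threading.Thread(target=trailing),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, mismatches
```

Calling `run_redundant([1, 2, 3], lambda x: x * x)` should report no mismatches, while passing `fault_at=1` should flag index 1. Because data flows in only one direction, the leading thread is never stalled by the checker, which is the property the sketch is meant to convey.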