DAFT: Decoupled Acyclic Fault Tolerance
Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2010.
Selected by the Program Committee as a "top paper" for inclusion
in a special issue of the International Journal of Parallel
Programming (IJPP).
Higher transistor counts, lower voltage levels, and reduced noise
margin increase the susceptibility of multicore processors to
transient faults. Redundant hardware modules can detect such errors,
but software transient fault detection techniques are more appealing
for their low cost and flexibility. Recent software proposals double
register pressure or memory usage, or are too slow in the absence of
hardware extensions, preventing widespread acceptance. This paper
presents DAFT, a fast, safe, and memory-efficient transient fault
detection framework for commodity multicore systems. DAFT replicates
computation across multiple cores and schedules fault detection off
the critical path. Where possible, values are speculated to be correct
and only communicated to the redundant thread at essential program
points. DAFT is implemented in the LLVM compiler framework and
evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a
commodity multicore system. Results demonstrate DAFT's high
performance and broad fault coverage. Speculation allows DAFT to
reduce the performance overhead of software redundant multithreading
from an average of 200% to 38% with no degradation of fault
coverage.
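
The redundant-execution idea described in the abstract can be illustrated with a minimal sketch. This is not DAFT's implementation, which is a compiler transformation in LLVM; it only shows the general shape of software redundant multithreading with one-way communication. The names `compute`, `run_redundant`, and `flip_at`, the running-sum workload, and the simulated bit flip are all hypothetical, and the assumption that the leading thread never waits on the trailing thread is drawn from the "decoupled acyclic" wording of the title. Python threads do not run in parallel on separate cores, so this shows structure, not performance.

```python
import queue
import threading


def compute(values, flip_at=None):
    """Yield a running sum of squares.

    flip_at is a hypothetical fault-injection knob: at that step one bit
    of the accumulator is flipped to mimic a transient fault.
    """
    acc = 0
    for i, v in enumerate(values):
        acc += v * v
        if i == flip_at:
            acc ^= 1 << 3  # simulated single-bit upset
        yield acc


def run_redundant(values, flip_at=None):
    """Run the computation twice and compare results off the leading path.

    Returns the index of the first mismatching step, or None if the two
    executions agree everywhere.
    """
    channel = queue.Queue()   # one-way: leading -> trailing only
    first_mismatch = []

    def leading():
        # The leading thread forwards each value and keeps going; it
        # never blocks waiting for the check to finish.
        for result in compute(values, flip_at):
            channel.put(result)

    def trailing():
        # The trailing thread redoes the work and performs the checks.
        for i, expected in enumerate(compute(values)):
            if channel.get() != expected and not first_mismatch:
                first_mismatch.append(i)

    threads = [threading.Thread(target=leading),
               threading.Thread(target=trailing)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return first_mismatch[0] if first_mismatch else None


if __name__ == "__main__":
    print(run_redundant([1, 2, 3, 4]))             # fault-free run
    print(run_redundant([1, 2, 3, 4], flip_at=2))  # fault injected at step 2
```

In this sketch every intermediate value is forwarded to the checker. The abstract's point about speculation is that most such values can be assumed correct and only those at essential program points need to be communicated, which is where the reported overhead reduction comes from.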