Design and Evaluation of Hybrid Fault-Detection Systems
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August, and Shubhendu S. Mukherjee
Proceedings of the 32nd International Symposium on
Computer Architecture (ISCA), June 2005.
To improve performance and reduce power consumption, processor
designers employ advances that shrink feature sizes, lower voltage
levels, reduce noise margins, and increase clock rates. However, these
advances also make processors more susceptible to transient faults
that can affect program correctness. Up to now, system designers have
primarily considered hardware-only and software-only fault-detection
mechanisms to identify and mitigate the deleterious effects of
transient faults. These two fault-detection systems, however, are
extremes in the design space, representing sharp trade-offs between
hardware cost, reliability, and performance.

In this paper, we identify hybrid hardware/software fault-detection
mechanisms as promising alternatives to hardware-only and
software-only systems. These hybrid systems offer designers more
options to fit their reliability needs within their hardware and
performance budgets. We propose CRAFT, a suite of three such hybrid
techniques, to illustrate the potential of the hybrid approach. We
evaluate CRAFT in relation to existing hardware and software
reliability techniques. For fair, quantitative comparisons among
hardware, software, and hybrid systems, we introduce a new metric,
mean work to failure, which is able to compare systems for which
machine instructions do not represent a constant unit of
work. Additionally, we present a new simulation framework that
rapidly assesses reliability and does not depend on manual
identification of failure modes. Our evaluation illustrates that
CRAFT, and hybrid techniques in general, offer attractive options in
the fault-detection design space.
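
As a rough illustration of why a work-based metric matters, the sketch below compares two systems by work completed per failure and by instructions executed per failure. It assumes that one unit of work is one complete program run. The Trial record, the function names, and all of the numbers are hypothetical and chosen only for illustration; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Hypothetical fault-injection summary for one system."""
    name: str
    runs_completed: int  # units of work: complete program runs
    instructions: int    # total dynamic instructions executed
    failures: int        # runs that ended in an undetected failure


def mean_work_to_failure(t: Trial) -> float:
    """Units of work completed per observed failure."""
    if t.failures <= 0:
        raise ValueError("need at least one observed failure")
    return t.runs_completed / t.failures


def instructions_per_failure(t: Trial) -> float:
    """Instruction-based alternative, shown only for contrast."""
    if t.failures <= 0:
        raise ValueError("need at least one observed failure")
    return t.instructions / t.failures


# Made-up numbers: the protected binary executes twice as many
# instructions to complete the same 1000 program runs.
baseline = Trial("unprotected", 1000, 1_000_000, 40)
protected = Trial("software-redundant", 1000, 2_000_000, 10)

work_gain = mean_work_to_failure(protected) / mean_work_to_failure(baseline)
instr_gain = instructions_per_failure(protected) / instructions_per_failure(baseline)

print(f"gain measured in work per failure:         {work_gain:.1f}x")
print(f"gain measured in instructions per failure: {instr_gain:.1f}x")
```

With these made-up inputs, the instruction-based figure reports twice the improvement of the work-based one. The only reason is that the protected system needs more instructions to do the same work, which is the kind of distortion a work-based metric is meant to avoid.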