Front Page

The Liberty Research Group

Parallelization Project

Fault-tolerant Typed Assembly Language [abstract] (ACM DL, PDF)
Frances Perry, Lester Mackey, George A. Reis, Jay Ligatti, David I. August, and David Walker
Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2007.
Accept Rate: 25% (45/178).
Winner Best Paper Award.

A transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. Although transient faults do not permanently damage the hardware, they may corrupt computations by altering stored values and signal transfers. In this paper, we propose a new scheme for provably safe and reliable computing in the presence of transient hardware faults. In our scheme, software computations are replicated to provide redundancy while special instructions compare the results of replicas to detect errors before writing critical data. In stark contrast to any previous efforts in this area, we have analyzed our fault tolerance scheme from a formal, theoretical perspective. First, we provide an operational semantics for our assembly language, which includes a precise formal denition of our fault model. Formulating such a model is a crucial step forward for the science of hardware fault tolerance as it pins down the assumptions being made about when and where faults may occur in a completely precise and transparent manner. Second, we develop an assembly-level type system designed to detect reliability problems in compiled code. On the one hand, this type system may be viewed as a theoretical tool for proving that code is reliable. On the other hand, it may be used as a practical tool for compiler debugging. In the latter case, the type system is strictly superior to conventional testing as it guarantees full coverage and perfect fault tolerance relative to the fault model. Third, we provide a formal specication for program fault tolerance under the given fault model and prove that all well-typed programs are indeed fault tolerant. In addition to the formal analysis, we evaluate the execution time of our detection scheme to quantify the performance penalty for sound fault tolerance.