Revisiting the Sequential Programming Model for Multi-Core
Matthew J. Bridges, Neil Vachharajani, Yun Zhang, Thomas B. Jablin, and David I. August
Proceedings of the 40th IEEE/ACM International Symposium on
Microarchitecture (MICRO), December 2007.
Selected for IEEE Micro's "Top Picks" special issue for papers "most relevant to industry and significant in contribution to the field of computer architecture" in 2007.
Single-threaded programming is already considered a complicated task.
The move to multi-threaded programming only increases the complexity
and cost of software development: legacy code must be rewritten,
programmers retrained, programs debugged more extensively, and race
conditions, deadlocks, and other pitfalls of parallel programming
avoided. To address these costs, other
approaches, such as automatic thread extraction, have been explored.
Unfortunately, the amount of parallelism that has been automatically
extracted is generally insufficient to keep many cores busy. This paper argues that this lack of parallelism is not an intrinsic
limitation of the sequential programming model, but rather occurs for
two reasons. First, there exists no framework for automatic thread
extraction that brings together key existing state-of-the-art compiler
and hardware techniques. This paper shows that such a framework can
yield scalable parallelization on several SPEC CINT2000 benchmarks.
Second, existing sequential programming languages force programmers to
define a single legal program outcome, rather than allowing for a
range of legal outcomes. This paper shows that natural, simple
extensions to the sequential programming model enable parallelization
for the remainder of the SPEC CINT2000 suite. Our experience
demonstrates that, after changing only 60 lines of source code, all of
the C benchmarks in the SPEC CINT2000 suite became parallelizable by
automatic thread extraction. This process, constrained by the limits of
modern optimizing compilers, yielded a speedup of 454% on these
applications.