The VELOCITY Compiler: Extracting Efficient Multicore Execution from Legacy Sequential Programs [abstract] (PDF)
Matthew John Bridges
Ph.D. Thesis, Department of Computer Science,
Princeton University, November 2008.
Multiprocessor systems, particularly chip multiprocessors, have
emerged as the predominant organization for future microprocessors.
Systems with 4, 8, and 16 cores are already shipping and a future with
32 or more cores is easily conceivable. Unfortunately, multiple cores
do not always directly improve application performance, particularly
for a single legacy application. Consequently, parallelizing
applications to execute on multiple cores is essential.
Parallel programming models and languages could be used to create
multi-threaded applications. However, moving to a parallel programming
model only increases the complexity and cost involved in software
development. Many automatic thread extraction techniques have been
explored to address these costs.
Unfortunately, the amount of parallelism that has been automatically
extracted using these techniques is generally insufficient to keep
many cores busy. Though there are many reasons for this, the main
problem is that extensions are needed to take full advantage of these
techniques. For example, many important loops are not parallelized
because the compiler lacks the necessary scope to apply the
optimization. Additionally, the sequential programming model forces
programmers to define a single legal application outcome, rather than
allowing for a range of legal outcomes, leading to conservative
dependences that prevent parallelization.
This dissertation integrates the necessary parallelization
techniques, extending them where necessary, to enable automatic
thread extraction. In particular, this includes an expanded
optimization scope, which facilitates the optimization of large loops,
leading to parallelism at higher levels in the
application. Additionally, this dissertation shows that many unnecessary
dependences can be broken with the help of the programmer using
natural, simple extensions to the sequential programming model.
Through a case study of several applications, including several C
benchmarks in the SPEC CINT 2000 suite, this dissertation shows how
scalable parallelism can be extracted. By changing only 38 source
code lines out of 100,000, several of the applications were
parallelizable by automatic thread extraction techniques, yielding a
speedup of 3.64x on 32 cores.