Global Instruction Scheduling for Multi-Threaded Architectures [abstract]
Guilherme de Lima Ottoni
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2008.
Recently, the microprocessor industry has moved toward multi-core or
chip multiprocessor (CMP) designs as a means of utilizing the
increasing transistor counts in the face of physical and
micro-architectural limitations. Despite this move, CMPs do not
directly improve the performance of single-threaded codes, which
still characterize most applications. In effect, the move to CMPs
has shifted the task of improving performance even further from the
hardware to the software.
Since developing parallel applications has long been recognized as
significantly harder than developing sequential ones, it is very
desirable to have automatic tools to extract thread-level parallelism
(TLP) from sequential applications. Unfortunately, automatic
parallelization has only been successful in the restricted domains of
scientific and data-parallel applications, which usually have regular
array-based memory accesses and little control flow. In order to
support parallelization of general-purpose applications, computer
architects have proposed CMPs with light-weight, fine-grained (scalar)
communication mechanisms. Despite such support, most existing
multi-threading compilation techniques have generally demonstrated
little effectiveness in extracting parallelism from non-scientific
applications. The main reason for this is that such techniques are
mostly restricted to extracting parallelism within local,
straight-line regions of code. As observed in several limit studies,
local regions of code contain limited parallelism, and control
dependence analysis and multiple control units are necessary to
extract most of the parallelism opportunities.
This thesis describes a general compiler framework for Global
Multi-Threaded (GMT) instruction scheduling, i.e., simultaneously
scheduling instructions from a global region of code to extract TLP for
multi-threaded architectures. Our compiler framework is based on a
Program Dependence Graph (PDG) representation, efficient graph
partitioning algorithms, and novel multi-threaded code generation
algorithms. Central to this framework are our multi-threaded code
generation algorithms, which produce efficient code for arbitrary
partitions of the PDG into threads. Based on this framework, three
thread-partitioning strategies for GMT instruction scheduling are
proposed. The first, called GREMIO, extends list scheduling to
target multi-threaded architectures and to operate on global code
regions. The second, called Decoupled Software Pipelining
(DSWP), extracts pipelined TLP from arbitrary loop nests. The third,
Parallel-Stage DSWP, extends DSWP to allow multiple threads to
concurrently execute the same stage of the pipeline. These
techniques are implemented in the VELOCITY compiler and evaluated on
an accurate CMP simulator built on top of validated Itanium 2 core
models. The experiments show that our techniques balance
applicability and scalability differently, with each achieving the
best speedup in different scenarios. Overall, the
results demonstrate the effectiveness of the proposed compilation
techniques, with significant speedups on a number of real benchmark
applications written in C.