Global Multi-Threaded Instruction Scheduling: Technique and
Initial Results [abstract]
Guilherme Ottoni and David I. August
Proceedings of the Sixth Workshop on Explicitly Parallel
Instruction Computer Architectures and Compiler Technology (EPIC), March 2007.
Recently, the microprocessor industry has hit hard
physical and micro-architectural limits that prevent continued
clock-rate increases, which had been the major source of
performance gains for decades. These impediments, in conjunction with
the still increasing transistor counts per chip, have driven all major
microprocessor manufacturers toward Chip Multiprocessor (CMP) designs.
Although CMPs are able to concurrently pursue multiple threads of
execution, they do not directly improve the performance of most
applications, which are written in sequential languages. In effect,
the move to CMPs has shifted the task of improving performance even
further from the hardware to the software. In order to support
more effective parallelization of sequential applications, computer
architects have proposed CMPs with light-weight communication
mechanisms. Despite such support, proposed multi-threaded scheduling
techniques have generally demonstrated little effectiveness in
extracting parallelism from general-purpose, sequential applications.
We call these techniques local multi-threaded scheduling
because they exploit parallelism only within straight-line
regions of code. A key observation of this paper is that local
multi-threaded techniques do not exploit the main feature of CMPs: the
ability to concurrently execute instructions from different
control-flow regions. In order to benefit from this powerful CMP
characteristic, it is necessary to perform global multi-threaded
scheduling, which simultaneously schedules instructions from
different basic blocks to enable their concurrent execution. This
paper presents algorithms to perform global scheduling for
communication-exposed multi-threaded architectures. By global
we mean that our technique simultaneously schedules instructions from
an arbitrary code region. Very encouraging preliminary results,
targeting a dual-core Itanium 2 model, are presented for selected
MediaBench and SPEC-CPU benchmark applications.
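The core idea, co-scheduling instructions from different control-flow regions onto separate cores that communicate through lightweight queues, can be illustrated with a minimal sketch. This is an illustration of the general concept, not the paper's algorithm: the thread split, the doubling "work", and the sentinel-terminated queue are all assumptions made for the example.

```python
import threading
import queue

def traverse(data, q):
    # Control-flow region A: a loop that produces values and forwards
    # them over an explicit communication channel (standing in for the
    # light-weight inter-core queues the abstract assumes).
    for x in data:
        q.put(x * 2)      # hypothetical work assigned to region A
    q.put(None)           # sentinel marking the end of the stream

def accumulate(q, result):
    # Control-flow region B: a separate loop, running concurrently,
    # that consumes the stream produced by region A.
    total = 0
    while True:
        x = q.get()
        if x is None:
            break
        total += x
    result.append(total)

q = queue.Queue()
result = []
t1 = threading.Thread(target=traverse, args=(range(5), q))
t2 = threading.Thread(target=accumulate, args=(q, result))
t1.start(); t2.start()
t1.join(); t2.join()
print(result[0])  # 0 + 2 + 4 + 6 + 8 = 20
```

A local (straight-line) scheduler could only overlap instructions within one of these loop bodies; the point of global scheduling is that the two loops, which belong to different control-flow regions, execute concurrently.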