Parallel-Stage Decoupled Software Pipelining [abstract] (ACM DL, PDF)
Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew Bridges, and David I. August
Proceedings of the 2008 International Symposium on Code Generation and Optimization (CGO), April 2008.
In recent years, the microprocessor industry has embraced chip
multiprocessors (CMPs), also known as multi-core architectures, as the
dominant design paradigm. For existing and new applications to make effective
use of CMPs, it is desirable that compilers automatically extract
thread-level parallelism from single-threaded applications. DOALL is
a popular automatic technique for loop-level parallelization employed
successfully in the domains of scientific and numeric computing. While
the scalability of DOALL is only limited by the number of iterations of the loop, its
applicability is limited by the presence of loop-carried
dependences. A parallelization technique with greater applicability is decoupled
software pipelining(DSWP), which parallelizes loops even in the
presence of loop-carried dependences. However, the scalability of DSWP
is limited by the size of the loop body and the number of recurrences
it contains.
This work proposes a novel non-speculative compiler
parallelization technique called parallel-stage decoupled software
pipelining (PS-DSWP). The goal of PS-DSWP is to combine the
applicability of DSWP with the scalability of DOALL parallelization. A
key insight of PS-DSWP is that, after isolating the recurrences in
their own stages in DSWP, portions of the loop suitable
for DOALL parallelization may be exposed. PS-DSWP extends DSWP to
benefit from these opportunities, utilizing multiple threads to
execute the same stage of a DSWPed loop in parallel. This paper
describes the PS-DSWP transformation in detail and discusses its
implementation in a research compiler. PS-DSWP produces an average
speedup of 114% (up to a maximum of 155%) with 6 threads on loops from
a set of 5 applications. Our experiments also
demonstrate that PS-DSWP achieves better scalability with the
number of threads than DSWP.