Hardware-Modulated Parallelism in Chip Multiprocessors
Julia Chen, Philo Juang, Kevin Ko, Gilberto Contreras, David Penry, Ram Rangan, Adam Stoler, Li-Shiuan Peh, and Margaret Martonosi
2005 Workshop on Design, Architecture and Simulation of Chip Multi-Processors (dasCMP), November 2005.
Chip multi-processors (CMPs) already have widespread
commercial availability, and technology roadmaps project enough
on-chip transistors to replicate tens or hundreds of current
processor cores. How will we express parallelism, partition
applications, and schedule/place/migrate threads on these
highly-parallel CMPs? This paper presents and evaluates a new approach to
highly-parallel CMPs, advocating a new hardware-software contract.
The software layer is encouraged to expose large amounts of
multi-granular, heterogeneous parallelism. The hardware, meanwhile,
is designed to offer low-overhead, low-area support for orchestrating
and modulating this parallelism on CMPs at runtime. Specifically, our
proposed CMP architecture consists of architectural and ISA support
targeting thread creation, scheduling and context-switching, designed
to facilitate effective hardware run-time mapping of threads to cores
at low overheads. Dynamic modulation of parallelism provides the ability to respond
to run-time variability that arises from dataset changes, memory
system effects and power spikes and lulls, to name a few. It also
naturally provides a long-term CMP platform with performance
portability and tolerance to frequency and reliability variations
across multiple CMP generations. Our simulations of a range of
applications possessing do-all, streaming and recursive parallelism
show speedups of 4-11.5X and energy-delay-product savings of 3.8X, on
average, on a 16-core vs. a 1-core system. This is achieved with
modest amounts of hardware support that allows for low overheads in
thread creation, scheduling, and context-switching. In particular,
our simulations motivated the need for hardware support, showing that
the large thread management overheads of current run-time software
systems can lead to up to 6.5X slowdown. Our simulations also
illustrate the difficulty of static scheduling: a static scheduling
algorithm fed with oracle-profiled inputs suffers up to 107% slowdown
compared to NDP's hardware scheduler, because it cannot handle
memory system variability. More broadly, we
feel that the ideas presented here show promise for scaling to the
systems expected in ten years, where the advantages of high transistor
counts may be dampened by the difficulties posed by circuit variations.