The Acceleration of Structural Microarchitectural Simulation via Scheduling
David A. Penry
Ph.D. Thesis, Department of Computer Science,
Princeton University, November 2006.
Microarchitects rely upon simulation to evaluate design alternatives,
yet constructing an accurate simulator by hand is a difficult and
time-consuming process because simulators are usually written in
sequential languages while the system being modeled is concurrent.
Structural modeling can mitigate this difficulty by allowing the
microarchitect to specify the simulation model in a concurrent,
structural form; a simulator compiler then generates a simulator from
the model. However, the resulting simulators are generally slower
than those produced by hand. The thesis of this dissertation is that
simulation speed improvements can be obtained by careful scheduling of
the work to be performed by the simulator onto single or multiple
processors.
For scheduling onto single processors, this dissertation presents an
evaluation of previously proposed scheduling mechanisms in the context
of a structural microarchitectural simulation framework which uses a
particular model of computation, the Heterogeneous Synchronous
Reactive (HSR) model, and improvements to these mechanisms which make
them more effective or more feasible for microarchitectural models. A
static scheduling technique known as partitioned scheduling is shown
to offer the most performance improvement: a speedup of up to 2.08. This
work furthermore proves that the Discrete Event model of
computation can be statically scheduled using partitioned scheduling
when restricted in ways that are commonly assumed in
microarchitectural simulation.
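
As a rough illustration of what static scheduling means here, the sketch below fixes the order in which a simulator invokes its blocks once, before simulation begins, instead of deciding it at run time. It is a generic topological ordering over a dependency graph, not the partitioned scheduling technique evaluated in the dissertation. The block names, the `deps` graph, and the functions `static_schedule` and `simulate` are all hypothetical, and the sketch assumes the graph is acyclic, which real models may not satisfy.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical model: each block maps to the set of blocks whose outputs
# it reads within the same simulated cycle.
deps = {
    "fetch": set(),
    "decode": {"fetch"},
    "rename": {"decode"},
    "issue": {"rename"},
    "execute": {"issue"},
}


def static_schedule(dependencies):
    """Compute a fixed invocation order once, before simulation starts.

    Raises graphlib.CycleError if the dependency graph contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


def simulate(schedule, cycles):
    """Invoke every block once per cycle, always in the precomputed order."""
    trace = []
    for cycle in range(cycles):
        for block in schedule:
            # A real simulator would evaluate the block here; this sketch
            # only records the invocation.
            trace.append((cycle, block))
    return trace


schedule = static_schedule(deps)
trace = simulate(schedule, cycles=2)
```

Because the order is computed ahead of time, the per-cycle loop carries no scheduling overhead; a dynamic scheduler would instead decide at run time which block to invoke next.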
For scheduling onto multiple processors, this dissertation presents
the first automatic parallelization of simulators using the HSR model
of computation. It shows that effective parallelization requires
techniques to avoid waiting due to locks and to improve cache
locality. Two novel heuristics for lock mitigation and two for cache
locality improvement are introduced and evaluated on three different
parallel systems. The combination of lock mitigation and locality
improvement is shown to allow superlinear speedup for some models: a
speedup of up to 7.56 on four processors.
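
To make "superlinear" concrete, the arithmetic below assumes the conventional definitions of speedup and parallel efficiency, which the abstract does not state explicitly. The symbols $T_1$ (run time on one processor), $T_p$ (run time on $p$ processors), $S$, and $E$ are introduced here for illustration only.

```latex
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}

% Reported best case on four processors:
E(4) = \frac{7.56}{4} = 1.89 > 1
```

Speedup is superlinear when $S(p) > p$, equivalently $E(p) > 1$; under these assumed definitions the reported figure is 1.89 times what ideal linear scaling would give on four processors.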