Support for High-Frequency Streaming in CMPs
Ram Rangan, Neil Vachharajani, Adam Stoler, Guilherme Ottoni, David I. August, and George Z. N. Cai
Proceedings of the 39th IEEE/ACM International Symposium on
Microarchitecture (MICRO), December 2006.
As the industry moves toward larger-scale chip multiprocessors, the
need to parallelize applications grows. High inter-thread
communication delays, exacerbated by over-stressed high-latency memory
subsystems and ever-increasing wire delays, require parallelization
techniques to create partially or fully independent threads to improve
performance. Unfortunately, developers and compilers alike often fail
to find sufficient independent work of this kind. Recently proposed
pipelined streaming techniques have shown
significant promise for both manual and automatic parallelization.
These techniques have wide-scale applicability because they embrace
inter-thread dependences (albeit acyclic dependences) and tolerate
long-latency communication of these dependences. This paper addresses
the lack of architectural support for this type of concurrency, which
has blocked its adoption and hindered related language and compiler
research. We observe that both manual and automatic techniques create
high-frequency streaming threads, with communication occurring
every 5 to 20 instructions. Even while easily tolerating
inter-thread transit delays, this high-frequency communication
makes thread performance very sensitive to intra-thread delays
from the repeated execution of the communication operations. Using
this observation, we define the design space and evaluate several
mechanisms to find a better trade-off between performance and
operating system, hardware, and design costs. From this, we find a
light-weight streaming-aware enhancement to conventional memory
subsystems that doubles the speed of these codes and is within 2% of
the best-performing, but heavy-weight, hardware solution.
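
To make the communication pattern described in the abstract concrete, the following is a minimal software sketch of a two-stage pipelined streaming computation. It is an illustration only, not the mechanism evaluated in the paper: the names producer, consumer, and run_pipeline, the queue capacity, and the squared-sum workload are all assumptions chosen for the example. Dependences flow in one direction (stage 1 to stage 2), so they are acyclic, and each element costs one produce and one consume operation, which is why the per-operation overhead inside each thread matters when only a few instructions of work separate successive communications.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker


def producer(items, q):
    # Stage 1: walk the input and stream each element to the next stage.
    for item in items:
        q.put(item)  # one "produce" operation per element
    q.put(SENTINEL)


def consumer(q, out):
    # Stage 2: receive each element and do a small amount of work on it.
    total = 0
    while True:
        item = q.get()  # one "consume" operation per element
        if item is SENTINEL:
            break
        total += item * item  # only a few instructions between communications
    out.append(total)


def run_pipeline(items, capacity=32):
    # The bounded queue decouples the stages: the producer can run ahead
    # by up to `capacity` elements, which hides transit delay between threads.
    q = queue.Queue(maxsize=capacity)
    out = []
    t1 = threading.Thread(target=producer, args=(items, q))
    t2 = threading.Thread(target=consumer, args=(q, out))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return out[0]
```

In this sketch the queue is a general-purpose, lock-based software structure, so each put and get is far more expensive than the work it separates. That is the intra-thread cost the abstract identifies as the bottleneck for high-frequency streaming.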