AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers
Grant Ayers, Nayana P. Nagendra, David I. August, Hyoun K. Cho, Svilen Kanev, Christos Kozyrakis, Trivikram Krishnamurthy, Heiner Litz, Tipp Moseley, and Parthasarathy Ranganathan
Proceedings of the 46th International Symposium on Computer Architecture (ISCA), June 2019.
Selected in 2019 for IEEE Micro's "Top Picks" special issue, "based on novelty and potential for long-term impact in the field of computer architecture."
The large instruction working sets of private and public cloud
workloads lead to frequent instruction cache misses and costs
in the millions of dollars. While prior work has identified
the growing importance of this problem, to date, there has
been very little analysis of where the misses come from,
and what opportunities exist to reduce them. To address
this challenge, this paper makes three contributions. First,
we present the design and deployment of a new, always-on,
fleet-wide monitoring system, AsmDB, that tracks front-end
bottlenecks. AsmDB uses hardware support to collect bursty
execution traces, fleet-wide temporal and spatial sampling,
and sophisticated offline post-processing to construct full-
program dynamic control-flow graphs. Second, based on a
longitudinal analysis of AsmDB data from real-world online
services, we present two detailed insights into the sources
of front-end stalls: (1) cold code that is brought in along
with hot code leads to significant cache fragmentation and
a corresponding large number of instruction cache misses,
both at the function and cache line levels; (2) distant branches
and calls that are not amenable to traditional cache locality
or next-line prefetch strategies account for a large fraction
of cache misses. Third, we prototype two optimizations
that target these insights. For the first insight, we focus on
memcmp, one of the hottest functions contributing to cache
misses, and show how fine-grained layout optimizations lead
to significant benefits. For the second insight, we propose
new hardware support for software code prefetching and prototype
a new feedback-directed compiler optimization that
combines static program flow analysis with dynamic miss
profiles to demonstrate significant benefits on several large
warehouse-scale workloads. Improving upon prior work, our
proposal avoids invasive hardware modifications by prefetching
via software in an efficient and scalable way. Simulation
results show that such an approach can eliminate up to 96%
of instruction cache misses with negligible overheads.
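
To make the second insight and the resulting optimization more concrete, the sketch below (in Python) illustrates one way dynamic miss profiles could be combined with a sampled control-flow graph to choose injection sites for software code prefetches. This is not the paper's implementation: the function names (build_cfg, pick_injection_sites), the fixed lookahead distance, and the miss threshold are assumptions made purely for illustration.

    # Minimal sketch of feedback-directed prefetch-site selection:
    # (1) aggregate sampled branch traces into a weighted dynamic CFG,
    # (2) for blocks whose code misses often in the instruction cache,
    #     walk a few edges "upstream" along the hottest path and propose
    #     that earlier block as the place to insert a code prefetch.
    # All names and thresholds here are illustrative, not AsmDB's API.
    from collections import defaultdict

    def build_cfg(trace_edges):
        """trace_edges: iterable of (src_block, dst_block) pairs taken from
        sampled execution bursts. Returns {dst: {src: weight}}."""
        preds = defaultdict(lambda: defaultdict(int))
        for src, dst in trace_edges:
            preds[dst][src] += 1
        return preds

    def pick_injection_sites(preds, miss_counts, distance=2, min_misses=100):
        """For each block with enough instruction-cache misses, follow the
        dominant incoming edge `distance` steps backward and propose that
        block as the injection site for a prefetch of the missing target."""
        sites = []
        for target, misses in miss_counts.items():
            if misses < min_misses:
                continue
            block = target
            for _ in range(distance):
                if block not in preds or not preds[block]:
                    break
                # Follow the highest-weight (hottest) predecessor edge.
                block = max(preds[block], key=preds[block].get)
            if block != target:
                sites.append((block, target, misses))
        return sites

    if __name__ == "__main__":
        edges = [(0x10, 0x20), (0x20, 0x30), (0x10, 0x20), (0x20, 0x30),
                 (0x15, 0x20), (0x30, 0x40)]
        misses = {0x30: 500}  # block 0x30 frequently misses in the i-cache
        cfg = build_cfg(edges)
        # -> [(16, 48, 500)], i.e. prefetch block 0x30 from block 0x10
        print(pick_injection_sites(cfg, misses))

In the optimization the abstract describes, an analysis of this flavor would run over the whole-program dynamic control-flow graphs that AsmDB constructs, with the lookahead chosen to cover miss latency, and the selected sites would be realized as compiler-inserted software code prefetches backed by the proposed hardware support.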