In this paper, we address the severe performance gap caused by high processor clock rates and slow DRAM accesses. We show that, even with an aggressive, next-generation memory system using four Direct Rambus channels and an integrated one-megabyte level-two cache, a processor still spends over half its time stalling for L2 misses. Our experimental analysis begins with an effort to tune our baseline memory system aggressively: incorporating optimizations to reduce DRAM row buffer misses, reordering miss accesses to reduce queuing delay, and adjusting the L2 block size to match each channel organization. We show that there is a large gap between the block sizes at which performance is best and at which miss rate is minimized. Using those results, we evaluate a hardware prefetch unit integrated with the L2 cache and memory controllers. By issuing prefetches only when the Rambus channels are idle, prioritizing them to maximize DRAM row buffer hits, and giving them low replacement priority, we achieve a 65 percent speedup across 10 of the 26 SPEC2000 benchmarks, without degrading the performance of the others. With eight Rambus channels, these 10 benchmarks improve to within 10 percent of the performance of a perfect L2 cache.
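The three prefetch policies named above (issue only when the channel is idle, prefer candidates that hit the open DRAM row, and insert prefetched blocks at low replacement priority) can be illustrated with a minimal sketch. This is not the paper's implementation: the `Channel` model, the `pick_prefetch` and `insert_block` helpers, the single open row per channel, and the 2048-byte row size used in the usage lines are all hypothetical simplifications chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Channel:
    """Hypothetical model of one DRAM channel with a single open row."""
    open_row: Optional[int] = None
    demand_queue: deque = field(default_factory=deque)  # pending demand misses

    def is_idle(self) -> bool:
        # Assumption: "idle" means no demand miss is waiting on this channel.
        return not self.demand_queue


def pick_prefetch(channel: Channel,
                  candidates: List[int],
                  row_of: Callable[[int], int]) -> Optional[int]:
    """Return the prefetch address to issue now, or None if none should issue."""
    # Policy 1: never compete with demand misses for channel bandwidth.
    if not channel.is_idle() or not candidates:
        return None
    # Policy 2: prefer a candidate that falls in the currently open row.
    for addr in candidates:
        if row_of(addr) == channel.open_row:
            return addr
    # Otherwise fall back to the oldest candidate.
    return candidates[0]


def insert_block(lru_set: List[int], addr: int,
                 is_prefetch: bool, ways: int) -> None:
    """Insert addr into one cache set, ordered from MRU (index 0) to LRU (last)."""
    if addr in lru_set:
        if not is_prefetch:          # a demand hit promotes the block to MRU
            lru_set.remove(addr)
            lru_set.insert(0, addr)
        return
    if len(lru_set) >= ways:
        lru_set.pop()                # evict the LRU block
    if is_prefetch:
        lru_set.append(addr)         # Policy 3: prefetches enter at LRU
    else:
        lru_set.insert(0, addr)      # demand fills enter at MRU


# Usage, assuming a hypothetical 2048-byte DRAM row:
row_of = lambda a: a // 2048
ch = Channel(open_row=3)
chosen = pick_prefetch(ch, [100, 6208], row_of)  # 6208 lies in row 3
```

Placing prefetched blocks at the LRU position means an inaccurate prefetch is the first block evicted, which is one way to keep prefetching from degrading benchmarks that do not benefit from it.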