Abstract
Workload characterization is a key prerequisite for designing the last-level cache (LLC) of a chip multiprocessor. This paper analyzes the memory behavior of a set of memory-intensive, multi-threaded RMS (recognition, mining, and synthesis) workloads for future multi-core processors, including their working set sizes, data sharing behavior, and spatial data locality. The analysis shows that these workloads have large working sets, a significant amount of data sharing, and strong strided access patterns. Based on these observations, the paper explores the LLC design space and discusses architectural choices for future multi-core cache designs. The experimental results show that (1) large DRAM caches help satisfy the capacity demand created by the large working sets: a 128 MB DRAM cache reduces the L1 miss penalty by 18% on average compared with no DRAM cache; (2) a shared LLC outperforms a private one: an 8 MB shared cache improves cache performance by 25% over private caches of the same total capacity; and (3) a stride-based hardware prefetching mechanism improves performance by 25%. Consequently, the paper recommends a cache subsystem for memory-intensive RMS workloads consisting of a 128 MB DRAM cache, an 8 MB on-die shared SRAM cache, and an 8-entry stride prefetcher.
Source
Journal of Tsinghua University (Science and Technology)
EI
CAS
CSCD
Peking University Core Journals
2011, No. 8, pp. 1055-1062, 1071 (9 pages)
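Funding
National Natural Science Foundation of China (60573100, 60773149)
National High-Tech Research and Development Program of China ("863" Program) (2008AA01Z108)
National Basic Research Program of China ("973" Program) (2007CB310900)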
Keywords
chip multiprocessor
on-chip cache
workload characterization
memory performance
RMS workload