Quantitative analysis of the emerging multi-core workload memory behavior


Abstract: Workload characterization is a key preliminary step in designing last-level caches (LLCs) for chip multiprocessors. This paper analyzes the memory behavior of a set of multi-threaded RMS (recognition, mining, and synthesis) workloads for future multi-core processors, covering working set size, data sharing behavior, and spatial data locality. The analysis shows that these workloads are memory intensive, with large working sets, a significant amount of data sharing, and strongly strided access patterns. The LLC design space is then explored for these workloads, and architectural choices for future multi-core cache designs are discussed based on the observations. The experimental results show that a large DRAM cache helps meet the capacity demand created by the large working sets: a 128 MB DRAM cache reduces the average L1 miss penalty by 18% compared with no DRAM cache. A shared LLC performs better than a private one: an 8 MB shared cache improves cache performance by 25% over private caches of the same total capacity. Stride-based hardware prefetching improves performance by 25%. Consequently, a memory hierarchy with a 128 MB DRAM cache, an 8 MB on-die shared SRAM cache, and an 8-entry stride prefetcher is recommended for memory-intensive RMS workloads.
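As a rough illustration of the stride-based hardware prefetching named in the abstract, the Python sketch below models a small prefetch table in software. The abstract states only the table size (8 entries). The indexing by instruction address (`pc`), the LRU replacement, the rule that a stride must repeat once before a prefetch is issued, and the names `StridePrefetcher` and `access` are all assumptions made for illustration, not the authors' design.

```python
from collections import OrderedDict


class StridePrefetcher:
    """Toy stride prefetcher: a small LRU table indexed by instruction address.

    Everything except the default table size (8 entries) is an assumption.
    """

    def __init__(self, entries=8):
        self.entries = entries
        # Maps pc -> (last address seen, last stride seen).
        self.table = OrderedDict()

    def access(self, pc, addr):
        """Record one memory access; return an address to prefetch, or None."""
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table.pop(pc)
            stride = addr - last_addr
            # Issue a prefetch only when the same non-zero stride repeats.
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride
            # Re-inserting moves this entry to the most recently used slot.
            self.table[pc] = (addr, stride)
        else:
            if len(self.table) >= self.entries:
                self.table.popitem(last=False)  # evict least recently used
            self.table[pc] = (addr, 0)
        return prefetch


if __name__ == "__main__":
    pf = StridePrefetcher()
    # One load instruction walking an array in 64-byte steps.
    for address in (0, 64, 128, 192):
        print(address, pf.access(0x400, address))
```

In this sketch the first two accesses only train the table entry; from the third access onward the prefetcher proposes the next address in the stream. A real hardware design would also need to decide prefetch degree and distance, which the abstract does not specify.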

Authors: 林隽民, 陈彧, 李文龙, 乔林, 汤志忠

Affiliations: Department of Computer Science and Technology, Tsinghua University; Intel China Research Center

Source: Journal of Tsinghua University (Science and Technology), 2011, No. 8, pp. 1055-1062, 1071 (9 pages in total). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.

Funding: National Natural Science Foundation of China (60573100, 60773149); National "863" High-Tech Program (2008AA01Z108); National "973" Key Basic Research Program (2007CB310900)

Keywords: chip multiprocessor; on-chip cache; workload characterization; memory performance; RMS workload

Classification: TP393.03 [Automation and Computer Technology: Computer Application Technology]


