
Latency-aware thread scheduling scheme for thread-level speculation
Abstract With the development of large-scale chip multiprocessors (CMPs), more and more cores are integrated on a single chip. On the one hand, some cores are always idle; on the other hand, power constraints keep each core relatively simple, so single-thread performance is weak. Supporting thread-level speculation (TLS) on a CMP allows idle on-chip resources to be used to accelerate sequential programs and improve single-thread performance. However, the extra overheads that determine TLS performance, such as increased cache miss rates, conflict detection, thread commit, and re-execution of squashed speculative threads, are very sensitive to memory access latency and inter-core communication latency, and the distributed design of large-scale CMPs, such as the non-uniform cache architecture (NUCA), makes these latencies non-uniform. Conventional multithread scheduling algorithms perform poorly for TLS because they ignore these TLS-specific characteristics. The proposed latency-aware scheduling algorithm uses memory access statistics gathered during profiling and compilation, together with run-time memory access records, to calculate the program's center of data gravity (CDG), and then gradually schedules speculative threads onto a few neighboring cores around the CDG. During scheduling it also exploits the data left in the caches by committed and squashed threads, which improves cache utilization. Experimental results show that, under TLS execution, the latency-aware scheduling strategy achieves an average performance improvement of 16.8% over the widely used priority scheduling strategy, and 10.1% over a recently proposed clustered thread scheduling strategy based on non-uniform data access optimization.
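The abstract does not give the formula for the center of data gravity or the exact placement rule, so the following is only a minimal sketch of the idea under stated assumptions: cores and cache banks sit on a 2D mesh, the CDG is taken as the access-weighted centroid of the banks a program touches, and speculative threads go to the idle cores closest to it by Manhattan distance. The function names `center_of_data_gravity` and `pick_cores`, the mesh model, and the distance metric are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a tile on an assumed 2D mesh


def center_of_data_gravity(access_counts: Dict[Coord, int]) -> Tuple[float, float]:
    """Access-weighted centroid of the cache banks a program touches.

    Assumption: the CDG is modeled as a weighted mean of bank coordinates;
    the paper's actual definition may differ.
    """
    total = sum(access_counts.values())
    if total == 0:
        raise ValueError("no memory accesses recorded")
    x = sum(cx * n for (cx, _), n in access_counts.items()) / total
    y = sum(cy * n for (_, cy), n in access_counts.items()) / total
    return x, y


def pick_cores(cdg: Tuple[float, float], idle_cores: List[Coord], k: int) -> List[Coord]:
    """Return the k idle cores nearest to the CDG (Manhattan distance).

    Ties are broken by coordinate order so the result is deterministic.
    """
    def dist(core: Coord) -> float:
        return abs(core[0] - cdg[0]) + abs(core[1] - cdg[1])

    return sorted(idle_cores, key=lambda c: (dist(c), c))[:k]


if __name__ == "__main__":
    # Hypothetical profile: most accesses hit banks near the mesh corner (3, 3).
    profile = {(0, 0): 10, (3, 3): 30, (3, 2): 60}
    cdg = center_of_data_gravity(profile)
    idle = [(x, y) for x in range(4) for y in range(4)]
    print(cdg, pick_cores(cdg, idle, 4))
```

In this sketch the run-time access records mentioned in the abstract would simply update `access_counts` between scheduling decisions, so the CDG and the chosen cores can shift gradually as execution proceeds.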
Source Computer Engineering & Science, 2013, No. 11, pp. 14-21 (8 pages); indexed in CSCD and the Peking University Core Journals list
Funding National 863 Program of China (2013AA01A215); Ministry of Education-Intel Special Research Fund for Information Technology (MOE-INTEL-11-04)
Keywords latency; chip multiprocessors; thread-level speculation; thread scheduling