Extension to OpenMP Task Scheduling Mechanism for DSWP Parallelization and its Implementation (面向DSWP并行的OpenMP任务调度机制的扩展与实现)

Cited by: 2

Abstract: Multicore processors improve the performance of multithreaded programs, but the many existing single-threaded programs cannot benefit from them, and programmers are still accustomed to writing single-threaded code. Automatic parallelization is an important means of migrating single-threaded programs to multicore platforms. However, when a loop contains data dependences that cannot be determined or complex control flow (for example, recursive data structures and general pointer accesses), traditional automatic parallelization techniques perform poorly. Ottoni et al. proposed the Decoupled Software Pipelining (DSWP) algorithm, which exploits fine-grained pipeline parallelism at the instruction level for loops that traditional automatic parallelization fails to handle. DSWP, however, requires detailed knowledge of the processor microarchitecture and hardware support for an inter-core communication queue and two special instructions, which limits both its parallel performance and its applicability. A DSWP implementation built on the OpenMP application programming interface increases the parallel granularity, does not depend on such hardware support, and is not tied to a particular platform, but the existing OpenMP task scheduling mechanism cannot meet the task scheduling requirements of DSWP. This paper extends the OpenMP task scheduling mechanism with a new binding clause for the task construct that binds a task to a thread, which guarantees the correct execution of OpenMP-based DSWP programs. The clause is implemented in libgomp, the GCC OpenMP runtime library, and the extended GCC serves as the base compiler for OpenMP DSWP programs, providing support for automatic parallelization. Experiments on the NPB 3.3.1 benchmark suite show that loops that traditional automatic parallelization fails to parallelize reach an average speedup of 1.23 or more on a dual-core processor after OpenMP DSWP parallelization. Compared with programs produced by the Intel compiler and the Open64 compiler using only traditional automatic parallelization, programs generated by the Open64 compiler extended with the OpenMP DSWP algorithm achieve average speedups that are 22% and 26% higher, respectively.
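
As a rough illustration of the pipeline structure the abstract describes, the sketch below splits a pointer-chasing loop into two decoupled stages that communicate through a bounded queue. It is not the paper's implementation and does not use OpenMP or libgomp: the names `Node`, `traverse_stage`, `work_stage`, and `dswp_pipeline` are hypothetical, and Python threads are used only to show the shape of the decomposition (CPython's global interpreter lock means no real speedup should be expected). Each stage stays on one thread for its whole lifetime, which is the property that a task-to-thread binding is meant to guarantee.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream between stages


class Node:
    """Singly linked list node (hypothetical example data structure)."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def build_list(values):
    """Build a linked list whose traversal order matches `values`."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def traverse_stage(head, out_q):
    """Stage 1: the loop-carried recurrence (node = node.next)."""
    node = head
    while node is not None:
        out_q.put(node.value)  # blocks when the queue is full
        node = node.next
    out_q.put(SENTINEL)


def work_stage(in_q, results):
    """Stage 2: per-node work, consumed in the order stage 1 produced it."""
    while True:
        item = in_q.get()  # blocks when the queue is empty
        if item is SENTINEL:
            break
        results.append(item * item)


def dswp_pipeline(values, queue_size=8):
    """Run both stages, each on its own dedicated thread."""
    q = queue.Queue(maxsize=queue_size)
    results = []
    head = build_list(values)
    t1 = threading.Thread(target=traverse_stage, args=(head, q))
    t2 = threading.Thread(target=work_stage, args=(q, results))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results
```

Data flows in one direction only, from the traversal stage to the work stage, so the first stage can run ahead of the second by up to `queue_size` items. If a stage were moved between threads mid-loop, its private loop state and the ordering of queue operations would have to move with it, which is one way to see why a scheduler that freely reassigns tasks is a poor fit for this pattern.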

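Authors: Liu Xiaoxian (刘晓娴), Zhao Rongcai (赵荣彩), Ding Rui (丁锐)

Affiliation: PLA Information Engineering University (解放军信息工程大学)

Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2013, No. 9, pp. 38-43 (6 pages)

Funding: Supported by the National Major Project on Core Electronic Devices, High-end Generic Chips and Basic Software ("核高基") (2009ZX01036-001-001-2)

Keywords: automatic parallelization; OpenMP; decoupled software pipelining (DSWP); task scheduling mechanism; GCC

Classification: TP314 [Automation and Computer Technology: Computer Software and Theory]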


References (12)

1. Benoit A, Melhem R, Renaud-Goud P, et al. Power-aware Manhattan routing on chip multiprocessors [C] // Proceedings of 26th International Parallel and Distributed Processing Symposium. Shanghai, 2012: 189-200.
2. Jin Hao-qiang, Jespersen D, Mehrotra P, et al. High performance computing using MPI and OpenMP on multi-core parallel systems [J]. Parallel Computing, 2011, 37(9): 562-575.
3. Ding Rui, Zhao Rongcai, Han Lin. Automatic computation and data partitioning algorithm based on dominant value [J]. Computer Science (计算机科学), 2012, 39(3): 290-294. Cited by: 5
4. Allen R, Kennedy K. Optimizing compilers for modern architectures: a dependence-based approach [M]. California: Morgan Kaufmann Publishers, 2001: 63-68.
5. Lin Yu-te, Wang Shao-chung, Shih Wen-li, et al. Enable OpenCL compiler with Open64 infrastructures [C] // Proceedings of 13th IEEE International Conference on High Performance Computing and Communications. Alberta, 2011: 863-868.
6. Gerber R, Smith K B, Bik A J C, et al. The software optimization cookbook: high-performance recipes for IA-32 platforms (2nd ed) [M]. Hillsboro: Intel Press, 2006: 13-27.
7. Ottoni G, Rangan R, Stoler A, et al. Automatic thread extraction with decoupled software pipelining [C] // Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, 2005: 105-118.
8. August D I, Connors D A, Mahlke S A, et al. Integrated predication and speculative execution in the IMPACT EPIC architecture [C] // Proceedings of the 25th International Symposium on Computer Architecture. Barcelona, 1998: 227-237.
9. Fu Hongyi, Ding Yan, Song Wei, Yang Xuejun. An OpenMP fault-tolerance mechanism implemented with parallel recomputation [J]. Journal of Software (软件学报), 2012, 23(2): 411-427. Cited by: 7
10. Thoman P, Jordan H, Pellegrini S, et al. Automatic OpenMP loop scheduling: a combined compiler and runtime approach [C] // Proceedings of 8th International Workshop on OpenMP. Rome, 2012: 88-101.

