
A Hierarchical-Segment Reduction Algorithm for Nehalem Systems in Threaded MPI
Abstract: A Hierarchical-Segment Reduction Algorithm (HSRA) is proposed for long-message reduction on Nehalem platforms in a threaded MPI environment. Taking the architectural characteristics of Nehalem systems into account, HSRA carries out the intra-node reduction in two steps, an intra-processor reduction followed by an inter-processor reduction, which distributes the computational load evenly while requiring only a small number of remote memory accesses. HSRA is first designed and implemented within the reduction-algorithm framework of MPIActor and its cost is analyzed from the memory-access perspective; it is then compared with a single-level segmented algorithm and three existing shared-memory intra-node reduction algorithms; finally, it is validated on a real system with the Intel MPI Benchmark (IMB). The experimental results show that HSRA is an efficient algorithm for long-message intra-node reduction on Nehalem systems.
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, Peking University Core Journal, 2012, Issue 4, pp. 733-738 (6 pages).
Funding: Supported by the Major Project of the Fujian Provincial Department of Science and Technology (2010H6019) and the Putian Municipal Science and Technology Program of Fujian Province (2010G09).
Keywords: hierarchical-segment reduction algorithm; MPI; HSRA; Nehalem; MPI_Reduce; MPI_Allreduce
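
To make the two-step scheme described in the abstract concrete, the sketch below illustrates a hierarchical, segmented intra-node summation in plain C with OpenMP. It is not the paper's MPIActor implementation: the two-socket layout (NUM_SOCKETS, THREADS_PER_SOCKET), the compact thread-to-socket pinning, the buffer names (in, mid, out), the message length MSG_LEN, and the use of OpenMP threads in place of threaded-MPI ranks are all illustrative assumptions.

/* Minimal sketch (assumptions noted above), not the paper's MPIActor code:
 * a two-step, segmented intra-node sum reduction on an assumed
 * 2-socket node with 4 threads pinned to each socket. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NUM_SOCKETS        2          /* assumed 2-socket Nehalem-like node */
#define THREADS_PER_SOCKET 4          /* assumed threads pinned per socket  */
#define NTHREADS           (NUM_SOCKETS * THREADS_PER_SOCKET)
#define MSG_LEN            (1 << 20)  /* doubles per rank: a long message   */
#define SEG                (MSG_LEN / THREADS_PER_SOCKET)

int main(void)
{
    /* each of the NTHREADS "ranks" contributes one MSG_LEN buffer */
    double *in  = malloc((size_t)NTHREADS * MSG_LEN * sizeof(double));
    double *mid = calloc((size_t)NUM_SOCKETS * MSG_LEN, sizeof(double));
    double *out = calloc(MSG_LEN, sizeof(double));
    if (!in || !mid || !out) return 1;

    for (int t = 0; t < NTHREADS; t++)            /* rank t contributes t+1 */
        for (int i = 0; i < MSG_LEN; i++)
            in[(size_t)t * MSG_LEN + i] = t + 1.0;

    omp_set_num_threads(NTHREADS);
    #pragma omp parallel
    {
        int tid  = omp_get_thread_num();
        int sock = tid / THREADS_PER_SOCKET;      /* assumed compact pinning  */
        int lane = tid % THREADS_PER_SOCKET;      /* segment owned by thread  */

        /* Step 1: intra-processor reduction.  Each thread sums its own
         * segment of every buffer belonging to its socket into that
         * socket's partial buffer, touching only socket-local memory.
         * Distinct lanes own distinct segments, so there are no races. */
        for (int m = 0; m < THREADS_PER_SOCKET; m++) {
            const double *src =
                in + (size_t)(sock * THREADS_PER_SOCKET + m) * MSG_LEN;
            for (int i = lane * SEG; i < (lane + 1) * SEG; i++)
                mid[(size_t)sock * MSG_LEN + i] += src[i];
        }
        #pragma omp barrier                       /* all partials complete */

        /* Step 2: inter-processor reduction.  One owner per segment
         * (here, the threads of socket 0) combines the per-socket
         * partials, so each element crosses the inter-socket link once. */
        if (sock == 0)
            for (int i = lane * SEG; i < (lane + 1) * SEG; i++) {
                double v = 0.0;
                for (int s = 0; s < NUM_SOCKETS; s++)
                    v += mid[(size_t)s * MSG_LEN + i];
                out[i] = v;
            }
    }

    /* every element should equal 1 + 2 + ... + NTHREADS */
    printf("out[0] = %.1f (expected %.1f)\n",
           out[0], NTHREADS * (NTHREADS + 1) / 2.0);
    free(in); free(mid); free(out);
    return 0;
}

In this sketch each thread owns the same segment in both steps, so every element of a per-socket partial buffer is written by exactly one thread and crosses the inter-socket (QPI) link only once in step 2; this illustrates, under the stated assumptions, the load-balancing and reduced remote-memory-access property the abstract attributes to HSRA.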