
VTFTR: Deadlock-Free Fault-Tolerant Routing Algorithm in k-Dimension Fat-Tree (cited by: 1)
Abstract As high-performance computing systems have grown rapidly in scale in recent years, the reliability of high-performance interconnection networks has become an increasingly important problem. The k-dimension fat-tree is a network topology that combines the advantages of the fat-tree and the multidimensional torus; its good scalability and network performance give it broad application prospects in the exascale era. However, research on fault-tolerant routing algorithms for k-dimension fat-trees remains limited, and the reliability of this topology still needs to be addressed. To improve the fault tolerance of the k-dimension fat-tree topology in high-performance interconnection networks, and thereby the operating efficiency of the supercomputing systems built on it, this paper proposes Virtual Turning Fault-Tolerant Routing (VTFTR), a fault-tolerant routing algorithm for leaf switch faults in k-dimension fat-trees. The algorithm combines the turn model with virtual channel switching: by strictly controlling the turns that packets take on fault-free and fault-tolerant paths, it achieves deadlock-free fault tolerance using a small number of fault-tolerant virtual channels and extra hops. Experimental results show that, in the single-fault case, the fault-tolerant paths of VTFTR are 2 to 4 hops shorter than those of the compared algorithm. In a 4096-node network with 10 faulty leaf switches, VTFTR keeps all fault-free nodes in the network connected at the cost of a 1.4% to 2.0% drop in throughput, depending on how the faulty leaf switches are distributed.
Authors LIU Boyang (刘博阳); HU Shukai (胡舒凯); SHI Dejun (施得君); LU Hongsheng (卢宏生) (Strategic Support Force Information Engineering University, Zhengzhou 450001, China; Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214100, China; National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China)
Source Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2022, No. 12, pp. 38-44, 53 (8 pages in total)
Funding National Key Research and Development Program of China (2021YFB0301000).
Keywords high-performance interconnection network; k-dimension fat-tree; fault-tolerant routing algorithm; high-performance computing; deadlock prevention