Abstract
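With the rapid development of multi-core and many-core technology, the number of processes in supercomputers is rising sharply, and exascale computing will involve hundreds of thousands of processes. Establishing full connectivity among all processes with IB reliable transport services such as RC and XRC suffers from serious scalability problems. This paper proposes an extended unreliable datagram transport service, XUD, in which the hardware provides a simple ACK mechanism and UD-like programming semantics, and the software implements message retransmission and duplicate elimination in the Verbs library with a lightweight sliding window, so that software and hardware cooperate to achieve efficient, reliable message transport. Compared with RC, XUD consumes significantly less memory: when full connectivity is established among 131072 processes, the memory overhead per processor falls from 3.1 GB to 372 MB. In addition, implementing reliability in software adds very little overhead: message latency increases by only about 0.05 usec and bandwidth drops by only about 70 MB/s. XUD therefore has clear advantages for large-scale interconnects.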
With the rapid development of multi-core and many-core processors, the number of cores and processes in a single supercomputer is increasing drastically; in exascale computing it could reach hundreds of thousands. Using IB reliable transport services such as RC or XRC to establish all-to-all process connectivity at this scale causes serious scalability problems. We propose a new transport service named eXtended Unreliable Datagram (XUD). In XUD, the hardware provides a simple ACK mechanism and programming semantics similar to UD, while the Verbs library provides guaranteed delivery and duplicate-packet detection through a lightweight sliding window mechanism, so that hardware and software cooperate to deliver a fully reliable transport service. Compared to RC, XUD significantly reduces memory usage: when establishing all-to-all connectivity among 131072 processes, the memory overhead per processor drops from 3.1 GB with RC to 372 MB with XUD. Meanwhile, the software overhead of implementing reliability is very low: message latency increases by only 0.03 usec and bandwidth decreases by only 70 MB/s. Compared to RC and UD, XUD has a notable advantage when implementing very large all-to-all connectivity.
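The abstract describes software-side retransmission and duplicate detection with a lightweight sliding window but gives no implementation details. The following is a minimal sketch of that general technique, not the authors' Verbs-library code. The class names `SendWindow` and `RecvWindow`, the window size, the timeout value, and the use of unbounded integer sequence numbers (no wraparound handling) are all assumptions made for illustration.

```python
class SendWindow:
    """Hypothetical sender side: keep unacknowledged messages for retransmission."""

    def __init__(self, size=64, timeout=1.0):
        self.size = size          # assumed maximum number of in-flight messages
        self.timeout = timeout    # assumed retransmission timeout in seconds
        self.next_seq = 0
        self.pending = {}         # seq -> (payload, time of last transmission)

    def send(self, payload, now):
        """Assign a sequence number, or return None if the window is full."""
        if len(self.pending) >= self.size:
            return None
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = (payload, now)
        return seq

    def ack(self, seq):
        """An ACK arrived: the message no longer needs to be kept."""
        self.pending.pop(seq, None)

    def due_for_retransmit(self, now):
        """Return sequence numbers whose ACK is overdue and restart their timers."""
        due = sorted(s for s, (_, t) in self.pending.items()
                     if now - t >= self.timeout)
        for s in due:
            self.pending[s] = (self.pending[s][0], now)
        return due


class RecvWindow:
    """Hypothetical receiver side: drop duplicates caused by retransmission."""

    def __init__(self, size=64):
        self.size = size
        self.base = 0             # every sequence number below base was received
        self.seen = set()         # numbers >= base received out of order

    def accept(self, seq):
        """Return True if seq is new and inside the window, False otherwise."""
        if seq < self.base or seq in self.seen:
            return False          # duplicate of an already received message
        if seq >= self.base + self.size:
            return False          # beyond the window; sender will retransmit
        self.seen.add(seq)
        while self.base in self.seen:   # slide past the contiguous prefix
            self.seen.remove(self.base)
            self.base += 1
        return True
```

In this sketch the per-peer state is one counter plus a small bounded collection, which illustrates why window-based reliability in software can be cheaper in memory than a full per-connection hardware context; the actual XUD data structures may differ.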
Authors
陈淑平
彭龙根
Chen Shuping; Peng Longgen (Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214000, China)
Source
《科研信息化技术与应用》
2017, No. 6, pp. 13-20 (8 pages)
E-science Technology & Application
Funding
National High Technology Research and Development Program of China (2015AA015306)