Abstract
Upcoming 100-petaflops and exaflops-class high-performance computing systems will place higher demands on the interconnection network in terms of transmission reliability, performance balance, and scalability. The RDMA transport model proposed in this paper provides reliable end-to-end data transfer by configuring a small amount of resources and using them through dynamically established connections. Compared with conventional reliable communication protocols such as InfiniBand, the proposed scheme has three advantages: (1) it supports automatic rerouting, so messages can bypass faulty regions of the network and still be delivered reliably; (2) it tolerates out-of-order packet arrival, supports multipath transmission between source and destination, and provides message flow control, which together balance overall network performance, reduce hot spots, and relieve congestion; (3) the reliability data structures are implemented in the communication interface hardware, so no main memory is consumed to establish connections, which gives the scheme very high system scalability. Preliminary test results show that, with the optimizations applied, the protocol does not increase the transmission latency of messages smaller than 4K bytes.
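The abstract does not describe the protocol's data structures. As a rough illustration of the out-of-order packet handling mentioned above, the sketch below shows one way a receiver could track the packets of a single message arriving in any order, for example over several paths. The class name `MessageReassembly` and its methods are hypothetical and are not taken from the paper; the real design keeps this state in interface hardware, not in software.

```python
from dataclasses import dataclass, field


@dataclass
class MessageReassembly:
    """Hypothetical receiver-side record for one message (not the paper's design).

    Tracks which packets have arrived, regardless of arrival order.
    """
    total_packets: int
    received: set = field(default_factory=set)

    def on_packet(self, seq: int) -> bool:
        """Record packet `seq`; return True once every packet has arrived."""
        if not 0 <= seq < self.total_packets:
            raise ValueError(f"sequence number {seq} out of range")
        # Adding a duplicate (e.g. a retransmission after rerouting) is a no-op.
        self.received.add(seq)
        return len(self.received) == self.total_packets

    def missing(self) -> list:
        """Sequence numbers that have not arrived yet, in ascending order."""
        return [s for s in range(self.total_packets) if s not in self.received]
```

Because completion depends only on the set of sequence numbers seen, packets may take different routes and arrive in any order, and a retransmitted duplicate does not corrupt the state.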
Source
《计算机工程与科学》
CSCD
Peking University Core Journals (北大核心)
2012, No. 8, pp. 184-190 (7 pages)
Computer Engineering & Science
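Funding
National High Technology Research and Development Program of China (863 Program) (2012AA01A301)
National Natural Science Foundation of China (61003301)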