Design and Implementation of a Domain-Specific Low-Latency, High-Bandwidth TCP/IP Offload Engine
Abstract: To meet the low-latency, high-bandwidth data-transmission requirements of quantitative high-frequency trading, a domain-specific Transmission Control Protocol/Internet Protocol (TCP/IP) stack is customized and offloaded to a dedicated hardware acceleration module. The hardware logic follows a modular design and, together with a FAST protocol hardware acceleration module, forms a complete low-latency, high-bandwidth high-frequency trading system. Adjusting the Maximum Segment Size (MSS) achieves 64-byte data alignment, which raises the read/write rate between the kernel and High Bandwidth Memory (HBM); the memory structure is also optimized to support 4-channel parallel read/write management between the host and the HBM. The data flow of each functional module, including the data used for verification and calculation, is optimized, yielding a fully pipelined architecture. Modules are connected uniformly through AXI4-Stream interfaces and exchange data directly, bypassing memory, which improves transmission performance. Experimental results show that the TCP/IP offload engine achieves a network throughput of 38.28 Gb/s on a Xilinx Alveo U50 data center accelerator card. The minimum pass-through latency for basic network communication is 468.4 ns, rising to 677.9 ns once FAST decoding is added. Compared with a conventional software network stack (Intel i9-9900x+9802BF), the engine doubles throughput and reduces latency to 1/12; the latency is also stable, fluctuating within about 10 ns. The design meets the needs of quantitative high-frequency trading while effectively reducing the load on the CPU.
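The MSS adjustment described in the abstract can be illustrated with a little arithmetic. The following is a minimal sketch, not the authors' implementation: it assumes a standard 1500-byte Ethernet MTU, 20-byte IPv4 and TCP headers without options, and that the alignment applies to the TCP payload. The function name `aligned_mss` and all parameter names are hypothetical.

```python
ALIGN_BYTES = 64  # alignment unit stated in the abstract


def aligned_mss(mtu: int = 1500, ip_header: int = 20, tcp_header: int = 20,
                align: int = ALIGN_BYTES) -> int:
    """Return the largest MSS that fits in the MTU and is a multiple of `align`.

    The header sizes and MTU are illustrative assumptions, not values
    taken from the paper.
    """
    max_payload = mtu - ip_header - tcp_header
    if max_payload < align:
        raise ValueError("MTU too small for one aligned payload unit")
    # Round down to the nearest multiple of the alignment unit.
    return (max_payload // align) * align


if __name__ == "__main__":
    mss = aligned_mss()
    print(f"aligned MSS: {mss} bytes ({mss // ALIGN_BYTES} units of {ALIGN_BYTES} bytes)")
```

Under these assumptions the payload limit drops from 1460 to 1408 bytes, so every full segment occupies a whole number of 64-byte units and no partially filled unit has to be written to memory.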
Authors: FENG Yifei (冯一飞), DING Nan (丁楠), YE Junchao (叶钧超), CHAI Zhilei (柴志雷). Affiliations: School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China; Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Wuxi, Jiangsu 214122, China
Source: Computer Engineering (《计算机工程》; indexed in CAS, CSCD, and the Peking University Core Journals list), 2022, No. 9, pp. 162-170 (9 pages)
Funding: National Natural Science Foundation of China (61972180).
Keywords: domain-specific; Transmission Control Protocol/Internet Protocol (TCP/IP) offload engine; low latency and high bandwidth; Field Programmable Gate Array (FPGA); Open Computing Language (OpenCL)