Abstract
Moving business workloads to the cloud has become a trend in recent years, and the COVID-19 pandemic has accelerated it. However, public clouds do not suit every user. For data privacy reasons in particular, many users, especially government users, prefer to build their own private or hybrid clouds in the post-COVID-19 era, and hyperconverged infrastructure (HCI) is an effective way to achieve this goal. In HCI, computing, storage, and network resources are all fully virtualized, which raises resource utilization and simplifies deployment, and traditional physical network devices are replaced by software. To achieve high-performance packet forwarding in such virtualized environments, many innovative technologies have emerged, among which the Data Plane Development Kit (DPDK) stands out and is widely used. With DPDK, developers can build a wide variety of customized network forwarding applications. Together, virtualization and DPDK greatly improve resource utilization and forwarding performance, reducing the difficulty and cost of building data centers or private clouds for enterprises and institutions of all sizes. However, a high degree of virtualization also poses great challenges for network operations and maintenance. Virtual network elements have no physical form, so the virtual network looks like a black box to operators, and this lack of visibility leaves the network vulnerable. When a virtual network fails (e.g., drops packets), traditional troubleshooting tools designed for physical network devices no longer apply, which greatly lengthens the mean time to repair (MTTR) and disrupts business continuity.
To address these problems, this study proposes Flowprobe, a proactive diagnostic system for persistent packet loss in HCI-based clouds, which aims to detect persistent packet loss and locate its root cause in DPDK-based userspace virtual networks. With this system, users can observe the detailed path a packet takes through the virtual network and the forwarding actions applied to it, pinpoint where packets are lost, and learn why they are lost. Experiments show that the system can detect 576 persistent packet loss scenarios in virtual networks and identify their root causes. Performance tests show that enabling the system degrades normal forwarding performance by no more than 1%. The system has run continuously in an HCI production environment for about 3 years and has helped users and network operators resolve many virtual network faults.
Authors
李德方
古亮
闫争争
陈晓帆
LI De-Fang; GU Liang; YAN Zheng-Zheng; CHEN Xiao-Fan (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; Sangfor Technologies, Shenzhen 518055, China)
Source
《计算机系统应用》
2022, No. 9, pp. 99-113 (15 pages)
Computer Systems & Applications
Keywords
virtualization
software-defined networking (SDN)
cloud computing
hyperconverged infrastructure (HCI)
Data Plane Development Kit (DPDK)
fault diagnosis
microservice framework