Abstract
Modern large-scale cloud computing tasks often require multiple clusters to work together, so planning the network that connects the clusters and ensuring its performance is important. This paper discusses inter-cluster network performance optimization in data centers through a real case of an inter-cluster network performance problem and its resolution. The case comes from a data center used for large-scale real-time data processing, where severe packet loss was observed during a large-scale concurrent task even though the bandwidth was not fully used. The performance bottleneck was located by analyzing the topology. A test environment was then built and measured to identify the source of the bottleneck: a switch whose processing capacity and buffer were insufficient even when its links were not fully loaded. Finally, the influence of each factor on performance was analyzed with a model, and a solution based on increasing the frame length was designed from the results.
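The link between frame length and switch load can be illustrated with a rough calculation. The sketch below is not the model used in the paper; it only assumes standard Ethernet framing (20 bytes of preamble, start frame delimiter and inter-frame gap per frame) and a hypothetical 1 Gbit/s link, and the frame sizes are illustrative. It shows that, at the same link utilization, longer frames mean far fewer frames per second for the switch to process.

```python
# Per-frame on-wire overhead for Ethernet: 7-byte preamble + 1-byte start
# frame delimiter + 12-byte minimum inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def frames_per_second(link_bps: float, frame_bytes: int,
                      utilization: float = 1.0) -> float:
    """Return the frame rate a device must forward on one link.

    link_bps    -- link speed in bits per second (assumed value, not from the paper)
    frame_bytes -- frame length in bytes, header and FCS included
    utilization -- fraction of the link in use, between 0 and 1
    """
    if frame_bytes <= 0:
        raise ValueError("frame_bytes must be positive")
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps * utilization / bits_per_frame


if __name__ == "__main__":
    link_bps = 1e9  # hypothetical 1 Gbit/s link
    for frame_bytes in (64, 512, 1518, 9018):  # illustrative frame sizes
        rate = frames_per_second(link_bps, frame_bytes, utilization=0.5)
        print(f"{frame_bytes:>5} B frames at 50% load: {rate:,.0f} frames/s")
```

Under these assumptions, a switch limited by per-frame processing can drop frames well below full bandwidth when frames are short, which is consistent with the direction of the frame-length solution described in the abstract.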
Source
Telecommunications Science (《电信科学》)
PKU Core Journal (北大核心)
2015, Issue 5, pp. 138-142 (5 pages)
Keywords
cluster network
performance optimization
throughput