Abstract
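[Objective] Distributed computing systems are widely used in big data processing. Their scalability receives much attention, but their absolute performance is often overlooked. We aim to propose a sound and up-to-date metric for evaluating the performance of distributed systems. [Methods] This paper discusses the performance of distributed systems by comparing single-machine and distributed implementations of specific tasks, and proposes the COS (Configuration that Outperforms a Single machine) metric, which measures the amount of hardware resources a distributed system needs to match the performance of a single machine. We select two classic machine learning algorithms, k-means clustering and logistic regression, implement them with multiple threads on a single machine, and optimize them through vectorized computation and improved memory allocation and access, providing a reference point for the performance of distributed multi-machine systems. [Results] With Apache Spark as the system under comparison, experiments show that whether its native programming interface or a carefully optimized machine learning library is used, several times to hundreds of times as many machines are needed to match the performance of the single-machine multi-threaded implementation. [Limitations] Comparing a distributed system with a single-machine implementation is not entirely fair, since the extra overhead of distributed systems is real. [Conclusions] Nevertheless, the COS metric still reflects problems in distributed systems such as poor absolute performance and insufficient use of hardware advantages.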
[Objective] Distributed computing systems are widely used in big data processing. They are designed and implemented with a focus on scalability. With good scalability, a system can store and process a growing amount of data by adding resources without modifying the system itself, but it often sacrifices the absolute performance of a single machine at great expense. We want to offer a reasonable and modern metric to evaluate the performance of distributed systems. [Methods] In this article, we discuss the performance of distributed systems by comparing them with single-machine implementations of the same task, using the proposed metric COS, the Configuration that Outperforms a Single machine. The COS of a system on a given problem is the number of machines required for the system to outperform a competent single-machine implementation. Given limited hardware resources, the COS of a distributed system is usually too large to measure, so we offer a second metric by giving COS a parameter n. COS(n) equals n multiplied by the ratio of the running time on n machines to the running time on a single machine. COS(n) indicates the performance and expense loss in a cluster system. We implemented two classic machine learning algorithms, k-means clustering and logistic regression, on a single machine with multi-threading, SIMD support, and NUMA-aware memory control. [Results] Our experiments show that Apache Spark, whether used through its native API or through an optimized machine learning library such as MLlib, needs tens to hundreds of machines to achieve the same performance as our single-machine implementation. [Limitations] The comparison between a single machine and a cluster is not entirely fair, because overheads in a cluster are unavoidable. [Conclusions] The COS metric can still reflect the problems of poor absolute performance and insufficient utilization of hardware advantages in distributed systems.
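The COS(n) definition in the abstract can be written as COS(n) = n * T_n / T_1, where T_n is the running time on n machines and T_1 is the running time of the single-machine implementation. The following is a minimal sketch of that arithmetic only; the function name, parameter names, and timings are hypothetical and are not taken from the paper.

```python
def cos_n(n: int, time_on_n_machines: float, time_on_single_machine: float) -> float:
    """Compute COS(n) = n * T_n / T_1 as described in the abstract.

    All names here are illustrative assumptions, not the authors' code.
    """
    if n <= 0:
        raise ValueError("n must be a positive machine count")
    if time_on_n_machines <= 0 or time_on_single_machine <= 0:
        raise ValueError("running times must be positive")
    return n * time_on_n_machines / time_on_single_machine


# Hypothetical timings (not results from the paper):
# 8 machines take 50 s, the single-machine implementation takes 20 s.
example = cos_n(8, 50.0, 20.0)
print(example)  # 20.0
```

Under this definition, COS(n) equals n exactly when n machines take the same time as the single machine, and it exceeds n when the cluster is slower than the single machine.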
Authors
李晓涵
陈文光
Li Xiaohan; Chen Wenguang (Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
Source
《数据与计算发展前沿》
2020, Issue 1, pp. 93-104 (12 pages)
Frontiers of Data & Computing