COS: Measuring the Efficiency of Distributed Big Data Processing System

Abstract

[Objective] Distributed computing systems are widely used in big data processing. They are designed and implemented with a focus on scalability: a scalable system can hold and process a growing amount of data by adding resources, without modification to the system itself. This scalability often comes at the cost of absolute per-machine performance and at great expense, and that absolute performance rarely receives attention. We want to offer a reasonable and up-to-date metric for evaluating the performance of distributed systems.

[Methods] We discuss the performance of distributed systems by comparing them with a single-machine implementation of the same task, using the proposed metric COS (Configuration that Outperforms a Single machine). The COS of a system on a given problem is the number of machines the system requires to outperform a competent single-machine implementation. Given limited hardware resources, the COS of a distributed system is usually too large to measure directly, so we offer a second metric by giving COS a parameter n: COS(n) equals n multiplied by the ratio of the running time on n machines to the running time on a single machine. COS(n) indicates the performance and cost lost in a cluster system. We implemented two classic machine learning algorithms, k-means clustering and logistic regression, on a single machine with multi-threading, SIMD (vectorized) computation, and NUMA-aware memory allocation and access, to serve as a reference point for distributed systems.

[Results] With Apache Spark as the system under comparison, our experiments show that, whether through its native API or through a carefully optimized machine learning library such as MLlib, Spark needs tens to hundreds of machines to match the performance of our single-machine multi-threaded implementation.

[Limitations] The comparison between a single machine and a cluster is not entirely fair, because the overheads of a distributed system are unavoidable.

[Conclusions] The COS metric nevertheless reflects real problems in distributed systems: poor absolute performance and insufficient use of the hardware's capabilities.
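The two definitions in the abstract can be sketched in a few lines of Python. This is only an illustration of the arithmetic as the abstract states it, not the authors' code: the function names `cos_n` and `find_cos` are hypothetical, and the timings in the usage lines are made-up numbers, not results from the paper.

```python
from typing import Mapping, Optional


def cos_n(n: int, time_on_n_machines: float, time_on_single_machine: float) -> float:
    """COS(n) as described in the abstract: n * (T_n / T_single).

    T_n is the running time of the distributed system on n machines and
    T_single is the running time of the single-machine implementation.
    """
    if n <= 0 or time_on_n_machines <= 0 or time_on_single_machine <= 0:
        raise ValueError("n and both running times must be positive")
    return n * time_on_n_machines / time_on_single_machine


def find_cos(cluster_times: Mapping[int, float],
             time_on_single_machine: float) -> Optional[int]:
    """Smallest measured machine count whose running time beats the single machine.

    Returns None when no measured configuration outperforms the single
    machine, i.e. COS is larger than anything that was measured.
    """
    for n in sorted(cluster_times):
        if cluster_times[n] < time_on_single_machine:
            return n
    return None


# Hypothetical timings in seconds (illustrative only, not from the paper).
single = 20.0
cluster = {1: 200.0, 2: 110.0, 4: 60.0, 8: 35.0, 16: 18.0}

print(cos_n(4, cluster[4], single))   # 4 * 60 / 20
print(find_cos(cluster, single))      # first n with time below 20 s
```

Under this definition, a COS(n) equal to n would mean the n-machine cluster only ties the single machine, and larger values mean more hardware is being spent for the same work.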

Authors: Li Xiaohan; Chen Wenguang (Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)

Affiliation: Department of Computer Science and Technology, Tsinghua University

Source: Frontiers of Data & Computing, 2020, No. 1, pp. 93-104 (12 pages)

Keywords: parallel computing; big data; multi-thread; k-means; logistic regression

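Classification numbers: TP311.13 [Automation and Computer Technology: Computer Software and Theory]; TP316.4 [Automation and Computer Technology: Computer Software and Theory]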
