Exploring Message Passing Method Between Spark Tasks (Spark任务间消息传递方法研究)

Cited by: 2

Abstract: Many engineering problems and scientific research projects today face the dual challenge of big data processing and high-performance computing. Spark, a distributed processing framework built on in-memory computing, is widely used in academia and industry, but its MapReduce-like programming model does not allow tasks to communicate with one another, so numerical algorithms in scientific computing cannot be implemented efficiently. To address this problem, the paper studies a solution that combines Spark in-memory computing with the MPI message passing model. The solution exploits fast memory access and MPI's multiple high-performance communication mechanisms, compensates for the limited expressiveness of the Spark programming model, and gives MPI a data-oriented DAG computation method. Spark's internal runtime environment and scheduling system are modified so that MPI integrates seamlessly into Spark, providing a unified in-memory computing system for high-performance computing and big data tasks. Test results show a performance improvement of at least 50% over Spark on numerical computation and iterative algorithms.
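The contrast drawn in the abstract, between tasks that can only return results to a driver and tasks that can exchange messages directly, can be illustrated with a toy sketch. This is not the paper's implementation and uses neither Spark nor MPI: it is a single-process simulation with Python threads, and the function names `driver_mediated_sum` and `allreduce_sum` are hypothetical names chosen for illustration.

```python
import threading


def driver_mediated_sum(partitions):
    """MapReduce-like pattern: each task returns a partial result to the
    driver, and only the driver sees the combined value."""
    partials = [sum(p) for p in partitions]  # one "task" per partition
    return sum(partials)                     # the driver combines the partials


def allreduce_sum(partitions):
    """Message-passing pattern (allreduce-style): every task ends up
    holding the global sum without a round trip through a driver."""
    n = len(partitions)
    mailbox = [0] * n                # slot r holds the partial sum of task r
    results = [None] * n             # what each task sees after the exchange
    barrier = threading.Barrier(n)   # stands in for the synchronisation step

    def task(rank):
        mailbox[rank] = sum(partitions[rank])  # publish the local partial sum
        barrier.wait()                         # wait until all peers published
        results[rank] = sum(mailbox)           # combine the peers' values

    threads = [threading.Thread(target=task, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In an iterative algorithm, the first pattern would need a new driver round trip for every iteration, whereas in the second each task already holds the combined value and could continue with the next iteration directly. How the paper's system realises this inside Spark is not described in the abstract.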

Authors: XIA Libin; LIU Xiaoyu; SUN Wei; JIANG Xiaowei; SUN Gongxing (Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China; University of Chinese Academy of Sciences, Beijing 100049, China)

Affiliations: Institute of High Energy Physics, Chinese Academy of Sciences; University of Chinese Academy of Sciences

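Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2022, No. 21, pp. 91-97 (7 pages)

Funding: National Natural Science Foundation of China (12275295, 11775249).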

Keywords: Spark; MPI; scientific computing; in-memory computing; iterative algorithm

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]

