A Parallel Large Dataset Generator Based on MPI

Abstract: When big data processing and analysis algorithms are being optimized, their measured speed is often limited by the size of the dataset. If the dataset is too small, the algorithm's communication time tends to exceed its actual computation time, so its real performance cannot be verified. A large dataset generator is therefore designed and implemented to provide benchmark datasets for parallel big data processing and analysis algorithms running on supercomputers. First, a parallel random number generator is built with MPI parallel programming. On top of it, artificial datasets with controllable scale and complexity are implemented, mainly classification and clustering datasets, regression datasets, manifold learning datasets, and factorization datasets. Second, the I/O system of the generator is designed. It provides interfaces for reading and writing datasets in parallel through MPI-I/O, defines the rules for distributing and mapping a dataset across processes, and uses point-to-point communication to exchange data between nodes. Experimental results show that the parallel large dataset generator effectively improves both the efficiency and the scale of data generation, and provides high-quality, large-scale test datasets for parallel big data processing and analysis algorithms.
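The abstract's scheme (per-process random streams, a rule mapping rows to processes, and writes at disjoint file offsets) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: MPI is not used, the loop over `rank` stands in for concurrently running processes, and every name here (`block_range`, `generate_block`, `generate_dataset`), the seeding rule, and the binary row layout are hypothetical. In a real MPI program each rank would run the loop body once and write its bytes with an explicit-offset call such as `MPI_File_write_at`.

```python
import random
import struct


def block_range(n_rows, n_ranks, rank):
    """Rows [start, stop) owned by `rank` under a balanced block distribution."""
    base, extra = divmod(n_rows, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def generate_block(seed, rank, start, stop, n_features, n_classes):
    """Generate labelled rows for one rank from that rank's own seeded stream."""
    # Hypothetical seeding rule: distinct seeds give reproducible per-rank
    # streams, but not provably independent ones; a production parallel RNG
    # would use a dedicated stream-splitting scheme instead.
    rng = random.Random(seed * 1_000_003 + rank)
    rows = []
    for i in range(start, stop):
        label = i % n_classes
        # Each class is a Gaussian cluster centred at 5 * label in every dimension.
        features = [rng.gauss(5.0 * label, 1.0) for _ in range(n_features)]
        rows.append((label, features))
    return rows


def pack_rows(rows):
    """Serialize rows as little-endian int32 label followed by float64 features."""
    return b"".join(struct.pack(f"<i{len(f)}d", label, *f) for label, f in rows)


def generate_dataset(n_rows, n_ranks, n_features=3, n_classes=2, seed=42):
    """Simulate all ranks writing their blocks into one shared byte buffer."""
    row_bytes = 4 + 8 * n_features
    buf = bytearray(n_rows * row_bytes)
    for rank in range(n_ranks):  # stands in for parallel MPI processes
        start, stop = block_range(n_rows, n_ranks, rank)
        rows = generate_block(seed, rank, start, stop, n_features, n_classes)
        data = pack_rows(rows)
        offset = start * row_bytes  # disjoint offsets, so writes never overlap
        buf[offset:offset + len(data)] = data
    return bytes(buf)
```

Because each rank's byte range depends only on the row count and its rank number, no coordination is needed before writing. The output is reproducible for a fixed seed and process count, but changing the number of ranks changes which stream produces which row.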

Authors: GE Xu-ran; LIU Yang; CHEN Zhi-guang; XIAO Nong (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China; School of Computer, Sun Yat-sen University, Guangzhou 510006, China)

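Affiliations: College of Computer Science and Technology, National University of Defense Technology; School of Computer, Sun Yat-sen University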

Source: Computer Engineering & Science (计算机工程与科学), indexed in CSCD and the Peking University Core Journals list, 2022, No. 7, pp. 1152-1161 (10 pages)

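Funding: National Key Research and Development Program of China (2018YFC1406205); National Natural Science Foundation of China (U1811461, 61872392); Natural Science Foundation of Guangdong Province (2018B0303120); Guangdong Basic and Applied Basic Research (2019B030302002).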

Keywords: MPI; large dataset generator; I/O system; parallel big data processing algorithm; algorithm test

Classification number: TP393 [Automation and Computer Technology: Computer Application Technology]
