Analysis and Research on Data Skew of Engineering Machinery and Equipment

Abstract: Data skew is the most common and intractable problem in big data computing for mechanical equipment. Mechanical equipment data are varied and complex. When the data are skewed, a large share of the computing tasks concentrates on a single node or partition; after the other nodes or partitions have finished their tasks, the skewed node still has work outstanding. This not only lengthens the job's computation time but also raises the likelihood that the program runs out of memory. Cluster resource utilization and cluster computing performance may also be low. This paper designs and builds a Spark-based big data platform for monitoring construction machinery equipment. The platform mainly handles the storage, cleaning, and business statistics of equipment sensor data, and it also supports custom monitoring web pages. The big data storage and computing modules are integrated with a Web visualization module so that equipment managers and business analysts can manage engineering equipment and conduct business analysis intuitively and effectively.
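The concentration effect described in the abstract can be illustrated with a minimal sketch. It is not the paper's platform or method: the device IDs, record counts, partition count, and the CRC32-based partitioner are all assumptions chosen for illustration, and no Spark API is used. The sketch only shows how one frequently reporting key pushes most records into a single hash partition.

```python
from collections import Counter
import zlib


def partition_of(key: str, num_partitions: int) -> int:
    """Assign a key to a partition with a deterministic hash (CRC32 modulo count)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 means perfectly even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# Hypothetical sensor records keyed by device ID: one "hot" device
# reports far more often than the other 1000 devices.
records = ["device-000"] * 9000 + [f"device-{i:03d}" for i in range(1, 1001)]

sizes = partition_sizes(records, num_partitions=8)
print("records per partition:", sizes)
print("skew ratio:", round(skew_ratio(sizes), 2))
```

Because every record with the same key hashes to the same partition, the partition holding the hot device ends up with at least 9000 of the 10000 records, while the task for each of the other seven partitions finishes quickly and then sits idle.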

Authors: YANG Shasha; HUANG Yan (School of ZTE, Xi'an Traffic Engineering Institute, Xi'an 710300)

Affiliation: School of ZTE, Xi'an Traffic Engineering Institute

Source: Academic Research of Xi'an Traffic Engineering Institute, 2022, No. 2, pp. 36-40 (5 pages)

Keywords: data skew; same node; partitions; performance

Classification: O121.8 [Science: Fundamental Mathematics]; G558 [Culture and Science: Educational Technology]

