Abstract
With the continued expansion of the Internet into every field, data has become increasingly diverse in structure and massive in volume. Faced with this volume of data, improving the efficiency of ETL is critical. To address the inconsistent data sources and formats found in "information islands" and the poor real-time performance of data collection, this paper proposes vertical segmentation of the ETL workflow and horizontal segmentation of the data set to be processed, and builds a streaming ETL processing scheme on the Storm platform. In addition, because Storm is insensitive to the CPU load of worker nodes when assigning tasks, a scheduled task records the CPU load of each worker node, and this information is used to optimize the slot allocation of the Storm scheduler so that load across the Storm cluster is better balanced. Experimental results show that the scheme effectively improves ETL processing efficiency, and that the slot allocation optimization effectively improves system stability and processing efficiency.
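The abstract does not spell out the slot allocation algorithm, so the following is only a minimal sketch of one way a CPU-load-aware allocation could work, not the paper's method. It assumes a scheduled task has already recorded a CPU load figure per worker node, and it greedily places each worker on the node with the lowest estimated load. The function name `assign_slots`, the node names, and the `load_per_worker` estimate are all hypothetical, and the sketch deliberately avoids the Storm scheduler API.

```python
import heapq
from typing import Dict, List, Tuple


def assign_slots(
    cpu_load: Dict[str, float],
    free_slots: Dict[str, List[int]],
    num_workers: int,
    load_per_worker: float = 0.1,
) -> List[Tuple[str, int]]:
    """Greedily pick free slots on the node with the lowest estimated CPU load.

    cpu_load: last recorded CPU load per node (hypothetical input from a timer task).
    free_slots: free slot ports per node.
    load_per_worker: assumed load added to a node by each placed worker.
    """
    # Min-heap of (estimated load, node) for nodes that still have free slots.
    heap = [(cpu_load[node], node) for node, ports in free_slots.items() if ports]
    heapq.heapify(heap)
    remaining = {node: list(ports) for node, ports in free_slots.items()}

    assignment: List[Tuple[str, int]] = []
    while heap and len(assignment) < num_workers:
        load, node = heapq.heappop(heap)
        port = remaining[node].pop(0)
        assignment.append((node, port))
        # Re-queue the node with its raised load estimate if slots remain.
        if remaining[node]:
            heapq.heappush(heap, (load + load_per_worker, node))
    return assignment


if __name__ == "__main__":
    loads = {"node-a": 0.70, "node-b": 0.20, "node-c": 0.35}
    slots = {"node-a": [6700, 6701], "node-b": [6700, 6701], "node-c": [6700]}
    print(assign_slots(loads, slots, 3, load_per_worker=0.25))
```

Updating the load estimate after each placement keeps a single lightly loaded node from absorbing every worker, which is the balancing effect the abstract describes.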
Author
LIANG Kui-kui (梁奎奎), College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journal (北大核心)
2019, Issue S11, pp. 208-211, 240 (5 pages)
Keywords
ETL
Vertical segmentation
Horizontal segmentation
Storm
Load optimization