Abstract
When processing the massive logistics big data of a large enterprise, traditional MapReduce-based ETL technology suffers from insufficient processing efficiency during data extraction and transformation because of frequent disk reads. Since Spark is an in-memory computing engine that does not depend on disk operations between processing stages, it can help improve extraction and transformation efficiency. This paper therefore adopts Spark-based distributed ETL technology to process the massive data, and compares the efficiency of the two approaches through experiments.
Authors
ZHANG Ye; YAO Wen-ming (North China Institute of Computing Technology, Beijing 100083, China)
Source
Information Technology, No. 12, 2019, pp. 165-168 (4 pages)