Abstract
As data mining is increasingly applied in finance, entertainment, commerce, healthcare, and other industries, a variety of open-source engines for processing massive data sets, such as MapReduce and Spark, have emerged in recent years. To support deeper theoretical research on data mining in university laboratories, this paper briefly describes the concepts and basic principles of Spark and HDFS, details the configuration method and implementation process of a Spark-based cloud computing platform, and summarizes the problems encountered while building the platform. Experimental results show that the platform can effectively complete distributed data processing tasks.
Authors
张恬恬
孙绍华
ZHANG Tian-tian; SUN Shao-hua (School of Computer Science, Xi'an Shiyou University, Xi'an 710065, China)
Source
《软件导刊》
2018, No. 4, pp. 191-193 (3 pages)
Software Guide
Funding
Xi'an Shiyou University Graduate Innovation and Practical Ability Training Program (YCS17131014)