Abstract
This paper proposes a parallel data analysis system built on the Spark cloud computing platform. The system primarily targets large-scale graph data analysis tasks, also supports non-graph data analysis applications, and integrates a set of graph data analysis algorithms with a set of non-graph data analysis algorithms. The paper describes the system's architecture, its workflow engine and dynamic component update technology, and the design and implementation of several of its parallel data analysis algorithms. Performance tests on datasets of multiple scales, together with a performance comparison against the traditional MapReduce platform, show that the system completes computing tasks more efficiently than previous graph data mining systems and can also analyze non-graph data effectively.
Source
Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2015, Issue 9, pp. 1066-1074 (9 pages)
Funding
Ministry of Education-China Mobile Research Fund, No. MCM20130351
Beijing Municipal Education Commission Joint Construction Project
Keywords
cloud computing
parallel algorithms
graph data analysis
data mining
social network analysis