Abstract
With the rapid development of information technology and the Internet, information of all kinds is growing explosively and contains a wealth of knowledge. Extracting useful knowledge from such massive data remains a challenging problem. As a key technology for this task, data mining has attracted extensive research interest over the past several decades. However, as data volumes have grown, much previous work cannot handle large-scale data efficiently. Designing new algorithms, or extending existing ones, so that they can process large-scale data sets has therefore become an important research topic in data mining. Cloud-computing-based data mining has recently become a hot topic. In this paper, we develop PDMiner, a parallel and distributed data mining toolkit platform built on Hadoop, a large-scale data processing platform. PDMiner implements a variety of parallel data mining algorithms, covering data preprocessing, association rule analysis, classification, and clustering. Experimental results show that the parallel algorithms implemented in PDMiner 1) can handle large-scale data sets, up to the terabyte level; 2) achieve good speedup; 3) run stably on a parallel platform built from commodity machines, thereby making efficient use of existing computing resources; and 4) can be applied effectively to practical mining of massive data. In addition, PDMiner includes a workflow (knowledge flow) subsystem that provides a friendly, unified interface for users to define data mining tasks. More importantly, PDMiner exposes flexible interfaces through which users can develop and integrate new parallel data mining algorithms.
Source
《中国科学:信息科学》
CSCD
2014, No. 7, pp. 871-885 (15 pages)
Scientia Sinica(Informationis)
Funding
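Supported by the National Natural Science Foundation of China (Grant Nos. 61175052, 61203297, 61035003)
and the National High Technology Research and Development Program of China (863 Program) (Grant Nos. 2014AA012205, 2013AA01A606, 2012AA011003)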
Keywords
cloud computing, parallel algorithms, distributed, data mining, big data