
PDMiner: a cloud computing based parallel and distributed data mining toolkit platform

Cited by: 27
Abstract: With the development of information technology and the Internet, information of all kinds is growing explosively and contains a wealth of knowledge. Discovering useful knowledge from such massive data remains a challenging problem. Data mining, a key technology for this task, has attracted broad research interest over the past several decades. As data volumes grow, however, much previous work cannot process large-scale data efficiently, so designing new algorithms, or extending existing ones, to handle large-scale data sets has become an important research topic in data mining. Cloud computing based data mining has recently become a hot topic. In this paper, we develop PDMiner, a parallel and distributed data mining toolkit platform built on the large-scale data processing platform Hadoop. PDMiner implements a variety of parallel data mining algorithms, including data preprocessing, association rule analysis, classification, and clustering. The experimental results show that the parallel algorithms in PDMiner 1) can handle large-scale data sets, up to the terabyte level; 2) achieve good speedup; 3) run stably on a parallel platform built from commodity machines, which consolidates existing computing resources and improves their utilization; and 4) can be applied effectively to practical massive data mining tasks. In addition, PDMiner includes a workflow (knowledge flow) subsystem with a friendly, unified interface that helps users define data mining tasks. More importantly, it exposes flexible interfaces through which users can develop and integrate new parallel data mining algorithms.
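As an illustration of what running association rule analysis "in a parallel manner" can mean, the following is a minimal single-process sketch of MapReduce-style support counting. It is not PDMiner's code and does not use Hadoop; the function names (`map_partition`, `reduce_counts`, `frequent_itemsets`) and the toy transactions are assumptions made for this example only.

```python
from collections import Counter
from functools import reduce
from itertools import combinations


def map_partition(transactions, k):
    """Map step (hypothetical helper): count every k-item subset in one partition."""
    counts = Counter()
    for items in transactions:
        # Sort and de-duplicate so the same itemset always yields the same key.
        for itemset in combinations(sorted(set(items)), k):
            counts[itemset] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step (hypothetical helper): merge partial counts from two partitions."""
    return left + right


def frequent_itemsets(partitions, k, min_support):
    """Return the k-itemsets whose total count reaches min_support."""
    # Each partition is counted independently; on a cluster this is the part
    # that would be distributed across machines.
    partials = [map_partition(p, k) for p in partitions]
    totals = reduce(reduce_counts, partials, Counter())
    return {s: c for s, c in totals.items() if c >= min_support}


# Toy data: two partitions of shopping-basket transactions (made up).
partitions = [
    [["milk", "bread"], ["milk", "eggs"]],
    [["bread", "milk", "eggs"], ["bread", "eggs"]],
]
result = frequent_itemsets(partitions, k=2, min_support=2)
print(result)
```

Because each partition is counted without reference to the others, the map step can in principle be spread over many machines, which is the property that makes speedup possible as data grows.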
Source: Scientia Sinica (Informationis), CSCD, 2014, Issue 7, pp. 871-885 (15 pages)
Funding: National Natural Science Foundation of China (Grant Nos. 61175052, 61203297, 61035003); National High Technology Research and Development Program of China (863 Program) (Grant Nos. 2014AA012205, 2013AA01A606, 2012AA011003)
Keywords: cloud computing, parallel algorithms, distributed, data mining, big data

References (16)

  • 1. Han J W, Kamber M, Pei J. Data Mining: Concepts and Techniques. 3rd ed. San Francisco: Morgan Kaufmann, 2011.
  • 2. Luo P, Lu K, Huang R, et al. A heterogeneous computing system for data mining workflows in multi-agent environments. Expert Syst, 2006, 23:258-272.
  • 3. Zhuang F Z, He Q, Shi Z Z. Multi-agent based on automatic evaluation system for classification algorithm. In: Proceedings of International Conference on Information Automation, Zhangjiajie, 2008. 264-269.
  • 4. Hameenanttila T, Guan X L, Carothers J D, et al. The flexible hypercube: a new fault-tolerant architecture for parallel computing. J Parallel Distr Com, 1996, 37:213-220.
  • 5. Goudreau M W, Lang K, Rao S B, et al. Portable and efficient parallel computing using the BSP model. IEEE Trans Comput, 1999, 48:670-689.
  • 6. Chu C T, Kim S K, Lin Y A, et al. Map-reduce for machine learning on multicore. In: Proceedings of Advances in Neural Information Processing Systems 19, Vancouver, 2006. 281-288.
  • 7. Borthakur D. The Hadoop distributed file system: architecture and design. Hadoop Project Website, 2007, 11:21.
  • 8. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun ACM, 2008, 51:107-113.
  • 9. Luo P, Lu K, Shi Z Z, et al. Distributed data mining in grid computing environments. Future Gener Comp Sys, 2007, 23:84-91.
  • 10. Hall M, Frank E, Holmes G, et al. The WEKA data mining software: an update. ACM SIGKDD Explor Newsl, 2009, 11:10-18.


Co-citing documents: 3

Co-cited documents: 224

Citing documents: 27

Second-level citing documents: 639
