Analysis and Study of the C4.5 Algorithm Based on the Hadoop Platform

Abstract: Mining valuable information from massive data faster, more efficiently, and at lower cost has become a new challenge for data mining technology. Drawing on a study of the characteristics of the Hadoop platform and of the C4.5 decision tree algorithm, this paper brings cloud computing ideas into decision tree learning: it parallelizes C4.5 on the Hadoop platform and uses the MapReduce model to handle massive-data mining. The new algorithm is validated on a golf-playing data set. The experimental results show that, for massive data, the Hadoop-based decision tree algorithm markedly improves data mining efficiency and offers good efficiency and scalability. To some extent it also addresses the heavy computation and long tree-construction time that C4.5 faces when processing massive data.
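The abstract does not show how the attribute-selection step of C4.5 maps onto MapReduce, so the following is only a minimal single-process sketch of one plausible decomposition, not the authors' implementation. It assumes the classic 14-record "play golf" weather data (the paper's exact records are not given here), and the names `map_record`, `reduce_counts`, and `gain_ratios` are hypothetical. The map step emits one count per (attribute, value, class) triple, the reduce step sums them, and a driver step turns the summed counts into C4.5 gain ratios.

```python
import math
from collections import defaultdict

# Classic 14-record "play golf" weather data (an assumption; the abstract
# does not list the records the paper used).
# Columns: outlook, temperature, humidity, windy, play
RECORDS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]


def map_record(record):
    """Map step (hypothetical): emit ((attribute, value, label), 1) pairs."""
    *values, label = record
    for name, value in zip(ATTRIBUTES, values):
        yield (name, value, label), 1


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def entropy(counts):
    """Shannon entropy (base 2) of a list of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def gain_ratios(totals):
    """Driver step: convert reduced counts into a gain ratio per attribute."""
    # table[attribute][value][label] = count
    table = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for (name, value, label), count in totals.items():
        table[name][value][label] = count

    ratios = {}
    for name, by_value in table.items():
        # Class distribution over the whole data set.
        label_totals = defaultdict(int)
        for labels in by_value.values():
            for label, count in labels.items():
                label_totals[label] += count
        n = sum(label_totals.values())

        # Size of each partition induced by the attribute's values.
        sizes = [sum(labels.values()) for labels in by_value.values()]
        conditional = sum(
            size / n * entropy(list(labels.values()))
            for size, labels in zip(sizes, by_value.values())
        )
        gain = entropy(list(label_totals.values())) - conditional
        split_info = entropy(sizes)
        ratios[name] = gain / split_info if split_info else 0.0
    return ratios


pairs = (pair for record in RECORDS for pair in map_record(record))
ratios = gain_ratios(reduce_counts(pairs))
best = max(ratios, key=ratios.get)
print(best, round(ratios[best], 3))
```

In a real Hadoop job the map and reduce steps would run across the cluster and only the small table of counts would reach the driver, which is why this counting step is a natural candidate for parallelization; growing the tree would repeat the job on each resulting partition.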

Authors: Sun Yuan (孙媛), Huang Gang (黄刚)

Affiliation: School of Computer Science, Nanjing University of Posts and Telecommunications (南京邮电大学计算机学院)

Source: Computer Technology and Development (《计算机技术与发展》), 2014, No. 11, pp. 83-86, 90 (5 pages)

Funding: National Natural Science Foundation of China (61171053)

Keywords: Hadoop; MapReduce; data mining; C4.5 algorithm

Classification: TP301.6 [Automation and Computer Technology: Computer System Architecture]



