Improved decision tree algorithm implemented by MapReduce (cited by: 2)
Abstract: Because different attributes in some data sets influence the class label to different degrees, this paper introduces the concept of attribute weights to improve the C4.5 decision tree algorithm. The importance of each attribute to the class during classification is computed, each attribute is assigned a corresponding weight, and these weights are used when computing each attribute's information gain ratio in order to find the best decision attribute. The algorithm is also run on an HDFS cluster: the Hadoop platform coordinates multiple computers to process the data set to be classified simultaneously, so the decision tree is built in parallel. Experimental results show that the improved C4.5 algorithm achieves higher accuracy than the traditional C4.5 algorithm on data whose attributes influence the classification result to different degrees, and that, because the program runs in parallel, it processes large data sets more efficiently and scales well.
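The abstract does not give the weighting formula or the job layout, so the following is only a minimal sketch under stated assumptions. It assumes the weighted score of an attribute is simply its weight multiplied by its standard C4.5 gain ratio, and it simulates a map step and a reduce step in a single Python process instead of running on Hadoop. The names `map_phase`, `reduce_phase`, `weighted_gain_ratios`, and the toy data and weights are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def map_phase(rows, label_index):
    """Emit ((attribute index, attribute value, class label), 1) pairs, as a mapper might."""
    for row in rows:
        label = row[label_index]
        for i, value in enumerate(row):
            if i != label_index:
                yield (i, value, label), 1


def reduce_phase(pairs):
    """Sum the emitted counts per key, as a reducer might."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return totals


def weighted_gain_ratios(totals, weights):
    """Return weight * gain ratio per attribute (assumed form of the weighting)."""
    by_attr = defaultdict(lambda: defaultdict(Counter))
    for (attr, value, label), n in totals.items():
        by_attr[attr][value][label] += n

    scores = {}
    for attr, values in by_attr.items():
        # Class distribution over the whole data set, rebuilt from the counts.
        class_counts = Counter()
        for label_counts in values.values():
            class_counts.update(label_counts)
        total = sum(class_counts.values())

        base = entropy(list(class_counts.values()))
        sizes = [sum(lc.values()) for lc in values.values()]
        # Expected entropy after splitting on this attribute.
        remainder = sum(
            sum(lc.values()) / total * entropy(list(lc.values()))
            for lc in values.values()
        )
        split_info = entropy(sizes)
        gain_ratio = (base - remainder) / split_info if split_info > 0 else 0.0
        # Missing weights default to 1.0, which reduces to plain C4.5.
        scores[attr] = weights.get(attr, 1.0) * gain_ratio
    return scores


# Toy data: two attributes followed by the class label (hypothetical values).
rows = [
    ("sunny", "hot", "no"),
    ("sunny", "mild", "no"),
    ("rain", "hot", "yes"),
    ("rain", "mild", "yes"),
]
weights = {0: 0.5, 1: 2.0}  # hypothetical attribute weights

totals = reduce_phase(map_phase(rows, label_index=2))
scores = weighted_gain_ratios(totals, weights)
best_attribute = max(scores, key=scores.get)
```

In this toy data the first attribute separates the classes perfectly and the second carries no information, so the first is still chosen even though its weight is lower. On a real cluster the counting step would be distributed across nodes and only the aggregated counts would be needed to score the candidate attributes.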
Authors: CHAI Zhiyuan (柴志远); WANG Xiaoni (王小妮) (School of Applied Sciences, Beijing Information Science & Technology University, Beijing 100192, China)
Source: Journal of Beijing Information Science and Technology University (Natural Science Edition) (《北京信息科技大学学报(自然科学版)》), 2020, No. 6, pp. 14-18 (5 pages)
Funding: National Natural Science Foundation of China (61604014).
Keywords: C4.5 algorithm; weights; HDFS cluster; accuracy; running time; mass data

