An Improved C4.5 Algorithm Based on Local Weighting of the Training Set
Abstract: The C4.5 algorithm builds a decision tree using the information gain ratio, which overcomes the bias toward attributes with many values, and it can handle continuous attributes. However, it is inefficient on large data sets, and it ignores the fact that different samples in the training set lie at different distances from the test data. This paper proposes an improved C4.5 algorithm based on local weighting of the training set. Sample weights are defined according to the Euclidean distance or the Hamming distance and are written back into the training set, so that the recomputed information gain ratio reflects how differences among training samples affect the test data. When processing large data sets, the algorithm sorts the samples by weight and applies a preset threshold to simplify the data set, which reduces computational complexity and improves efficiency.
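The steps named in the abstract (distance-based weights, threshold pruning, and a recomputed gain ratio) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not give the weight formula, so the kernel `1 / (1 + d)` is an assumption, and all function names (`local_weights`, `prune_by_weight`, `weighted_gain_ratio`) are hypothetical. Only discrete attributes are handled here.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Hamming distance: number of positions where the values differ."""
    return sum(x != y for x, y in zip(a, b))

def local_weights(samples, query, distance):
    """Weight each training sample by its closeness to the test instance.
    ASSUMPTION: weight = 1 / (1 + d); the abstract does not specify a formula."""
    return [1.0 / (1.0 + distance(s, query)) for s in samples]

def prune_by_weight(samples, labels, weights, threshold):
    """Sort by weight (descending) and keep samples whose weight >= threshold."""
    ranked = sorted(zip(weights, samples, labels), key=lambda t: t[0], reverse=True)
    kept = [t for t in ranked if t[0] >= threshold]
    return ([s for _, s, _ in kept],
            [l for _, _, l in kept],
            [w for w, _, _ in kept])

def weighted_entropy(labels, weights):
    """Entropy where each sample contributes its weight instead of a count of 1."""
    total = sum(weights)
    mass = defaultdict(float)
    for label, w in zip(labels, weights):
        mass[label] += w
    return -sum((m / total) * math.log2(m / total) for m in mass.values() if m > 0)

def weighted_gain_ratio(samples, labels, weights, attr):
    """Gain ratio of the discrete attribute at index `attr`, using weighted mass."""
    total = sum(weights)
    groups = defaultdict(lambda: ([], []))  # attribute value -> (labels, weights)
    for s, label, w in zip(samples, labels, weights):
        groups[s[attr]][0].append(label)
        groups[s[attr]][1].append(w)
    conditional = 0.0
    split_info = 0.0
    for group_labels, group_weights in groups.values():
        p = sum(group_weights) / total
        conditional += p * weighted_entropy(group_labels, group_weights)
        split_info -= p * math.log2(p)
    gain = weighted_entropy(labels, weights) - conditional
    return gain / split_info if split_info > 0 else 0.0
```

In such a sketch, one would compute the weights for a given test instance, prune the training set with a chosen threshold, and then pick the split attribute with the highest `weighted_gain_ratio` on the remaining samples. The threshold value and the choice between `euclidean` and `hamming` would depend on whether the attributes are numeric or categorical.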
Author: ZHANG Yang-wu (张扬武), Department of Teaching for Science and Technology, China University of Political Science and Law, Beijing 102249, China
Source: Computer Knowledge and Technology (《电脑知识与技术》), 2016, No. 6, pp. 202-204 (3 pages)
Keywords: C4.5; information gain ratio; local weighting; data set; proximity distance
