Abstract
ID3 is a popular and efficient heuristic algorithm for decision tree induction. This paper analyzes the shortcomings of ID3 and proposes an improved version. When selecting a test attribute, the improved algorithm requires not only that the mutual information between the candidate attribute and the class be large, but also that the mutual information between the candidate attribute and the attributes already used at its ancestor nodes be as small as possible. This avoids selecting redundant attributes and achieves a genuine reduction in entropy. While the tree is being built, it is pruned against a pre-specified classification threshold, so that data subsets do not become too small for further splitting to be statistically meaningful. Experimental results indicate that the algorithm constructs better decision trees than ID3.
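The selection rule and the pre-pruning test described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the score used here (relevance minus a weighted maximum redundancy), the `penalty` weight, the `min_samples` threshold, and all function names are assumptions made for illustration only, not the paper's method. Mutual information is estimated on whatever rows are passed in.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two aligned discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def select_attribute(rows, labels, candidates, ancestors, penalty=1.0):
    """Pick the candidate with high I(A;class) and low I(A;ancestor attribute).

    Assumed score (not from the paper): I(A;C) - penalty * max_B I(A;B),
    where B ranges over attributes already used at ancestor nodes.
    """
    def column(name):
        return [row[name] for row in rows]

    def score(name):
        relevance = mutual_information(column(name), labels)
        redundancy = max(
            (mutual_information(column(name), column(b)) for b in ancestors),
            default=0.0,
        )
        return relevance - penalty * redundancy

    return max(candidates, key=score)


def should_stop(labels, min_samples=5):
    """Pre-pruning: stop when the subset is pure or below the size threshold."""
    return len(labels) < min_samples or len(set(labels)) <= 1


# Hypothetical toy data: "a_copy" duplicates "a", so it is redundant once
# "a" has been used higher up in the tree.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
rows = [{"a": x, "a_copy": x, "b": y} for x, y in zip(a, b)]

plain_choice = select_attribute(rows, labels, ["a_copy", "b"], ancestors=[])
aware_choice = select_attribute(rows, labels, ["a_copy", "b"], ancestors=["a"])
```

With no ancestors the rule reduces to ordinary information-gain selection and prefers the more informative `a_copy`; once `a` is listed as an ancestor attribute, the redundancy term outweighs that advantage and `b` is chosen instead.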
Source
Journal of Guizhou University: Natural Sciences (《贵州大学学报(自然科学版)》)
2008, Issue 5, pp. 494-497 (4 pages)
Keywords
ID3; mutual information; pre-pruning