Abstract
Classification is a core problem in data mining and machine learning. To achieve the highest classification accuracy, the choice of the splitting attribute at each node is critical when constructing a decision tree. Common splitting-attribute selection methods include those based on information entropy and the GINI index. This paper analyzes the advantages and disadvantages of the widely used information-entropy-based selection method and proposes a statistical method that applies the chi-squared test to measure the dependence between the condition attributes and the class label when choosing the splitting attribute. The proposed algorithm is evaluated on a real example and on simulated data; the experimental results show that it achieves a lower classification error rate than the entropy-based methods.
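The abstract gives no implementation details, but the general idea of ranking candidate splitting attributes by a chi-squared statistic can be illustrated with a minimal sketch. The snippet below is not the authors' exact procedure: it simply builds the contingency table between each candidate attribute and the class label, computes the chi-squared statistic of independence, and picks the attribute with the largest value. All names (chi_squared, best_split_attribute, the toy weather records) are hypothetical and only for illustration.

```python
from collections import Counter

def chi_squared(rows, attr, label):
    """Chi-squared statistic of independence between one candidate
    splitting attribute and the class label, from a contingency table."""
    observed = Counter((r[attr], r[label]) for r in rows)  # observed cell counts
    attr_totals = Counter(r[attr] for r in rows)           # row marginals
    label_totals = Counter(r[label] for r in rows)         # column marginals
    n = len(rows)
    stat = 0.0
    for a, a_cnt in attr_totals.items():
        for c, c_cnt in label_totals.items():
            expected = a_cnt * c_cnt / n                   # expected count under independence
            diff = observed.get((a, c), 0) - expected
            stat += diff * diff / expected
    return stat

def best_split_attribute(rows, candidates, label="class"):
    """Pick the candidate attribute most strongly associated with the label."""
    return max(candidates, key=lambda a: chi_squared(rows, a, label))

# Toy usage on a few illustrative weather-style records (made-up values)
data = [
    {"outlook": "sunny",    "windy": "false", "class": "no"},
    {"outlook": "sunny",    "windy": "true",  "class": "no"},
    {"outlook": "overcast", "windy": "false", "class": "yes"},
    {"outlook": "rainy",    "windy": "false", "class": "yes"},
    {"outlook": "rainy",    "windy": "true",  "class": "no"},
]
print(best_split_attribute(data, ["outlook", "windy"]))  # -> "outlook"
```

A larger chi-squared value indicates a stronger association between the attribute and the class label, which is why the attribute with the maximum statistic is chosen as the splitting attribute in this sketch.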
Source
Computer Technology and Development (《计算机技术与发展》)
2008, No. 5, pp. 70-72 (3 pages)
Funding
Guangxi Natural Science Foundation (Grant No. 桂科0640069)
Keywords
decision trees
splitting attributes
chi-squared test
information entropy