Abstract
To reduce both the loss of useful information after continuous attributes are discretized and the total number of cut points in the information system, a multi-attribute discretization algorithm with a global clustering effect is proposed. The algorithm decides which attribute should receive the next cut point by evaluating how each attribute's candidate cut point would affect the approximation classification quality of the information system, so the best cut point is selected across all condition attributes. Within an attribute, the position of the best cut point is determined by the Ameva statistic, and the algorithm terminates once the approximation classification quality of the decision table is preserved. The algorithm was tested on several UCI machine learning data sets and compared with several classic discretization algorithms. The results show that C4.5 decision tree classifiers built on data discretized by the proposed algorithm achieve good classification accuracy with fewer nodes.
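The two quantities the abstract relies on can be illustrated with a minimal Python sketch. It assumes the commonly published form of the Ameva coefficient, Ameva(k) = χ²(k) / (k(l − 1)) for k intervals and l classes, and the rough-set approximation classification quality γ = |POS_C(D)| / |U|. The function names (`ameva`, `best_cut`, `approximation_quality`), the use of midpoints as candidate cut points, and the interval boundary convention are illustrative assumptions, not details taken from the paper.

```python
import bisect
from collections import defaultdict


def ameva(values, labels, cuts):
    """Ameva coefficient of one attribute for a sorted list of cut points.

    Assumes at least two decision classes. A value v falls in interval j
    when exactly j cut points are strictly below it (assumed convention).
    """
    classes = sorted(set(labels))
    k = len(cuts) + 1                      # number of intervals
    l = len(classes)                       # number of decision classes
    n = {c: [0] * k for c in classes}      # contingency counts n[class][interval]
    for v, y in zip(values, labels):
        n[y][bisect.bisect_left(cuts, v)] += 1
    total = len(values)
    row = {c: sum(n[c]) for c in classes}
    col = [sum(n[c][j] for c in classes) for j in range(k)]
    s = sum(n[c][j] ** 2 / (row[c] * col[j])
            for c in classes for j in range(k) if col[j] > 0)
    chi2 = total * (s - 1)
    return chi2 / (k * (l - 1))


def best_cut(values, labels):
    """Pick the single midpoint cut with the largest Ameva coefficient."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda c: ameva(values, labels, [c]))


def approximation_quality(rows, labels):
    """gamma = |POS_C(D)| / |U|: share of objects whose discretized
    condition vector occurs with exactly one decision value."""
    decisions = defaultdict(set)
    counts = defaultdict(int)
    for r, y in zip(rows, labels):
        key = tuple(r)
        decisions[key].add(y)
        counts[key] += 1
    consistent = sum(counts[key] for key in decisions if len(decisions[key]) == 1)
    return consistent / len(rows)
```

In a loop of the kind the abstract describes, one would presumably call something like `best_cut` per attribute, insert the cut that most improves `approximation_quality`, and stop when that quality matches the value of the original decision table; the exact selection rule is the paper's and is not reproduced here.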
Source
Journal of Xi'an Jiaotong University (《西安交通大学学报》)
Indexed in: EI, CAS, CSCD, PKU Core Journals (北大核心)
2011, No. 9, pp. 1-5 (5 pages)
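Funding
National Natural Science Foundation of China (51105296)
Open Project of the State Key Laboratory for Manufacturing Systems Engineering
Fundamental Research Funds for the Central Universities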
Keywords
statistics
continuous attributes
discretization