Abstract
Many data mining and inductive learning methods can handle only discrete attributes, so continuous attributes must be discretized first. This paper proposes a data discretization method based on the statistical correlation coefficient. Drawing on statistical correlation theory, the method captures the interdependence between attributes and the target class in order to select the best cut points. In addition, the Variable Precision Rough Set (VPRS) model is incorporated into the discretization process to control information loss. The method was applied to breast cancer diagnosis data and to data sets from other domains. The experimental results show that it significantly improves the classification accuracy of the See5 decision tree.
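The idea of choosing a cut point by its correlation with the class can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a binary class label, scores each candidate cut by the absolute Pearson correlation between the indicator "value > cut" and the label, and leaves out the VPRS-based control of information loss entirely. The names `pearson` and `best_cut_point` are hypothetical and chosen only for this illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_cut_point(values, labels):
    """Pick one cut point for a continuous attribute (illustrative assumption only).

    Candidates are midpoints between consecutive distinct values. Each candidate
    is scored by |correlation| between the binary indicator (value > cut) and
    the 0/1 class label. Returns (cut, score), or (None, 0.0) when the attribute
    has fewer than two distinct values.
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best_cut, best_score = None, -1.0
    for cut in candidates:
        indicator = [1 if v > cut else 0 for v in values]
        score = abs(pearson(indicator, labels))
        if score > best_score:  # the first candidate wins ties
            best_cut, best_score = cut, score
    if best_cut is None:
        return None, 0.0
    return best_cut, best_score


if __name__ == "__main__":
    # Toy data: the two classes are cleanly separated between 3.0 and 10.0.
    values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    labels = [0, 0, 0, 1, 1, 1]
    print(best_cut_point(values, labels))
```

A full discretizer would apply such a criterion repeatedly to obtain several intervals and would need a stopping rule. The abstract assigns the control of information loss to the VPRS model, which this sketch does not attempt to reproduce.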
Source
Journal of Computer Applications (《计算机应用》)
CSCD
PKU Core Journal (北大核心)
2011, Issue 5, pp. 1409-1412 (4 pages)
Keywords
discretization
data mining
Class-Attribute Interdependence (CAI)
Variable Precision Rough Set (VPRS)
decision tree