Abstract
Discretization of continuous data is an important preprocessing step for classification methods in data mining. This paper presents a balanced discretization method based on the minimum description length (MDL) principle. Using MDL theory, it defines a balanced discretization function that captures the trade-off between the number of discretized intervals and the classification error. Building on this function, it proposes an effective heuristic algorithm that searches for the best breakpoint sequence. Simulation results show that the proposed algorithm yields good classification and learning performance with the C5.0 decision tree and the naive Bayes classifier.
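The abstract does not give the balanced function or the heuristic search, so the sketch below is not the paper's algorithm. As a point of reference only, it illustrates the classic entropy-based MDL stopping criterion of Fayyad and Irani, a well-known baseline for MDL-driven discretization: a boundary is accepted only when its information gain exceeds the description-length cost of encoding the split. All function names (`entropy`, `mdl_accepts`, `best_split`, `mdlp_cut_points`) are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mdl_accepts(labels, left, right):
    """Fayyad-Irani MDL test: accept the split only if gain exceeds its coding cost."""
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    ent, ent1, ent2 = entropy(labels), entropy(left), entropy(right)
    gain = ent - (len(left) / n) * ent1 - (len(right) / n) * ent2
    delta = math.log2(3 ** k - 2) - (k * ent - k1 * ent1 - k2 * ent2)
    return gain > (math.log2(n - 1) + delta) / n


def best_split(values, labels):
    """Index of the boundary with minimum weighted entropy (values must be sorted)."""
    n = len(values)
    best = None
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue  # a cut cannot separate equal values
        w = (i / n) * entropy(labels[:i]) + ((n - i) / n) * entropy(labels[i:])
        if best is None or w < best[1]:
            best = (i, w)
    return None if best is None else best[0]


def mdlp_cut_points(values, labels):
    """Recursively split one attribute; return the sorted list of accepted cut points."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    cuts = []

    def recurse(lo, hi):
        sub_x, sub_y = xs[lo:hi], ys[lo:hi]
        if len(sub_x) < 2:
            return
        i = best_split(sub_x, sub_y)
        if i is None or not mdl_accepts(sub_y, sub_y[:i], sub_y[i:]):
            return
        cuts.append((sub_x[i - 1] + sub_x[i]) / 2)  # midpoint between neighbors
        recurse(lo, lo + i)
        recurse(lo + i, hi)

    recurse(0, len(xs))
    return sorted(cuts)


if __name__ == "__main__":
    # Toy data (made up for illustration): two well-separated classes.
    values = [1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 18]
    labels = ["a"] * 8 + ["b"] * 8
    print(mdlp_cut_points(values, labels))  # → [9.5]
```

On the toy data the single boundary between the two classes is accepted, while a pure interval is never split further because its gain is zero. A balanced function of the kind the abstract describes would presumably replace `mdl_accepts` and `best_split` with its own scoring, but that substitution is an assumption, not something the abstract confirms.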
Source
《计算机工程与科学》
CSCD
PKU Core Journal (北大核心)
2011, No. 12, pp. 130-135 (6 pages)
Computer Engineering & Science
Funding
Supported by the Yibin University research fund (project 2010Z10)
Keywords
discretization
data mining
minimum description length (MDL)
balanced function