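Abstract
Real-world applications often involve many continuous numeric attributes, whereas many current machine learning algorithms require the attributes they process to take discrete values. Based on basic principles of information theory, this paper proposes a new supervised discretization algorithm, WILD, which can be regarded as an extension of the decision tree discretization algorithm. Its main improvement is that it takes into account the frequency of observations within each interval and uses weighted information loss as the measure for interval discretization, so as to overcome the unbalanced discretization produced by the decision tree algorithm. The algorithm quite naturally adopts a bottom-up interval merging scheme and can merge several adjacent intervals at once, which helps speed up discretization. Experimental results show that the algorithm can improve the accuracy of machine learning algorithms.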
Many existing machine learning algorithms expect attributes to be discrete. However, discretizing attributes can be difficult even for domain experts. This paper proposes a new discretization algorithm called WILD, which stands for Weighted Information Loss Discretization. The algorithm can be considered an extended counterpart of the Decision Tree Discretization algorithm. First, WILD assumes that the attribute A to be discretized is ordinal, and that initial intervals can be formed from the distinct values of the attribute in the original data set, so that each initial interval contains exactly one attribute value. Second, WILD uses a bottom-up paradigm, as in the ChiMerge algorithm. Starting from the initial intervals, WILD repeatedly calculates a measure for every group of m adjacent intervals (m is a user-specified parameter) and merges the group with the lowest measure, until a stopping criterion is satisfied. Third, the measure in WILD reflects the damage caused by merging each group of m adjacent intervals. The main improvement in WILD lies in the fact that weighted information loss is used as the measure, as opposed to information gain in Decision Tree Discretization, and this adaptation seems more natural and easier to implement in a bottom-up paradigm than in a top-down paradigm. It should be noted that if the measure used when merging is information loss, and the number of adjacent intervals to merge is set to 2, WILD can be thought of as the counterpart of the Decision Tree Discretization algorithm. Indeed, the Decision Tree Discretization algorithm tries to split intervals where much information can be gained, whereas WILD tries to merge adjacent intervals where little information is lost. WILD has two advantages. First, it can improve the speed of discretization, since it can merge several intervals at a time rather than just two. Second, it uses weighted information loss to overcome the deficiencies of the Decision Tree Discretization algorithm. To evaluate the performance of WILD, both WILD and the Decision Tree Discretization algorithm are implemented as a preprocessing step for a Naive Bayes classifier, so that the prediction accuracy of this classifier reflects the relative performance of the two discretization methods. The empirical results indicate that WILD is a promising discretization algorithm.
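The bottom-up merging loop described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formula for weighted information loss or the stopping criterion, so both are assumptions here: the loss of merging a group is taken to be the increase in class entropy caused by the merge, weighted by the fraction of all observations that fall in the merged interval, and merging stops once a target number of intervals (the hypothetical parameter `max_intervals`) is reached. The function names `entropy`, `weighted_info_loss`, and `wild` are illustrative and not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a mapping from class label to count."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)


def weighted_info_loss(group, n_total):
    """Assumed measure: entropy increase from merging `group`, weighted by
    the share of all observations that the merged interval would contain."""
    merged = sum(group, Counter())
    n_merged = sum(merged.values())
    # Average entropy of the separate intervals, weighted by their sizes.
    before = sum(sum(c.values()) / n_merged * entropy(c) for c in group)
    return (n_merged / n_total) * (entropy(merged) - before)


def wild(values, labels, m=2, max_intervals=3):
    """Bottom-up discretization sketch; returns a list of (low, high) ranges.

    `max_intervals` is an assumed stopping criterion, not the paper's.
    """
    if m < 2:
        raise ValueError("m must be at least 2")
    # One initial interval per distinct attribute value, in sorted order.
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, Counter())[y] += 1
    bounds = [(v, v) for v in sorted(by_value)]
    counts = [by_value[v] for v, _ in bounds]
    n_total = len(values)

    while len(bounds) > max_intervals:
        # Shrink the group size near the end so we never drop below the target.
        k = min(m, len(bounds) - max_intervals + 1)
        # Choose the group of k adjacent intervals with the lowest measure.
        start = min(
            range(len(bounds) - k + 1),
            key=lambda i: weighted_info_loss(counts[i:i + k], n_total),
        )
        merged_counts = sum(counts[start:start + k], Counter())
        merged_bounds = (bounds[start][0], bounds[start + k - 1][1])
        counts[start:start + k] = [merged_counts]
        bounds[start:start + k] = [merged_bounds]
    return bounds
```

For example, `wild([1, 2, 3, 4, 5, 6], list("aaabbb"), m=2, max_intervals=2)` merges the class-pure neighbours first and leaves the cut between 3 and 4 in place. Setting `m=3` reaches the same two intervals in fewer merge steps, which is the speed advantage the abstract attributes to merging several intervals at a time.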
Source
《南京大学学报(自然科学版)》
CAS
CSCD
PKU Core Journal
2001, No. 2, pp. 148-153 (6 pages)
Journal of Nanjing University(Natural Science)
Funding
National Natural Science Foundation of China (69873031)