
WILD: A Discretization Algorithm Based on Weighted Information-Loss
Cited by: 8
Abstract: Real-world applications often involve many continuous numeric attributes, whereas many current machine learning algorithms require the attributes they process to take discrete values. Based on the basic principles of information theory, this paper proposes a new supervised discretization algorithm, WILD, which can be viewed as an extension of the decision tree discretization algorithm. Its main improvement is that it takes into account the frequency of observations within each interval and uses weighted information loss as the measure for interval discretization, thereby overcoming the unbalanced discretization produced by the decision tree algorithm. The algorithm naturally adopts a bottom-up interval merging scheme that can merge several adjacent intervals at a time, which helps speed up discretization. Experimental results show that the algorithm improves the accuracy of machine learning algorithms.

English abstract: Many existing machine learning algorithms expect the attributes to be discrete. However, discretization of attributes may be difficult even for a domain expert. This paper proposes a new discretization algorithm called WILD, which stands for Weighted Information Loss Discretization. The algorithm can be considered an extended counterpart of the Decision Tree Discretization algorithm. First, WILD assumes that the attribute A to be discretized is ordinal, and initial intervals are formed from the distinct values of the attribute in the original data set, so that each initial interval contains exactly one attribute value. Second, WILD uses a bottom-up paradigm, as in the ChiMerge algorithm. Starting from the initial intervals, WILD repeatedly computes a measure for every group of m adjacent intervals (m is a user-specified parameter) and merges the group with the lowest measure, until a stopping criterion is satisfied. Third, the measure in WILD reflects the damage associated with merging each group of m adjacent intervals. The main improvement of WILD lies in the fact that weighted information loss is used as the measure, as opposed to the information gain used in Decision Tree Discretization; this adaptation is more natural and easier to implement in a bottom-up paradigm than in a top-down one. Note that if the measure used when merging is information loss and the number of adjacent intervals merged at a time is set to 2, WILD can be regarded as the counterpart of the Decision Tree Discretization algorithm. In effect, Decision Tree Discretization tries to split intervals where much information can be gained, whereas WILD tries to merge adjacent intervals where little information is lost. WILD has two advantages. First, it can speed up discretization, since it can merge several intervals at a time rather than just two. Second, it uses weighted information loss to overcome the deficiencies of the Decision Tree Discretization algorithm. To evaluate the performance of WILD, both WILD and the decision tree discretization algorithm were implemented as a preprocessing step for a Naive Bayes classifier, so the prediction accuracy of the classifier reflects the relative performance of the two discretization methods. The empirical results indicate that WILD is a promising discretization algorithm.
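The abstract describes the overall procedure (one initial interval per distinct value, bottom-up merging of m adjacent intervals by lowest weighted information loss) but does not spell out the exact loss formula or the stopping criterion. The Python sketch below is therefore only one plausible reading, not the authors' implementation: it takes the loss of a merge to be the class entropy of the merged interval minus the count-weighted entropies of its constituents, and it stops at a hypothetical max_intervals limit. All function names and these two choices are illustrative assumptions.

import math
from collections import Counter

def entropy(class_counts):
    # Shannon entropy (in bits) of a dictionary of class counts.
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts.values() if c > 0)

def weighted_information_loss(group):
    # Assumed merge measure: information lost when the intervals in `group`
    # are merged, weighted by the number of observations they hold.
    # (The paper's exact formula may differ.)
    merged = Counter()
    for counts in group:
        merged.update(counts)
    loss = sum(merged.values()) * entropy(merged)
    for counts in group:
        loss -= sum(counts.values()) * entropy(counts)
    return loss

def wild_discretize(values, labels, m=2, max_intervals=5):
    # Bottom-up discretization in the spirit of WILD.
    #   values        -- numeric attribute values
    #   labels        -- class labels aligned with `values`
    #   m             -- number of adjacent intervals merged per step
    #   max_intervals -- illustrative stopping criterion (not from the paper)
    # Returns the cut points between the final intervals.

    # 1. One initial interval per distinct attribute value, in sorted order.
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, Counter())[y] += 1
    intervals = [(v, v, by_value[v]) for v in sorted(by_value)]  # (low, high, counts)

    # 2. Repeatedly merge the group of m adjacent intervals whose merge
    #    causes the smallest weighted information loss.
    while len(intervals) > max_intervals and len(intervals) >= m:
        best_i, best_loss = None, None
        for i in range(len(intervals) - m + 1):
            loss = weighted_information_loss([c for _, _, c in intervals[i:i + m]])
            if best_loss is None or loss < best_loss:
                best_i, best_loss = i, loss
        merged_counts = Counter()
        for _, _, c in intervals[best_i:best_i + m]:
            merged_counts.update(c)
        intervals[best_i:best_i + m] = [
            (intervals[best_i][0], intervals[best_i + m - 1][1], merged_counts)
        ]

    # Cut points lie midway between consecutive intervals.
    return [(intervals[i][1] + intervals[i + 1][0]) / 2
            for i in range(len(intervals) - 1)]

For example, wild_discretize([1, 2, 3, 8, 9, 10], ['a', 'a', 'a', 'b', 'b', 'b'], m=2, max_intervals=2) first merges the pure same-class intervals at zero loss and leaves a single cut point at 5.5; the resulting cut points could then be used to discretize the attribute before training a Naive Bayes classifier, mirroring the evaluation setup described in the abstract.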
Source: Journal of Nanjing University (Natural Science), 2001, No. 2, pp. 148-153 (6 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Fund: National Natural Science Foundation of China (69873031)
Keywords: machine learning, discretization, weighted information loss, WILD, decision tree, supervised algorithm, entropy

References (5)

  • [1] Catlett J. On Changing Continuous Attributes into Ordered Discrete Attributes. Proceedings of the European Working Session on Learning (EWSL-91), LNAI 482. Berlin: Springer-Verlag, 1991: 164-178.
  • [2] Dougherty J, Kohavi R, Sahami M. Supervised and unsupervised discretization of continuous features. In: Prieditis A, ed. Machine Learning: Proceedings of the 12th International Conference. San Mateo: Morgan Kaufmann Publishers, 1995: 194-202.
  • [3] Fayyad U, Irani K. Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence. San Mateo: Morgan Kaufmann Publishers, 1993: 1022-1027.
  • [4] Kerber R C. Discretization of Numeric Attributes. Proceedings of the 10th National Conference on Artificial Intelligence. MIT Press, 1992: 123-128.
  • [5] Kohavi R. MLC++: A Machine Learning Library in C++. Tools with Artificial Intelligence. IEEE Computer Society Press, 1994: 740-743.
