Abstract
This paper presents PIE, a discretization method for real-valued attributes based on probability and information entropy that takes into account the differences among the candidate pairs of intervals to be merged. The method uses information entropy to measure the similarity of adjacent intervals, accounts for the effect of interval size and of the number of classes in each interval on learning accuracy, and derives the measures for these two factors probabilistically. Simulation results show that PIE yields better classification and learning accuracy with the See5/C5.0 classifier and has been applied successfully to tumor diagnosis.
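The abstract does not give PIE's actual merging criterion, so the following is only a minimal sketch of the general idea it describes: bottom-up discretization that repeatedly merges the pair of adjacent intervals whose class distributions are most similar, with similarity measured by information entropy. The function names (`entropy`, `merge_cost`, `discretize`), the stopping parameter `max_intervals`, and the cost itself (the entropy increase caused by a merge) are assumptions made for illustration; the probabilistic weighting for interval size and class count mentioned in the abstract is not modeled here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels inside one interval."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def merge_cost(left, right):
    """Entropy increase caused by merging two adjacent intervals.

    Assumed stand-in criterion, not the PIE measure: a small value means
    the two intervals have similar class distributions.
    """
    merged = left + right
    n = len(merged)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(merged) - weighted


def discretize(values, labels, max_intervals):
    """Merge the cheapest adjacent pair until max_intervals remain."""
    # Start with one interval per distinct attribute value.
    intervals = []  # each entry: [low, high, list_of_labels]
    for v, y in sorted(zip(values, labels)):
        if intervals and intervals[-1][1] == v:
            intervals[-1][2].append(y)
        else:
            intervals.append([v, v, [y]])

    while len(intervals) > max_intervals:
        costs = [
            merge_cost(intervals[i][2], intervals[i + 1][2])
            for i in range(len(intervals) - 1)
        ]
        i = costs.index(min(costs))  # first cheapest adjacent pair
        low, _, a = intervals[i]
        _, high, b = intervals[i + 1]
        intervals[i:i + 2] = [[low, high, a + b]]

    return [(low, high) for low, high, _ in intervals]
```

For example, `discretize([1, 2, 3, 10, 11, 12], list("aaabbb"), 2)` keeps the two pure groups apart, because merging across the class boundary is the only step that raises entropy.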
Source
《微型机与应用》
2011, No. 15, pp. 68-70, 77 (4 pages)
Microcomputer & Its Applications
Keywords
discretization
data mining
probability
information entropy