Abstract
Many business intelligence algorithms operate on discretized data, yet the data encountered in business analysis come in mixed types, so discretizing continuous attributes is an important part of data preprocessing. By analyzing the distribution characteristics of continuous attributes and how different classes are distributed within the same attribute, this paper proposes an unsupervised discretization method for continuous attributes based on normal distribution characteristics. It then studies the relationship between the classification accuracy on test data and the number of cut-points used when preprocessing continuous attributes with this method, and determines a statistically reasonable number of cut-points for discretizing the continuous data. Numerical comparison experiments show that the proposed discretization method can improve classification accuracy on the datasets to some extent.
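The abstract does not spell out how the cut-points are placed, so the following is only an illustrative sketch of one way a normal-distribution-based unsupervised discretization could look, not the authors' procedure. It assumes (hypothetically) that a normal distribution is fitted to each attribute and that the cut-points are the equal-probability quantiles of that fitted distribution; the function names `normal_cut_points` and `discretize` and the sample values are invented for illustration.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, stdev


def normal_cut_points(values, n_cuts):
    """Fit a normal distribution to `values` and return `n_cuts` cut-points.

    Assumption (not from the paper): cut-points are the quantiles that split
    the fitted normal distribution into n_cuts + 1 equal-probability bins.
    """
    fitted = NormalDist(mean(values), stdev(values))
    return [fitted.inv_cdf(i / (n_cuts + 1)) for i in range(1, n_cuts + 1)]


def discretize(values, cut_points):
    """Map each continuous value to the index of the interval it falls in."""
    return [bisect_right(cut_points, v) for v in values]


# Hypothetical sample of one continuous attribute.
values = [4.9, 5.1, 5.4, 5.8, 6.0, 6.3, 6.7, 7.1]
cuts = normal_cut_points(values, 3)
labels = discretize(values, cuts)
print(cuts)
print(labels)
```

No class labels are used, which is what makes the sketch unsupervised. Varying `n_cuts` and re-running a classifier on the discretized data would be one way to explore the accuracy versus cut-point trade-off the abstract refers to.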
Source
Science and Management (《科学与管理》)
2009, Issue 6X, pp. 5-8 (4 pages)
Keywords
Normal distribution
Continuous attribute
Discretization
Data mining