Abstract
Selective classifiers improve the accuracy and efficiency of classification by deleting irrelevant or redundant attributes from a data set. Although several selective classifiers have been proposed, most of them handle only complete data, because processing incomplete data is complex. Real-world data sets, however, are often incomplete and contain many redundant or irrelevant attributes. As with complete data, irrelevant or redundant attributes in an incomplete data set can sharply reduce the accuracy of a classifier built on it. Constructing selective classifiers for incomplete data is therefore an important research problem. Based on an analysis of the main methods for handling incomplete data in classification, two selective Bayes classifiers for incomplete data, SRBC and CBSRBC, are presented. SRBC is built on a robust Bayes classifier, while CBSRBC extends SRBC with the chi-squared (χ²) statistic. Experiments on twelve benchmark incomplete data sets show that both algorithms greatly reduce the number of attributes while significantly improving the accuracy and stability of classification. Overall, CBSRBC is more efficient and more accurate than SRBC, whereas SRBC requires fewer thresholds to be specified in advance.
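As a rough illustration of how a chi-squared statistic can score attributes when some values are missing, the sketch below computes Pearson's χ² between each attribute and the class using only the cases where that attribute is observed. This is not the CBSRBC algorithm from the paper: the abstract does not say how missing values enter the statistic, so skipping them is an assumption, and the names `chi2_observed` and `rank_attributes` and the use of `None` as the missing marker are hypothetical.

```python
from collections import Counter


def chi2_observed(values, labels):
    """Pearson chi-squared statistic between one attribute and the class.

    Assumption: cases whose attribute value is None (missing) are skipped.
    """
    pairs = [(v, c) for v, c in zip(values, labels) if v is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    value_counts = Counter(v for v, _ in pairs)
    class_counts = Counter(c for _, c in pairs)
    joint_counts = Counter(pairs)
    stat = 0.0
    for v, nv in value_counts.items():
        for c, nc in class_counts.items():
            expected = nv * nc / n  # count expected under independence
            observed = joint_counts[(v, c)]
            stat += (observed - expected) ** 2 / expected
    return stat


def rank_attributes(rows, labels):
    """Return attribute indices sorted by descending chi-squared score."""
    n_attrs = len(rows[0])
    scores = [
        chi2_observed([row[j] for row in rows], labels) for j in range(n_attrs)
    ]
    return sorted(range(n_attrs), key=lambda j: scores[j], reverse=True)


# Toy data: attribute 0 tracks the class closely, attribute 1 barely does.
rows = [["a", "a"], ["a", "b"], ["b", "a"], ["b", "b"], [None, "a"]]
labels = ["x", "x", "y", "y", "x"]
ranking = rank_attributes(rows, labels)
```

A selective classifier could then keep only the top-ranked attributes, or those whose score exceeds a chosen threshold, before training; how the paper makes that choice is not stated in the abstract.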
Source
Journal of Computer Research and Development (《计算机研究与发展》)
EI
CSCD
PKU Core Journal
2007, No. 8, pp. 1324-1330 (7 pages)
Funding
National Natural Science Foundation of China (Grant Nos. 60503017, 60673089)
Keywords
Bayesian method
classification
feature selection
incomplete data
chi-squared statistics