Abstract
Feature selection (FS) plays an important role in text categorization (TC). Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are widely applied in text categorization. Existing experiments show that IG, which is based on Shannon's information theory, is one of the most effective methods. This paper proposes a new feature selection method based on rough set theory. Rough set theory defines knowledge about a universe of objects as classifications based on certain properties of the objects; that is, knowledge is the ability to partition objects. We quantify this ability to classify objects and call the resulting amount the knowledge quantity. Building on this quantification, we introduce the notion of knowledge gain and derive a knowledge-gain-based feature selection method (the KG method). Classification experiments on two widely used corpora, OHSUMED and NewsGroup, show that KG outperforms IG on both, especially under extremely aggressive reduction, when the feature space is cut down to a low dimension.
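For readers unfamiliar with the IG baseline named in the abstract, the following is a minimal sketch of information-gain feature ranking for text, assuming each document is represented as a set of tokens and each term is treated as a binary present/absent feature. The function names (`entropy`, `information_gain`, `select_features`) and the toy data are illustrative assumptions, not taken from the paper. The knowledge-gain score itself is not defined in this record, so it is not sketched here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)].

    docs: list of token sets (assumed representation), labels: parallel list.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_features(docs, labels, k):
    """Return the k terms with the highest IG (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Hypothetical toy corpus: two "med" and two "sport" documents.
docs = [
    {"heart", "patient"},
    {"heart", "surgery"},
    {"game", "team"},
    {"team", "patient"},
]
labels = ["med", "med", "sport", "sport"]

top_terms = select_features(docs, labels, 2)
```

In this toy corpus, "heart" and "team" each split the classes perfectly and so rank first, while "patient" appears equally in both classes and contributes no gain. Keeping only the top-k terms is the kind of aggressive reduction the abstract refers to.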
Source
Journal of Chinese Information Processing (《中文信息学报》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2008, Issue 1, pp. 44-50 (7 pages)
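Funding
National Basic Research Program of China (973 Program) (2004CB318109)
National Natural Science Foundation of China (60473002, 60603094)
Beijing Natural Science Foundation (4051004)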
Keywords
computer application
Chinese information processing
feature selection
text categorization
rough set
information retrieval