
Knowledge Gain: A New Feature Selection Method in Text Categorization (cited by: 6)
Abstract: Feature selection (FS) plays an important role in text categorization (TC). Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are widely applied in text categorization. Existing experiments show that IG, which is grounded in Shannon's information theory, is one of the most effective methods. This paper proposes a new feature selection method based on rough set theory. In rough set theory, knowledge about a universe of objects is defined through classifications based on properties of the objects; that is, knowledge is the ability to partition objects. We quantify this ability, call the resulting amount knowledge quantity, and from it derive the notion of knowledge gain and a knowledge-gain-based feature selection method (the KG method). Classification experiments on two widely used corpora, OHSUMED and NewsGroup, show that KG outperforms IG on both, especially under extremely aggressive reduction, when the feature space is cut to a low dimension.
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journal, 2008, No. 1, pp. 44-50 (7 pages)
Funding: National 973 Program of China (2004CB318109); National Natural Science Foundation of China (60473002, 60603094); Beijing Natural Science Foundation (4051004)
Keywords: computer application; Chinese information processing; text categorization; feature selection; rough set; information retrieval
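
The abstract compares KG against information gain (IG) but does not state either formula. As a point of reference, the IG baseline can be sketched as below, assuming the standard textbook definition: the class entropy minus the entropy that remains after splitting documents on whether a term occurs. The toy documents, labels, and the helper names `entropy`, `information_gain`, and `select_top_k` are illustrative assumptions, not taken from the paper, and the KG method itself is not reproduced here because its definition does not appear in the abstract.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document'.

    docs is a list of sets of terms; labels holds one class per document.
    """
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    # Weighted entropy of the two groups; an empty group contributes nothing.
    remaining = sum(len(part) / len(labels) * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remaining


def select_top_k(docs, labels, k):
    """Keep the k terms with the highest IG (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary,
                    key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Hypothetical toy corpus: two medical and two sports documents.
docs = [
    {"heart", "patient"},
    {"heart", "surgery"},
    {"game", "team"},
    {"team", "patient"},
]
labels = ["med", "med", "sport", "sport"]

print(select_top_k(docs, labels, 2))
```

In this toy corpus "heart" and "team" each separate the two classes perfectly, while "patient" occurs once in each class and carries no information, so aggressive reduction to two features keeps the first two and discards the rest.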