Feature Selection Method Combining Optimized Document Frequency with Variable Precision Rough Sets

Cited by: 1
Abstract: In text categorization, the feature space typically contains tens of thousands of dimensions, often far exceeding the number of available training samples. To speed up text mining algorithms, reduce memory consumption, and filter out irrelevant or weakly relevant features, a feature selection algorithm is indispensable. This paper first presents a document frequency method based on minimum word frequency, then introduces variable precision rough sets and proposes an attribute reduction algorithm based on information entropy, and finally combines the two into a comprehensive feature selection algorithm. The comprehensive algorithm first applies the minimum-word-frequency document frequency method to select features and then applies the attribute reduction algorithm to eliminate redundancy, yielding a more representative feature subset. Experimental results show that the proposed algorithm outperforms the three best-known classical feature selection measures: mutual information, the χ² (chi-square) statistic, and document frequency.
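For readers who want a concrete picture of the two-stage pipeline described in the abstract, the sketch below gives one plausible reading in Python: a document frequency filter that counts a term toward a document only when it appears at least a minimum number of times in it, followed by a greedy, conditional-entropy-driven attribute reduction in which a tolerance parameter loosely stands in for the variable precision threshold. This is an illustrative sketch only, not the authors' implementation; the function names, the parameters min_tf, min_df, and beta, and the greedy removal loop are assumptions.

```python
# Illustrative sketch only; not the algorithm from the paper.
from collections import Counter, defaultdict
from math import log2

def select_by_min_tf_df(docs, min_tf=2, min_df=3):
    # Step 1 (assumed reading): a term counts toward document frequency only if
    # it occurs at least min_tf times in that document; terms whose resulting
    # document frequency is below min_df are filtered out.
    df = Counter()
    for tokens in docs:
        tf = Counter(tokens)
        for term, count in tf.items():
            if count >= min_tf:
                df[term] += 1
    return {t for t, d in df.items() if d >= min_df}

def conditional_entropy(rows, labels):
    # H(label | features): weighted entropy of the labels inside each
    # equivalence class induced by identical feature-value vectors.
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[tuple(row)].append(y)
    n = len(labels)
    h = 0.0
    for ys in groups.values():
        p_block = len(ys) / n
        counts = Counter(ys)
        h -= p_block * sum((c / len(ys)) * log2(c / len(ys)) for c in counts.values())
    return h

def greedy_entropy_reduction(rows, labels, features, beta=0.0):
    # Step 2 (assumed reading): greedily drop a feature if its removal raises
    # the conditional entropy by at most beta; beta is a crude stand-in for the
    # variable-precision tolerance, and what remains approximates a reduct.
    kept = list(features)
    base = conditional_entropy([[r[f] for f in kept] for r in rows], labels)
    for f in list(kept):
        trial = [x for x in kept if x != f]
        h = conditional_entropy([[r[t] for t in trial] for r in rows], labels)
        if h - base <= beta:
            kept, base = trial, h
    return kept

# Hypothetical usage (docs: list of token lists, labels: parallel class labels):
#   terms = sorted(select_by_min_tf_df(docs))
#   rows = [{t: int(t in set(tokens)) for t in terms} for tokens in docs]
#   reduct = greedy_entropy_reduction(rows, labels, terms, beta=0.05)
```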
Authors: 朱颢东 (Zhu Haodong), 钟勇 (Zhong Yong)
Source: Journal of Henan University (Natural Science Edition) (《河南大学学报(自然科学版)》), CAS, PKU Core, 2009, No. 5, pp. 515-520 (6 pages)
Funding: Science and Technology Plan Project of Sichuan Province (2008GZ0003); Science and Technology Key Research Project of the Sichuan Provincial Department of Science and Technology (07GG006-014)
Keywords: feature selection; minimum word frequency; document frequency; variable precision rough set; information entropy; attribute reduction