Abstract
In text categorization, the feature space commonly has tens of thousands of dimensions, often far more than the number of available training samples. Feature selection is therefore needed to speed up text mining algorithms, reduce memory use, and filter out irrelevant or weakly relevant features. This paper first presents a document frequency method based on minimum word frequency. It then introduces variable precision rough sets and proposes an attribute reduction algorithm based on information entropy. Finally, it combines the two into a comprehensive feature selection algorithm: the document frequency method based on minimum word frequency selects features first, and the attribute reduction algorithm then eliminates redundancy, yielding a more representative feature subset. Experimental results show that the comprehensive algorithm outperforms three of the best conventional feature selection methods: mutual information, the chi-square statistic, and document frequency.
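The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract gives no formulas, so the function names, the thresholds `min_tf` and `min_df`, and the toy corpus are all hypothetical. In particular, the second stage here uses plain conditional entropy with a greedy search as a stand-in, and does not implement variable precision rough sets.

```python
from collections import Counter
from math import log2


def df_with_min_tf(docs, min_tf=2, min_df=2):
    """Stage 1 (assumed form): a document counts toward a term's document
    frequency only if the term occurs at least min_tf times in it."""
    df = Counter()
    for tokens in docs:
        tf = Counter(tokens)
        df.update(t for t, n in tf.items() if n >= min_tf)
    return sorted(t for t, n in df.items() if n >= min_df)


def conditional_entropy(rows, labels):
    """H(label | feature pattern); rows are hashable feature tuples."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row, Counter())[y] += 1
    total = len(labels)
    h = 0.0
    for counts in groups.values():
        size = sum(counts.values())
        for c in counts.values():
            h -= (size / total) * (c / size) * log2(c / size)
    return h


def greedy_reduct(docs, labels, terms, tol=1e-9):
    """Stage 2 (stand-in): add terms greedily until the selected subset
    gives the same conditional entropy as the full term set."""
    present = [set(d) for d in docs]

    def project(selected):
        return [tuple(t in p for t in selected) for p in present]

    target = conditional_entropy(project(terms), labels)
    selected, remaining = [], list(terms)
    while remaining and conditional_entropy(project(selected), labels) > target + tol:
        best = min(
            remaining,
            key=lambda t: conditional_entropy(project(selected + [t]), labels),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy corpus (hypothetical): two classes, with redundant indicator terms.
docs = [
    ["ball", "ball", "goal", "goal", "team"],
    ["ball", "ball", "goal", "goal", "win"],
    ["vote", "vote", "law", "law", "team"],
    ["vote", "vote", "law", "law", "win"],
]
labels = ["sport", "sport", "politics", "politics"]

kept = df_with_min_tf(docs, min_tf=2, min_df=2)
reduct = greedy_reduct(docs, labels, kept)
print(kept)    # ['ball', 'goal', 'law', 'vote']
print(reduct)  # ['ball']
```

In this toy case the first stage drops "team" and "win" because they never reach the minimum word frequency within a document, and the second stage keeps a single term because the others add no further information about the class.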
Source
Journal of Henan University (Natural Science) [《河南大学学报(自然科学版)》]
CAS
Peking University Core Journals (北大核心)
2009, Issue 5, pp. 515-520 (6 pages)
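Funding
Sichuan Province Science and Technology Plan Project (2008GZ0003)
Science and Technology Research Project of the Sichuan Provincial Department of Science and Technology (07GG006-014)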
Keywords
feature selection
minimum word frequency
document frequency
variable precision rough set
information entropy
attribute reduction