Abstract
Feature selection is a core research topic in text categorization. This paper first presents a document frequency method based on minimum word frequency. It then introduces rough sets and proposes an attribute reduction algorithm. Finally, it combines the attribute reduction algorithm with the document frequency method based on minimum word frequency to obtain a comprehensive feature selection method. The comprehensive method first applies the document frequency method based on minimum word frequency for preliminary feature selection, filtering out some terms to reduce the sparsity of the feature space. It then applies the proposed attribute reduction algorithm to eliminate redundancy, yielding a more representative feature subset.
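As a rough illustration of the preliminary selection step, the sketch below assumes one plausible reading of "document frequency based on minimum word frequency": a document counts toward a term's document frequency only when the term occurs in it at least `min_tf` times, and a term is kept when that count reaches `min_df`. The function name, both thresholds, and the toy corpus are hypothetical and are not taken from the paper. The rough-set attribute reduction step is not shown.

```python
from collections import Counter


def select_by_document_frequency(docs, min_tf=2, min_df=2):
    """Keep terms that occur at least min_tf times in at least min_df documents.

    Assumed interpretation of the abstract, not the paper's exact definition.
    """
    df = Counter()
    for tokens in docs:
        tf = Counter(tokens)
        # A document contributes to a term's DF only if the term
        # is frequent enough inside that document.
        df.update(term for term, count in tf.items() if count >= min_tf)
    return sorted(term for term, n in df.items() if n >= min_df)


# Toy tokenized corpus (hypothetical data).
docs = [
    ["rough", "set", "rough", "text", "text"],
    ["text", "text", "rough", "rough", "noise"],
    ["noise", "set", "text"],
]
selected = select_by_document_frequency(docs, min_tf=2, min_df=2)
print(selected)
```

Raising `min_tf` discards terms that appear only incidentally in a document, which is one way such a filter could reduce the sparsity of the feature space before a redundancy-elimination step.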
Source
《湖南师范大学自然科学学报》
CAS
Peking University Core Journals (北大核心)
2009, No. 3, pp. 27-31 (5 pages)
Journal of Natural Science of Hunan Normal University
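Funding
Sichuan Province Science and Technology Program (2008GZ0003)
Sichuan Provincial Department of Science and Technology Key Research Project (07GG006-014)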
Keywords
text categorization
minimum word frequency
document frequency
attribute reduction
rough set