Abstract
Feature selection is one of the core research topics in text categorization. This paper briefly analyzes word frequency and document frequency and, on that basis, proposes a category concentration measure. It introduces the idea of set covering into rough sets and presents an attribute reduction algorithm based on minimal set covering. Combining this attribute reduction algorithm with category concentration yields a new feature selection method. The method first uses category concentration for preliminary feature selection, filtering out some terms to reduce the sparsity of the feature space, and then applies the proposed attribute reduction algorithm to eliminate redundancy, producing a more representative feature subset. Experimental results show that the method performs well.
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
PKU Core Journal (北大核心)
2011, Issue 28, pp. 124-127 (4 pages)
Funding
Henan Province Basic and Frontier Technology Research Program (No. 102300410266)
Keywords
feature selection
text categorization
word frequency
document frequency
rough sets
attribute reduction