Abstract
Most commonly used feature selection algorithms for text classification rely on an evaluation function that scores the class-discriminating ability of each feature individually. Because they consider only the relevance between a feature and the categories and ignore the correlation among the features themselves, the selected feature set contains redundancy. To address this problem, this paper proposes a new feature selection algorithm for text classification that helps select features with strong class-discriminating ability and weak correlation with one another. Experiments confirm that the algorithm outperforms traditional feature selection algorithms.
Source
Application Research of Computers (《计算机应用研究》)
CSCD
PKU Core Journal (北大核心)
2011, No. 6, pp. 2099-2101 (3 pages)
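Funding
National Natural Science Foundation of China (70971059)
Liaoning Province Innovation Team Project (2009T045)
Liaoning Province Key Science and Technology Project (2007308003)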
Keywords
text classification
feature selection
fuzzy correlation
redundancy