Abstract
In text classification on unbalanced data sets, traditional feature selection methods tend to bias the classifier toward large categories and neglect small ones. To address this problem, this paper proposes a new feature selection method, IPR (integrated probability ratio). The method takes into account how each feature is distributed across the positive and negative classes, and combines four indicators of feature-category relevance to score feature terms. It is better suited to unbalanced data sets than traditional feature selection methods, improving the recognition rate of small categories without reducing classification performance on large categories. Experimental results show that the method is effective and feasible.
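The abstract does not state the IPR formula or name its four indicators, so the following is only a minimal Python sketch of the general idea of scoring a term by how differently it is distributed in the positive and negative classes. The function names `probability_ratio_scores` and `select_top_k`, the `smoothing` parameter, and the use of a single smoothed log probability ratio are all assumptions made for illustration; this is a stand-in, not the paper's IPR method.

```python
import math
from collections import Counter


def probability_ratio_scores(docs, labels, smoothing=1.0):
    """Score terms by |log(P(term | positive) / P(term | negative))|.

    docs:   list of token lists, one per document.
    labels: list of 1 (positive class) or 0 (negative class).
    Hypothetical stand-in for illustration; not the IPR formula itself.
    """
    pos_df, neg_df = Counter(), Counter()
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos

    for tokens, y in zip(docs, labels):
        target = pos_df if y == 1 else neg_df
        target.update(set(tokens))  # count each term once per document

    scores = {}
    for term in set(pos_df) | set(neg_df):
        # Smoothed per-class document-frequency rates: each class is
        # normalised by its own size, so the large class does not dominate.
        p_pos = (pos_df[term] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (neg_df[term] + smoothing) / (n_neg + 2 * smoothing)
        scores[term] = abs(math.log(p_pos / p_neg))
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus: one positive (small-class) document, three negative ones.
docs = [["rare", "shared"], ["common", "shared"], ["common", "shared"], ["common"]]
labels = [1, 0, 0, 0]
scores = probability_ratio_scores(docs, labels)
top_terms = select_top_k(scores, 2)
```

Because each class is normalised by its own document count, a term that appears only in the single small-class document can still outrank a term that is frequent in the large class, which is the kind of behaviour the abstract asks of a feature selector on unbalanced data.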
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journals (北大核心)
2011, No. 12, pp. 4532-4534 (3 pages)
Funding
Supported by the Graduate Innovation Fund of the Central Universities (CDJXS11180013)
Keywords
unbalanced data sets
text classification
feature selection
positive class
negative class