摘要
提出了一种基于词频分类器集成的文本分类方法·词频分类器是在对文本中的单词和它在每个文本中出现的频率进行统计后得到的简单分类器·虽然词频分类器本身泛化能力不强,但它不仅计算代较小,而且在训练样本甚至类别增加时易于进行更新,而整个学习系统的泛化能力可以由集成学习机制来提高,因此,词频分类器很适合用做集成学习的基分类器·在集成时,使用了改进的AdaBoost算法,加入了一种强制重新分布权的机制,避免算法过早停止,更加适合文本分类任务·在标准文集Reuters-21578上的实验结果表明,该方法能取得很好的效果·
In this paper, a method of text proposed. Term frequency classifier is a kind classification based on term frequency classifier ensemble is of simple classifier obtained after calculating terms' frequency of texts in the corpus. Though the generalization ability of term frequency classifier is not strong enough, it is a qualified base learner for ensemble because of i*s low computational cost, flexibility in updating with new samples and classes, and the feasibility of improving generalization with the help of ensemble paradigms. An improved AdaBoost algorithm is used to build the ensemble, which employs a scheme of compulsive weights updating to avoid early stop. Therefore it is more suitable for text classification. Experimental results on the corpus of Reuters-21578 show that the proposed method can achieve good performance in text classification tasks.
出处
《计算机研究与发展》
EI
CSCD
北大核心
2006年第10期1681-1687,共7页
Journal of Computer Research and Development
基金
国家自然科学基金项目(60505013)
江苏省自然科学基金创新人才基金项目(BK2005412)~~