Abstract
After reviewing the background of text categorization and the shortcomings of traditional feature selection based on the Vector Space Model, this paper proposes a text categorization model that combines different feature selection methods. The model first analyzes each text and represents it as a vector in the vector space. After the texts are preprocessed, keywords are extracted according to given rules, and noun-phrase recognition is added to the extraction step. For feature selection, document frequency is combined with mutual information, and both measures are refined. Experimental results show that the new method improves the precision of text categorization to some extent.
Source
《学术问题研究》
2005, No. 1, pp. 94-98 (5 pages)
Academic Research (Integrated Edition)
Keywords
text categorization
feature selection
document frequency
mutual information
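
As a rough illustration of the feature selection step summarized in the abstract, the sketch below filters terms by document frequency and then ranks the surviving terms by mutual information with the class labels. The abstract does not give the paper's improved formulas, so everything here is an assumption: the function name select_features, the min_df and top_k parameters, the filter-then-rank order, and the use of the maximum pointwise mutual information over classes are illustrative choices, not the authors' method.

```python
import math
from collections import Counter, defaultdict


def select_features(docs, labels, min_df=2, top_k=3):
    """Keep terms with document frequency >= min_df, then rank them by
    mutual information with the class labels and return the top_k terms.

    docs:   list of token lists, one per document
    labels: list of class labels, same length as docs
    NOTE: the DF-then-MI pipeline and both thresholds are assumptions
    made for illustration; they are not taken from the paper.
    """
    n = len(docs)
    df = Counter()                      # term -> number of documents containing it
    df_by_class = defaultdict(Counter)  # class -> term -> document count
    class_count = Counter(labels)       # class -> number of documents

    for tokens, label in zip(docs, labels):
        for term in set(tokens):        # count each term once per document
            df[term] += 1
            df_by_class[label][term] += 1

    scores = {}
    for term, count in df.items():
        if count < min_df:              # document-frequency filter
            continue
        # Pointwise mutual information, maximised over classes:
        # MI(t, c) = log( P(t, c) / (P(t) * P(c)) )
        best = float("-inf")
        for c, n_c in class_count.items():
            joint = df_by_class[c][term]
            if joint == 0:              # term never occurs in this class
                continue
            mi = math.log((joint / n) / ((count / n) * (n_c / n)))
            best = max(best, mi)
        scores[term] = best

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]


# Toy corpus (hypothetical data, for demonstration only).
docs = [
    ["stock", "market", "price"],
    ["market", "trade", "price"],
    ["match", "team", "goal"],
    ["team", "goal", "price"],
]
labels = ["finance", "finance", "sports", "sports"]

print(select_features(docs, labels, min_df=2, top_k=3))
```

In this toy corpus, "stock", "trade" and "match" each appear in only one document and are dropped by the document-frequency filter, while "price" appears in both classes and therefore scores lower on mutual information than the class-specific terms.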