Abstract
Text categorization is one of the key techniques for processing and organizing large volumes of text data. To perform text categorization more effectively, this paper proposes a feature extraction method for text based on a graph model. The method uses class information to construct an adjacency weighted graph and its complement graph on the training set, so that the projections of samples belonging to the same class lie as close together as possible and the projections of samples belonging to different classes lie as far apart as possible. It thereby captures the global structure of the document space while also preserving its local structure. Experiments were conducted on a subset of the 20 Newsgroups benchmark dataset using a k-nearest neighbor classifier, and the method was compared with text categorization based on classical latent semantic indexing. The experimental results show that the proposed method outperforms latent semantic indexing and improves text categorization performance.
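The abstract does not state the exact objective function or edge weights, so the following is only a minimal sketch of one common way to realize the idea, not the authors' formulation. It assumes 0/1 edge weights, builds a within-class graph and its complement (a between-class graph) from the labels, and finds a linear projection by solving a regularized generalized eigenproblem that keeps within-class pairs close and between-class pairs far apart. The function names, the `eps` regularizer, and the use of non-negative integer labels are all assumptions made for illustration.

```python
import numpy as np

def class_graphs(y):
    """Build the within-class graph W and its complement Wc from labels.
    Assumption: 0/1 weights; the abstract does not give the weighting."""
    y = np.asarray(y)
    same = y[:, None] == y[None, :]
    W = same.astype(float)
    np.fill_diagonal(W, 0.0)          # no self-loops
    Wc = (~same).astype(float)        # edges between different classes
    return W, Wc

def laplacian(W):
    """Graph Laplacian L = D - W, with D the diagonal degree matrix."""
    return np.diag(W.sum(axis=1)) - W

def fit_projection(X, y, k, eps=1e-3):
    """Return a (d, k) matrix P maximizing between-class spread relative to
    within-class spread of the projected rows of X (samples are rows)."""
    X = np.asarray(X, dtype=float)
    W, Wc = class_graphs(y)
    A = X.T @ laplacian(W) @ X        # penalizes distant same-class pairs
    B = X.T @ laplacian(Wc) @ X       # rewards distant different-class pairs
    d = A.shape[0]
    A = A + eps * (np.trace(A) / d) * np.eye(d)   # assumed regularizer
    # Solve B p = lambda A p by whitening with the Cholesky factor of A.
    C = np.linalg.cholesky(A)
    C_inv = np.linalg.inv(C)
    M = C_inv @ B @ C_inv.T
    M = (M + M.T) / 2.0               # enforce symmetry numerically
    _, vecs = np.linalg.eigh(M)       # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :k]        # k largest eigenvalues
    return C_inv.T @ top

def knn_predict(Z_train, y_train, Z_test, n_neighbors=1):
    """Plain k-nearest-neighbor vote; labels must be non-negative integers."""
    y_train = np.asarray(y_train)
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]
    return np.array([np.bincount(v).argmax() for v in y_train[idx]])
```

In use, one would fit `P` on the training term vectors, project both training and test documents with `X @ P`, and classify the projected test documents with `knn_predict`. For real term-document matrices the dense pairwise graphs above would likely need a sparse or neighborhood-restricted variant.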
Source
《情报科学》
CSSCI
PKU Core Journal (北大核心)
2011, Issue 8, pp. 1248-1251, 1272 (5 pages in total)
Information Science
Funding
Supported by the National Natural Science Foundation of China (Grant Nos. 60971088 and 60673186)
Keywords
text categorization
feature extraction
latent semantic indexing
graph model