Abstract
To address problems in traditional automatic document classification methods and in current semantic classification methods, this paper proposes a new semantic document classification model based on a concept vector space. The model uses a character-matching algorithm to match the mutually independent terms in a document's high-dimensional word vector space against the attribute sets that describe ontology concepts. Each matched term is then mapped to the ontology concept corresponding to its attribute set, which yields a lower-dimensional, semantically rich concept vector space for the document. The widely used "20Newsgroups" dataset serves as the experimental dataset for validating the model. Experimental results show that, compared with traditional classification based on the word vector space, the proposed method greatly reduces the dimensionality of the vector space and improves document classification performance.
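The mapping from word space to concept space described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `ONTOLOGY`, the helper names `build_attribute_index` and `to_concept_vector`, the use of exact case-insensitive string matching, and the choice to drop terms that match no attribute set. The paper's actual ontology, matching algorithm, and term weighting are not given in this record.

```python
from collections import Counter

# Hypothetical toy ontology: concept -> set of attribute terms describing it.
ONTOLOGY = {
    "vehicle": {"car", "engine", "wheel", "truck"},
    "computer": {"cpu", "keyboard", "software", "disk"},
}


def build_attribute_index(ontology):
    """Invert the ontology so each attribute term points to its concepts."""
    index = {}
    for concept, attributes in ontology.items():
        for attribute in attributes:
            index.setdefault(attribute.lower(), set()).add(concept)
    return index


def to_concept_vector(tokens, ontology):
    """Map a tokenized document to concept counts via exact string matching."""
    index = build_attribute_index(ontology)
    counts = Counter()
    for token in tokens:
        # Assumption: terms found in no attribute set are simply dropped.
        for concept in index.get(token.lower(), ()):
            counts[concept] += 1
    # One dimension per concept, in a fixed (sorted) order.
    return [counts[concept] for concept in sorted(ontology)]


doc = "The car engine failed so the truck towed the car".split()
vector = to_concept_vector(doc, ONTOLOGY)  # dimensions: computer, vehicle
```

In this toy case the document has seven distinct terms but is represented by only two concept dimensions, which is the kind of dimensionality reduction the abstract refers to.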
Source
Library and Information Service (《图书情报工作》)
CSSCI
PKU Core Journal (北大核心)
2011, Issue 24, pp. 106-111, 26 (7 pages in total)
Keywords
concept vector space
automatic classification of documents
semantic classification of documents
model