WordNet-based Concept Vector Space Model for Text Classification
Cited by: 16
Abstract: This paper proposes a text feature extraction method built on the WordNet lexical ontology and describes an automatic text classification system designed and implemented around it, with the aim of improving classification accuracy. Existing automatic text classification systems describe the content of a text with an N-dimensional feature vector, and the way that vector is constructed strongly affects classification accuracy. The Vector Space Model (VSM), one of the most effective approaches, represents a document as a vector over mutually orthogonal terms and therefore ignores the semantic relations between terms. In real text such relations usually exist, for example synonymy and hypernymy-hyponymy. The proposed method replaces terms with WordNet synonym sets (synsets) as concepts and takes the hypernymy-hyponymy relations between synsets into account, building a concept vector space model that serves as the text's feature vector. This allows high-level information representative of each category to be extracted during training. A series of experiments compares the method with the term-based VSM approach; the results show that it improves classification accuracy, and that the gain is substantial when the training set is very small.
Authors: Zhang Jian (张剑), Li Chunping (李春平)
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal (北大核心), 2006, No. 4, pp. 174-178 (5 pages)
Keywords: automatic text classification, WordNet, concept vector, vector space model (VSM)
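
The idea summarized in the abstract, replacing terms with synsets and propagating weight to their hypernyms, can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny lexicon `TERM_TO_SYNSET`, the hypernym table `HYPERNYM`, the `decay` factor, and the choice to skip unknown terms are all assumptions made for illustration, and a real system would look these relations up in WordNet and would need some form of word sense selection.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for WordNet: term -> synset identifier.
TERM_TO_SYNSET = {
    "car": "car.n.01",
    "automobile": "car.n.01",   # synonym of "car", so it shares the synset
    "truck": "truck.n.01",
    "dog": "dog.n.01",
    "puppy": "puppy.n.01",
}

# Hypothetical hypernym links: synset -> its direct hypernym (parent concept).
HYPERNYM = {
    "car.n.01": "motor_vehicle.n.01",
    "truck.n.01": "motor_vehicle.n.01",
    "puppy.n.01": "dog.n.01",
    "dog.n.01": "animal.n.01",
}


def term_vector(tokens):
    """Classic term-based VSM: one dimension per distinct term."""
    return Counter(tokens)


def concept_vector(tokens, decay=0.5):
    """Concept-based vector: one dimension per synset.

    Each known term adds weight 1 to its synset and a geometrically
    decaying weight to every ancestor on its hypernym chain. The decay
    scheme is an assumption of this sketch, not taken from the paper.
    Terms missing from the toy lexicon are skipped.
    """
    vec = Counter()
    for tok in tokens:
        synset = TERM_TO_SYNSET.get(tok)
        weight = 1.0
        while synset is not None:
            vec[synset] += weight
            synset = HYPERNYM.get(synset)  # None once the chain ends
            weight *= decay
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a, doc_b, doc_c = ["car"], ["automobile"], ["truck"]
    print("term space, car vs automobile:",
          cosine(term_vector(doc_a), term_vector(doc_b)))
    print("concept space, car vs automobile:",
          cosine(concept_vector(doc_a), concept_vector(doc_b)))
    print("concept space, car vs truck:",
          cosine(concept_vector(doc_a), concept_vector(doc_c)))
```

In the term space, documents containing only "car" and only "automobile" share no dimension, so their similarity is zero. In the concept space they map to the same synset and become identical, while "car" and "truck" still overlap through their shared hypernym. This is the kind of higher-level category information the abstract refers to, and it matters most when few training documents are available.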