
A Co-occurrence based Vector Space Model for Document Indexing (基于词共现的文档表示模型)

Cited by: 8
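Abstract: Document representation models are the foundation of automatic text processing and an effective means of converting unstructured text into structured data. However, the commonly used Vector Space Model (VSM) represents documents in terms of individual words; because it ignores the associations between words, the accuracy of text mining is difficult to improve substantially. Building on word co-occurrence analysis, this paper discusses the latent connection between a document's topic and the second-order relations between words, defines quantitative measures of word co-occurrence degree and of relevance to the document topic, uses an association rule algorithm to extract word co-occurrence combinations from the document set, proposes the Co-occurrence Term based Vector Space Model (CTVSM), a topic-oriented document vector representation built on those combinations, and defines a CTVSM-based document similarity. Experiments show that CTVSM accurately reflects the relatedness between documents and has stronger topic discrimination ability than the classic VSM.

English abstract: This paper presents a novel co-occurrence terms based vector space model (CTVSM) for automatic document indexing, inspired by the Vector Space Model (VSM). In contrast to the traditional VSM, which represents a document as a bag of words regardless of the positions of those words in the text, the proposed technique uses co-occurring terms instead of single terms. First, pairs of clearly co-occurring terms are extracted from the document set by association rules; the similarity between documents is then defined on that basis. The experiments indicate substantial and consistent improvements of CTVSM over the standard VSM.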
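Authors: 常鹏 (Chang Peng), 冯楠 (Feng Nan)
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2012, No. 1, pp. 51-57 (7 pages)
Funding: National Natural Science Foundation of China (Grant No. 70901054)
Keywords: document model; co-occurrence; document similarity; text mining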
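
The pipeline outlined in the abstract (extract co-occurring term pairs, represent each document over those pairs, compare documents by similarity) can be illustrated with a minimal Python sketch. This is not the authors' method: the paper's formulas for co-occurrence degree and topic relevance are not given on this page, so the sketch assumes a plain document-level support threshold in place of the association rule step, binary pair weights, and cosine similarity. The function names `frequent_pairs`, `pair_vector`, and `cosine`, the `min_support` parameter, and the toy documents are all hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def frequent_pairs(docs, min_support=2):
    """Return term pairs that co-occur in at least `min_support` documents.

    Assumption: a simple support count stands in for the paper's
    association rule mining and co-occurrence degree measure.
    """
    counts = Counter()
    for tokens in docs:
        # Count each unordered pair once per document.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return sorted(p for p, c in counts.items() if c >= min_support)


def pair_vector(tokens, pairs):
    """Binary vector: 1.0 where both terms of a pair occur in the document."""
    present = set(tokens)
    return [1.0 if a in present and b in present else 0.0 for a, b in pairs]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (hypothetical): two text-mining documents, two spam documents.
docs = [
    ["text", "mining", "document", "model"],
    ["text", "mining", "similarity"],
    ["spam", "filter", "email"],
    ["spam", "filter", "bayesian"],
]

pairs = frequent_pairs(docs, min_support=2)
vectors = [pair_vector(d, pairs) for d in docs]

print(pairs)
print(cosine(vectors[0], vectors[1]))  # same topic
print(cosine(vectors[0], vectors[2]))  # different topics
```

On this toy corpus only the pairs shared within a topic pass the support threshold, so documents on the same topic align on those dimensions while documents on different topics share none. A fuller version would replace the binary weights with the paper's quantified co-occurrence and topic relevance scores.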
