
Multi-Document Keyword Extraction Applied to User Interest Modeling (cited by: 2)

Research on Keyword-Extraction from Multi-Document in User Model
Abstract: Most existing keyword extraction algorithms operate on a single document. They extract the keywords of an individual article well, but they cannot support associative retrieval across multiple documents. User modeling is one of the crucial techniques in content-based information retrieval. Building on single-document keyword extraction, this paper adapts the centroid concept and the MMR formula from multi-document summarization, and proposes and compares two algorithms that extract keywords from multiple articles with similar content. The first builds a cluster centroid, a collection of the most important words that represents the whole cluster. The second applies a multi-document summarization method for IR results called MMR-MD (Maximal Marginal Relevance Multi-Document). The extracted keywords are assembled into a weighted keyword vector, and the user interest model is built in this keyword vector space. In tests on 100 articles covering 5 topics, both algorithms reach about 85% precision and recall measured against keywords extracted by experts, so the resulting vectors represent the user's interests fairly accurately.
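The abstract names the two techniques but does not give the paper's modified formulas, so the following is only a minimal sketch under stated assumptions. It assumes each document has already been reduced to a `{term: weight}` dictionary by some single-document extractor, takes the centroid to be the mean weight of each term over the cluster, and measures redundancy in the MMR-style step as the cosine similarity between two terms' weight profiles across the documents. The function names, the `lam` trade-off parameter, and the redundancy measure are illustrative choices, not the authors' definitions.

```python
import math


def centroid_keywords(doc_keywords, top_k=None):
    """Average each term's weight over all documents (assumed centroid definition).

    doc_keywords: list of {term: weight} dicts, one per document.
    Returns (term, weight) pairs sorted by descending weight; all pairs if top_k is None.
    """
    n = len(doc_keywords)
    centroid = {}
    for weights in doc_keywords:
        for term, w in weights.items():
            centroid[term] = centroid.get(term, 0.0) + w / n
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


def _profile_cosine(a, b, doc_keywords):
    """Cosine similarity of two terms' per-document weight profiles (assumed redundancy measure)."""
    va = [d.get(a, 0.0) for d in doc_keywords]
    vb = [d.get(b, 0.0) for d in doc_keywords]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


def mmr_keywords(doc_keywords, top_k, lam=0.7):
    """Greedy MMR-style selection: lam * relevance - (1 - lam) * max redundancy."""
    centroid = dict(centroid_keywords(doc_keywords))
    if not centroid:
        return []
    top = max(centroid.values()) or 1.0
    relevance = {t: w / top for t, w in centroid.items()}  # scale to [0, 1]
    selected = []
    candidates = set(relevance)

    def score(term):
        redundancy = max(
            (_profile_cosine(term, s, doc_keywords) for s in selected),
            default=0.0,
        )
        return lam * relevance[term] - (1 - lam) * redundancy

    while candidates and len(selected) < top_k:
        # sorted() makes ties resolve alphabetically, so the result is deterministic
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return [(t, centroid[t]) for t in selected]


# Hypothetical per-document keyword weights for one cluster of similar articles
docs = [
    {"keyword": 0.9, "extraction": 0.8, "centroid": 0.3},
    {"keyword": 0.7, "extraction": 0.6, "mmr": 0.5},
    {"keyword": 0.8, "user": 0.4, "model": 0.4},
]
centroid_vector = centroid_keywords(docs, top_k=3)
mmr_vector = mmr_keywords(docs, top_k=3)
```

Either returned list of `(term, weight)` pairs can serve as the weighted keyword vector for one interest topic. Raising `lam` makes the MMR-style variant behave more like the plain centroid ranking, while lowering it pushes the selection toward terms that appear in different documents.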
Source: Computer Simulation (《计算机仿真》), CSCD, 2007, Issue 2, pp. 103-105, 109 (4 pages)
Keywords: information retrieval (associative retrieval); keyword extraction; user model
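The abstract reports precision and recall against expert-chosen keywords. A small sketch of the standard set-based definitions follows. It assumes exact string matching between extracted and expert keywords, which this record does not confirm, and the example terms are hypothetical.

```python
def precision_recall(extracted, expert):
    """Set-based precision and recall of extracted keywords against an expert list."""
    extracted, expert = set(extracted), set(expert)
    hits = len(extracted & expert)  # keywords found by both
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(expert) if expert else 0.0
    return precision, recall


# Hypothetical keyword lists for one topic
p, r = precision_recall(
    ["keyword", "extraction", "centroid", "summary"],
    ["keyword", "extraction", "centroid", "user", "model"],
)
```

Here three of the four extracted terms appear in the expert list, and three of the five expert terms were recovered.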

