Abstract
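Most existing keyword extraction algorithms operate on a single document. Although they can successfully extract the keywords of an individual article, they cannot support associative retrieval across multiple documents. Building on single-document keyword extraction, this paper introduces the centroid concept and the MMR formula from multi-document summarization, adapts them, and proposes and compares two multi-document keyword extraction algorithms. These algorithms extract keywords from multiple articles with similar content, generate a keyword vector according to term weights, and build a user interest model based on the keyword vector space. Tests on 100 articles covering 5 topics show that the precision and recall of the keywords extracted by both algorithms reach about 85%, so the keywords represent the user interest model fairly accurately.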
Most keyword extraction methods work on a single document. They can extract that document's keywords successfully, but they cannot handle queries over multiple documents. User modeling is one of the crucial techniques in content-based information retrieval. This paper presents two methods for extracting keywords from multiple documents to construct user models. The first method builds a cluster centroid, in which a collection of the most important words represents the whole cluster. The second applies a multi-document summarization method for IR results, called MMR-MD (Maximal Marginal Relevance Multi-Document). Experimental results show that both methods achieve around 85% precision and recall compared with keywords extracted by experts.
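The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the adapted formulas, so the weighting scheme (average relative term frequency over the cluster), the term similarity (Jaccard overlap of the documents in which two terms occur), the trade-off parameter `lam`, and the function names `centroid_weights` and `mmr_select` are all assumptions made for illustration.

```python
from collections import Counter


def centroid_weights(docs):
    """Assumed centroid: average relative term frequency over tokenized docs."""
    docs = [doc for doc in docs if doc]  # skip empty documents
    centroid = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            centroid[term] += (count / len(doc)) / len(docs)
    return dict(centroid)


def mmr_select(weights, docs, k, lam=0.7):
    """Greedy MMR-style pick of k keywords (hypothetical adaptation).

    score(t) = lam * normalized_weight(t) - (1 - lam) * max_sim(t, selected)
    """
    # Assumed similarity: Jaccard overlap of the document sets containing each term.
    occurs = {t: {i for i, d in enumerate(docs) if t in d} for t in weights}
    top = max(weights.values())

    def sim(a, b):
        union = occurs[a] | occurs[b]
        return len(occurs[a] & occurs[b]) / len(union) if union else 0.0

    selected = []
    candidates = set(weights)
    while candidates and len(selected) < k:
        def score(term):
            redundancy = max((sim(term, s) for s in selected), default=0.0)
            return lam * weights[term] / top - (1 - lam) * redundancy

        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    cluster = [
        ["keyword", "extraction", "user", "model"],
        ["keyword", "extraction", "retrieval"],
        ["keyword", "user", "model", "keyword"],
    ]
    weights = centroid_weights(cluster)
    print(mmr_select(weights, cluster, k=3))
```

Setting `lam=1.0` reduces the selection to a plain top-k ranking by centroid weight; lower values trade weight for diversity among the chosen keywords.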
Source
《计算机仿真》
CSCD
2007, Issue 2, pp. 103-105, 109 (4 pages)
Computer Simulation
Keywords
Information retrieval
Associative retrieval
Keyword extraction
User model