Abstract
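Documents have a logical structure: titles, chapters, paragraphs and similar units form a document's inherent logic. Different users have different needs when searching documents, and how a retrieval system can return meaningful information has long been a central research question. Combining document structure with content, this paper proposes a new method of computing similarity for the retrieval of structured documents. The method supports retrieval of document content at multiple granularities, from words and phrases to paragraphs or chapters. A question answering system was built on this method, with Microsoft's Encarta encyclopedia as the test set. Experimental comparison with traditional methods shows that the passages retrieved by this method are more reasonable and more useful.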
Structured documents are made up of a few logical components, such as title, sections, subsections and paragraphs. The components of each structured document can be represented by an ordered tree model, which can also be viewed as a hierarchical concept relationship. To meet the user's requirements for more precise and concentrated search results, retrieval techniques should allow the user to retrieve document components of varying granularity. This paper presents a method for querying a document database by content and structure. The key idea is to construct a more comprehensive similarity function by taking advantage of the inherent hierarchical structure of documents. This work combines information retrieval techniques, semi-structured data query and proximate search for documents. The proposed method is evaluated on the Encarta encyclopedia document set, and the experimental results show that it can provide more accurate and focused answers than traditional document retrieval methods.
Source
《软件学报》
EI
CSCD
PKU Core (Peking University Core Journals)
2003, Issue 5, pp. 976-983 (8 pages)
Journal of Software
Funding
This work was performed while the first author was a visiting student at Microsoft Research Asia.
Keywords
document database
structural query
structured document
similarity computation
information retrieval
passage retrieval