Abstract
When a user submits a query to an XML search engine, the engine usually returns far more results than the user expects, and some of the returned documents or elements are inevitably irrelevant. For a document-centric XML collection, XML snippet retrieval extracts, for each document or element returned by the search engine, a snippet of only a few hundred characters. The snippet lets the user judge the actual relevance of the underlying document or element to the query and decide whether further reading is worthwhile, which makes obtaining information from XML documents more efficient. This paper proposes an XML snippet retrieval strategy based on an element weighting model. The strategy first uses the Average Topic Generalization (ATG) model to assign weights automatically to the tags or paths in the XML collection, and then applies these weights to the BM25 model to obtain the BM25EW retrieval model (also denoted BM25NW). After BM25EW retrieves and ranks the relevant elements, each element is split into fixed-length windows, and each window is scored to assess whether it is suitable as snippet content. Finally, the higher-scoring windows are selected to form the snippet returned to the user, subject to keeping information redundancy low. Evaluation results on the INEX 2011 Snippet Retrieval Track show that the ATG-based snippet retrieval strategy is highly competitive and clearly outperforms the other participating systems.
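The window-based extraction step described in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the window score (query-term hits scaled by an element weight), the Jaccard overlap used as a redundancy measure, and all function and parameter names are hypothetical and are not the paper's BM25EW or ATG definitions.

```python
from typing import List, Tuple


def split_windows(tokens: List[str], size: int) -> List[List[str]]:
    """Cut a token list into consecutive, non-overlapping windows of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def score_window(window: List[str], query: List[str], element_weight: float) -> float:
    """Assumed score: number of query-term hits, scaled by the element's weight."""
    terms = set(query)
    hits = sum(1 for tok in window if tok in terms)
    return element_weight * hits


def jaccard(a: List[str], b: List[str]) -> float:
    """Assumed redundancy measure: Jaccard overlap of the two windows' vocabularies."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def build_snippet(
    elements: List[Tuple[str, float]],  # (element text, hypothetical element weight)
    query: List[str],
    window_size: int = 5,
    max_windows: int = 2,
    max_overlap: float = 0.5,
) -> str:
    """Greedily pick high-scoring windows, skipping ones too similar to those already chosen."""
    candidates = []
    for text, weight in elements:
        for window in split_windows(text.lower().split(), window_size):
            score = score_window(window, query, weight)
            if score > 0:
                candidates.append((score, window))
    # Stable sort: ties keep their original retrieval order.
    candidates.sort(key=lambda c: c[0], reverse=True)

    chosen: List[List[str]] = []
    for _, window in candidates:
        if len(chosen) == max_windows:
            break
        if all(jaccard(window, kept) <= max_overlap for kept in chosen):
            chosen.append(window)
    return " ... ".join(" ".join(w) for w in chosen)


if __name__ == "__main__":
    retrieved = [
        ("xml snippet retrieval ranks elements", 2.0),
        ("xml snippet retrieval ranks elements", 1.0),  # duplicate text, lower weight
        ("fixed windows are scored for snippet", 1.0),
    ]
    print(build_snippet(retrieved, ["xml", "snippet"]))
```

In this toy run the duplicated element is skipped because its window fully overlaps one already selected, which is the kind of redundancy control the abstract alludes to. A real system would substitute the model's own window score and redundancy criterion.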
Source
Chinese Journal of Computers (《计算机学报》)
EI
CSCD
PKU Core Journal (北大核心)
2013, No. 8, pp. 1729-1744 (16 pages)
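Funding
National Natural Science Foundation of China (60803105, 61173146)
National Social Science Foundation of China (12CTQ042)
Jiangxi Provincial Higher Education Science and Technology Landing Program (KJLD12022)
Science and Technology Research Project of the Jiangxi Provincial Department of Education (赣教技字 No. 11731)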