Abstract
[Purpose/Significance] Data source description (also called data source summary or resource representation) is a key problem in Deep Web integrated retrieval, because the quality of the descriptions directly affects the efficiency and effectiveness of an integrated retrieval system. This paper proposes a data source description method based on domain features and user query-based sampling, intended as a reference for applications and research on resource integration in non-cooperative environments. [Method/Process] The method is an offline sampling approach for heterogeneous, non-cooperative data sources. By analyzing the domain subject attributes of the data sources and of the queries, it builds, in turn, a domain feature word set, an initial feature word set, and a high-frequency feature word set, and finally obtains the data source description by query-based sampling with the high-frequency feature words. Using the widely used CORI algorithm, the paper then analyzes an inference-network-based method for computing the relevance between a user query and the data source descriptions, and on that basis designs an integrated retrieval system built on the Lemur toolkit to verify the effectiveness of the method. [Result/Conclusion] The proposed method performs well in both recall and precision. Compared with other methods, it offers a clear cost advantage and practical value for the automatic updating of sample data and for operation and maintenance management.
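The relevance computation mentioned in the abstract can be illustrated with a minimal sketch of CORI-style scoring over sampled data source descriptions. This is not the paper's implementation. The function names (`cori_belief`, `rank_sources`), the toy descriptions, and the choice to average the per-term beliefs are assumptions made for illustration. The constants 50, 150, and 0.4 are the defaults commonly cited for CORI and are not taken from this paper.

```python
import math

def cori_belief(df, cw, avg_cw, cf, num_sources, b=0.4):
    """Belief that one query term is satisfied by one data source description.

    df: sampled documents of this source that contain the term
    cw: number of words in this source's sample
    avg_cw: mean cw over all sources
    cf: number of sources whose description contains the term
    b: default belief (0.4 is the commonly cited value; an assumption here)
    """
    if df == 0 or cf == 0:
        return b  # term absent from the description: only the default belief remains
    t = df / (df + 50 + 150 * cw / avg_cw)
    i = math.log((num_sources + 0.5) / cf) / math.log(num_sources + 1.0)
    return b + (1 - b) * t * i

def rank_sources(query_terms, descriptions):
    """Rank sources by the mean belief over the query terms (a simplification).

    descriptions: name -> {"df": {term: document frequency}, "cw": sample word count}
    """
    if not query_terms or not descriptions:
        return []
    n = len(descriptions)
    avg_cw = sum(d["cw"] for d in descriptions.values()) / n
    scores = {}
    for name, d in descriptions.items():
        total = 0.0
        for term in query_terms:
            cf = sum(1 for o in descriptions.values() if o["df"].get(term, 0) > 0)
            total += cori_belief(d["df"].get(term, 0), d["cw"], avg_cw, cf, n)
        scores[name] = total / len(query_terms)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical descriptions, as if obtained by query-based sampling.
descriptions = {
    "library_db": {"df": {"retrieval": 40, "metadata": 25}, "cw": 10000},
    "medical_db": {"df": {"retrieval": 2, "clinical": 60}, "cw": 12000},
    "news_db": {"df": {"election": 30}, "cw": 8000},
}
ranked = rank_sources(["retrieval", "metadata"], descriptions)
```

In this toy setting `ranked` lists `library_db` first, since both query terms occur often in its sample, while a source containing neither term keeps only the default belief.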
Source
《图书情报工作》
CSSCI
PKU Core Journals (北大核心)
2017, Issue 15, pp. 138-145 (8 pages)
Library and Information Service
Funding
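One of the research outputs of the National Social Science Fund of China project "Research on Deep Integration and Discovery of Resources Based on Open Access Academic Journals" (Grant No. 16BTQ025)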
Keywords
deep web
data source description
query-based sampling
inference network