Data Source Description Approach for Deep Web Based on Domain Features and User Query-based Sampling


Abstract: [Purpose/significance] Data source description (also called data source summarization or resource representation) is one of the key problems in Deep Web integrated retrieval, because its quality directly affects the retrieval efficiency and effectiveness of an integrated retrieval system. This paper proposes a data source description approach based on domain features and user query-based sampling, as a reference for applications and research on resource integration in non-cooperative environments. [Method/process] The approach is an offline sampling method for heterogeneous, non-cooperative data sources. By analyzing the data sources and the domain subject attributes used in queries, it builds a domain feature word set, an initial feature word set, and a high-frequency feature word set in turn, and then obtains the data source description by issuing the high-frequency feature words as sampling queries. Using the widely adopted CORI algorithm, the paper analyzes in depth how to compute the relevance between a user query and the data source descriptions with an inference network. On this basis, an integrated retrieval system was designed and built with the Lemur toolkit to verify the effectiveness of the approach. [Result/conclusion] The proposed approach performs well in both recall and precision. Compared with other approaches, it offers a clear cost advantage and practical value in the automatic updating of sample data and in operation and maintenance management.
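The relevance computation mentioned in the abstract can be illustrated with a minimal sketch of CORI scoring. The abstract does not give the exact formula or parameters the paper uses, so this follows the commonly published form of CORI with its usual default constants (50, 150, and a default belief of 0.4). The function names, the dictionary layout of the descriptions, and the use of summed document frequencies as a stand-in for collection word counts are all assumptions made for illustration.

```python
import math


def cori_belief(df, cw, avg_cw, cf, num_sources, b=0.4):
    """Belief that one source satisfies one query term (common CORI form).

    df: number of sampled documents in the source containing the term
    cw: size of the source description (assumed proxy: sum of all df values)
    avg_cw: mean description size over all sources
    cf: number of sources whose description contains the term
    """
    if cf == 0 or avg_cw == 0:
        return b  # term unseen everywhere: fall back to the default belief
    t = df / (df + 50 + 150 * cw / avg_cw)
    i = math.log((num_sources + 0.5) / cf) / math.log(num_sources + 1.0)
    return b + (1 - b) * t * i


def rank_sources(query_terms, descriptions):
    """Rank sources by the mean CORI belief over the query terms.

    descriptions: {source_name: {term: document frequency in the sample}}
    Returns a list of (source_name, score), best first.
    """
    n = len(descriptions)
    sizes = {name: sum(desc.values()) for name, desc in descriptions.items()}
    avg_cw = sum(sizes.values()) / n if n else 0.0
    scores = {}
    for name, desc in descriptions.items():
        beliefs = []
        for term in query_terms:
            cf = sum(1 for d in descriptions.values() if term in d)
            beliefs.append(
                cori_belief(desc.get(term, 0), sizes[name], avg_cw, cf, n)
            )
        scores[name] = sum(beliefs) / len(beliefs) if beliefs else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical descriptions built from sampled documents of two sources.
descriptions = {
    "source_a": {"web": 10, "deep": 5},
    "source_b": {"protein": 8},
}
ranking = rank_sources(["web"], descriptions)
```

Here `ranking` lists `source_a` first, because only its sampled description contains the query term. A source whose description lacks the term keeps the default belief of 0.4.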

Authors: Yuan Guohua (袁国华), Kou Jingjing (寇晶晶), Li Fang (李芳)

Affiliations: National Science Library, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Source: Library and Information Service (《图书情报工作》), indexed in CSSCI and the Peking University Core Journals list, 2017, No. 15, pp. 138-145 (8 pages)

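Funding: This work is one of the research outputs of the National Social Science Fund of China project "Research on the Deep Integration and Disclosure of Resources Based on Open Access Academic Journals" (Project No. 16BTQ025).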

Keywords: deep web; data source description; query-based sampling; inference network

Classification: TP391.3 [Automation and Computer Technology: Computer Application Technology]



