
Retrieving Deep Web Data Based on Hierarchy Tree Model (cited by: 14)
Abstract: While the Web has become a platform for searching and publishing information, a massive amount of information is hidden in query-restricted Web databases, which makes these high-quality data records difficult to obtain. Existing research on Deep Web search has focused on crawling Web database content through keyword query interfaces. However, because the Deep Web is characterized by multi-attribute interfaces and top-k result limits, keyword-based methods have inherent limitations, which poses a challenge for Deep Web querying and retrieval. To address this problem, we propose a hierarchy-tree-based approach to Deep Web data acquisition that retrieves all data records in a hidden Web database completely and without repetition. First, we model the Web database as a hierarchy tree, so that Deep Web data retrieval becomes a tree traversal problem. Second, we sort the attributes in the tree in ascending order to narrow the query space. Third, we use mutual information to measure the dependency between attribute values and, based on it, apply heuristic rules that guide the traversal and further narrow the traversal space. Finally, extensive experiments on locally simulated (controlled) databases and real Deep Web sites show that the approach achieves good coverage and high extraction efficiency.
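The tree-traversal idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a simulated interface that returns at most k records plus an overflow flag, assumes "ascending order" means ordering attributes by ascending domain size, and omits the mutual-information heuristic entirely. The names `make_interface`, `crawl`, and the sample car attributes are hypothetical.

```python
from itertools import product


def make_interface(records, k):
    """Simulate a top-k query interface over a hidden database (assumed model)."""
    def query(constraints):
        # Conjunctive equality query over the given attribute values.
        matches = [r for r in records
                   if all(r[a] == v for a, v in constraints.items())]
        # Return at most k records and whether more matches were cut off.
        return matches[:k], len(matches) > k
    return query


def crawl(query, domains):
    """Depth-first traversal of the hierarchy tree: one attribute per level."""
    # Assumption: order attributes by ascending number of distinct values.
    attrs = sorted(domains, key=lambda a: len(domains[a]))
    retrieved = []
    issued = 0

    def visit(constraints, depth):
        nonlocal issued
        issued += 1
        results, overflow = query(constraints)
        # A non-overflowing node returns its whole subtree, so prune here.
        # A fully specified node cannot be refined further, so keep what it gives.
        if not overflow or depth == len(attrs):
            retrieved.extend(results)
            return
        attr = attrs[depth]
        for value in domains[attr]:
            visit({**constraints, attr: value}, depth + 1)

    visit({}, 0)
    return retrieved, issued


# Hypothetical sample data: 3 makes x 2 colors x 4 years = 24 records.
domains = {
    "make": ["audi", "bmw", "fiat"],
    "color": ["red", "blue"],
    "year": [2007, 2008, 2009, 2010],
}
records = [
    {"id": i, "make": m, "color": c, "year": y}
    for i, (m, c, y) in enumerate(
        product(domains["make"], domains["color"], domains["year"]))
]

retrieved, issued = crawl(make_interface(records, k=5), domains)
print(len(retrieved), "records retrieved with", issued, "queries")
```

Because sibling subtrees partition the records and results are collected only at nodes where traversal stops, no record is collected twice in this sketch. Records that are identical on every queryable attribute and exceed k would still be unreachable.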
Source: Journal of Computer Research and Development (《计算机研究与发展》), EI, CSCD, Peking University Core Journal, 2011, No. 1, pp. 94-102 (9 pages)
Funding: National Natural Science Foundation of China (60970018)
Keywords: hidden database; data retrieval; multi-attribute interfaces; top-k tuple; mutual information

