
Research on Query Construction Method for Deep Web Data Crawling (original title: Deep Web数据采集查询构造方法研究)

Cited by: 2
Abstract: The large scale, multi-source heterogeneity, dynamic updating, and high noise of network big data pose great challenges to knowledge acquisition. In particular, the Deep Web data of many websites is hidden in Web databases behind HTML forms and can only be accessed dynamically by submitting form queries. Web crawlers that follow hyperlinks between pages can hardly reach these data, which reduces the coverage of the knowledge resources they acquire, so crawling and using these data efficiently is very challenging. This paper therefore analyzes the existing query construction methods for Deep Web data crawling in detail and introduces the methods that correspond to different types of forms. It then summarizes the advantages and limitations of the existing surfacing-based query construction methods. Finally, it discusses future work on query construction, with the aim of promoting the further development of Deep Web crawling techniques.
Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), indexed in CSCD and the PKU Core Journals list (北大核心), 2015, No. 9, pp. 1025-1033 (9 pages)
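Funding: National Natural Science Foundation of China (Nos. 61173008, 61232010, 61303244, 61402442); National Basic Research Program of China (973 Program) (Nos. 2014CB340401, 2013CB329602); Beijing Nova Program (No. Z121101002512063); Beijing Natural Science Foundation (No. 4154086)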
Keywords: Deep Web; query interface; query construction; Web crawler


