摘要
网络大数据的大规模、多源异构、动态更新、高噪声给知识的获取带来了很大的挑战。特别地,很多网站隐藏在HTML表单后端的Web数据库中的Deep Web数据,只能通过提交表单查询的方式进行动态访问,网络爬虫难以通过页面之间的链接关系采集到这些数据,影响了获取到的知识资源的覆盖率,如何高效地采集这些数据并加以利用非常具有挑战性。为此对现有的Deep Web数据采集的查询构造方法进行了详细分析,分别介绍了针对不同类型的表单对应的Deep Web数据采集查询构造方法;总结了现有表层化方式的Deep Web数据采集查询构造方法的优缺点,并对Deep Web数据采集查询构造方法的未来工作进行了展望,以推动Deep Web数据采集技术的进一步发展。
Network big data bring a great challenge to the knowledge acquisition because of large-scale, heterogeneity,dynamic and high noise. Specially, many websites data are hidden in Web databases behind the HTML forms, called Deep Web data, which can only be dynamically accessed by performing form submissions. These data can not be covered by Web crawlers as a result of using hyperlinks to collect resources, which affects the coverage of knowledge resources. Therefore, how to efficiently crawl these data and make use of them is challenging. This paper firstly presents a detailed analysis of the existing Deep Web data acquisition query construction methods, and introduces the Deep Web data acquisition query construction methods according to the different types of forms. Secondly, this paper concludes the advantages and limitations of the existing methods. Finally, this paper proposes the future work to promote the development of the Deep Web crawling techniques.
出处
《计算机科学与探索》
CSCD
北大核心
2015年第9期1025-1033,共9页
Journal of Frontiers of Computer Science and Technology
基金
国家自然科学基金Nos.61173008
61232010
61303244
61402442
国家重点基础研究发展计划(973计划)Nos.2014CB340401
2013CB329602
北京市科技新星计划项目No.Z121101002512063
北京市自然科学基金No.4154086~~