Abstract
With the rapid growth of the World Wide Web, an increasing amount of information is available in the Deep Web. This information is generated dynamically by back-end databases and can be reached only by submitting forms on Web pages. Traditional Web crawlers retrieve only ordinary Surface Web pages by following hyperlinks; because no static links point to Deep Web pages, most current search engines cannot discover or index them. Yet compared with the Surface Web, the information in the Deep Web is often of higher quality and more valuable. This paper proposes a method for designing a Deep Web crawler with the HtmlUnit framework. The crawler integrates sites from several domains, analyzes their query forms, and fills them in automatically to retrieve relevant information from the back-end databases. Experiments carried out on actual Deep Web sites show that the method is effective and accurate.
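The form-analysis step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: HtmlUnit is a Java library, whereas the sketch below uses only the Python standard library, and every name in it (`FormExtractor`, `build_query_url`, `SAMPLE_PAGE`, the example URL and field names) is hypothetical. It assumes a simple GET form whose `action` carries no query string of its own.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormExtractor(HTMLParser):
    """Collect every <form> with its action, method, and named fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
        elif self._current is not None and tag in ("input", "select", "textarea"):
            name = attrs.get("name")
            if name:
                # Keep the default value when the page supplies one.
                self._current["fields"][name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def build_query_url(page_url, form, values):
    """Fill known fields with query values and build the GET request URL."""
    fields = dict(form["fields"])
    # Ignore values for fields the form does not declare.
    fields.update({k: v for k, v in values.items() if k in fields})
    return urljoin(page_url, form["action"]) + "?" + urlencode(fields)


# Hypothetical search page, used only for illustration.
SAMPLE_PAGE = """
<form action="/search" method="get">
  <input type="text" name="title">
  <input type="hidden" name="lang" value="en">
  <select name="category"><option value="books">Books</option></select>
</form>
"""

parser = FormExtractor()
parser.feed(SAMPLE_PAGE)
form = parser.forms[0]
url = build_query_url("http://example.com/index.html", form, {"title": "deep web"})
```

A real crawler would go on to fetch `url` and extract records from the result page. Choosing meaningful values for each field and handling POST forms or script-driven pages are the harder problems, and they are left out of this sketch.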
Source
《计算机与现代化》
2009, Issue 3, pp. 31-34 (4 pages)
Computer and Modernization