
Design and Implementation of a Deep Web Crawler (cited by: 5)
Abstract: As the World Wide Web grows rapidly, the Deep Web holds more and more accessible information. This information is generated dynamically by back-end Deep Web databases and can be obtained by submitting forms on Web pages. Traditional Web crawlers retrieve only ordinary Surface Web pages by following hyperlinks. Because no static links point directly to Deep Web pages, most current search engines cannot discover or index them. Yet compared with the Surface Web, the information in the Deep Web is often of higher quality and more valuable. This paper proposes a method for designing a Deep Web crawler with the HtmlUnit framework. The crawler integrates sites from several domains and retrieves relevant information from their back-end databases by analyzing query forms and filling them in automatically. Experiments carried out on actual Deep Web sites show that the method is effective.
Authors: Rong Guang (荣光), Zhang Huaxiang (张化祥)
Source: Computer and Modernization (《计算机与现代化》), 2009, No. 3, pp. 31-34 (4 pages)
Keywords: Deep Web; Web crawler; form
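
The abstract describes a crawler that analyzes a query form and fills it in to reach database-backed pages. The following is a minimal sketch of that idea only, not the authors' implementation: the paper uses the Java HtmlUnit framework, whereas this stand-in uses Python's standard library. The names `FormExtractor`, `build_query`, and `SAMPLE_PAGE`, and the sample form itself, are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormExtractor(HTMLParser):
    """Collect each <form> with its action, method, and named fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # form being parsed, or None when outside a form

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action", ""),
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
            self.forms.append(self._current)
        elif self._current is not None and tag in ("input", "select", "textarea"):
            name = attrs.get("name")
            if name:  # unnamed controls (e.g. a bare submit button) are skipped
                self._current["fields"][name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def build_query(form, values):
    """Fill known fields with the given values and return a GET query URL."""
    filled = dict(form["fields"])
    for name, value in values.items():
        if name in filled:  # ignore values for fields the form does not have
            filled[name] = value
    return form["action"] + "?" + urlencode(filled)


# Hypothetical search form standing in for a real Deep Web query interface.
SAMPLE_PAGE = """
<form action="/search" method="get">
  <input type="text" name="title">
  <input type="hidden" name="lang" value="en">
  <input type="submit" value="Go">
</form>
"""

parser = FormExtractor()
parser.feed(SAMPLE_PAGE)
form = parser.forms[0]
url = build_query(form, {"title": "deep web"})
print(url)
```

A real crawler would then request the resulting URL and extract records from the response page. The sketch stops at query construction and handles only GET forms.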

