Design and Implementation of a Deep Web Crawler (cited by: 5)


Abstract: As the World Wide Web grows rapidly, more and more information becomes accessible in the Deep Web. This information is obtained by submitting forms on Web pages and is generated dynamically from the back-end databases of Deep Web sites. Traditional Web crawlers can only retrieve ordinary Surface Web pages by following hyperlinks. Because no static links point directly to Deep Web pages, most current search engines cannot discover or index them. Yet compared with the Surface Web, the information in the Deep Web is often of higher quality and more valuable. This paper proposes a method for designing a Deep Web crawler using the HtmlUnit framework. The crawler integrates sites from several domains, analyzes their query forms, and fills them in automatically to retrieve relevant information from the back-end databases. Experiments carried out on actual Deep Web sites show that the method is effective.
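
The form-analysis step mentioned in the abstract can be illustrated with a minimal sketch. The paper's crawler is built on the HtmlUnit framework (a Java library); the sketch below does not use HtmlUnit and is not the authors' implementation. It relies only on the Python standard library, handles GET forms only, and the names `FormExtractor`, `build_query_url`, `SAMPLE_PAGE`, and the example.com URL are hypothetical, introduced purely for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormExtractor(HTMLParser):
    """Collect each <form> with its action, method, and named input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per form found on the page
        self._current = None   # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action", ""),
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
        elif self._current is not None and tag in ("input", "select", "textarea"):
            name = attrs.get("name")
            # Skip buttons: they carry no query value of their own.
            if name and attrs.get("type", "text") not in ("submit", "button", "reset", "image"):
                self._current["fields"][name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def build_query_url(page_url, form, values):
    """Fill known fields of a GET form and return the resulting query URL.

    Assumes the form's action has no query string of its own.
    """
    fields = dict(form["fields"])
    fields.update({k: v for k, v in values.items() if k in fields})
    return urljoin(page_url, form["action"]) + "?" + urlencode(fields)


# Hypothetical search page, used only to exercise the sketch.
SAMPLE_PAGE = """
<html><body>
<form action="/search" method="get">
  <input type="text" name="title">
  <input type="hidden" name="lang" value="en">
  <input type="submit" value="Go">
</form>
</body></html>
"""

parser = FormExtractor()
parser.feed(SAMPLE_PAGE)
form = parser.forms[0]
url = build_query_url("http://example.com/books/index.html", form, {"title": "deep web"})
print(url)  # http://example.com/search?title=deep+web&lang=en
```

A real Deep Web crawler would additionally need to choose meaningful values for each field, submit POST forms, and parse the result pages; a browser-emulating framework such as HtmlUnit covers much of that, which is presumably why the paper builds on it.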

Authors: 荣光 (Rong Guang), 张化祥 (Zhang Huaxiang)

Affiliation: School of Information Science and Engineering, Shandong Normal University (山东师范大学信息科学与工程学院)

Source: Computer and Modernization (《计算机与现代化》), 2009, No. 3, pp. 31-34 (4 pages)

Keywords: Deep Web; Web crawler; form

Classification: TP393 [Automation and Computer Technology: Computer Application Technology]


