
Deep Web爬虫爬行策略研究 (cited by: 13)

On research of deep web crawler's crawling strategy
Abstract: An ever-increasing amount of information on the web is available only through query interfaces: to reach the pages of such a site, often called the hidden web or the deep web, users have to type in a set of keywords. Because there are no static links to deep web pages, most current search engines cannot discover or index them. Recent studies indicate, however, that the content provided by deep web sites is often of high quality and valuable to many users. This paper studies how to build an effective deep web crawler that can automatically discover and download deep web pages. Since the only "entry point" to a deep web site is its query interface, the main challenge in designing such a crawler is how to automatically generate meaningful queries to issue to that interface. The paper proposes a theoretical framework for this query generation problem, together with policies for generating queries automatically. Experiments on real deep web sites show that the approach is effective.
Source: Computer Engineering and Design (《计算机工程与设计》), CSCD, PKU Core Journal, 2006, No. 17, pp. 3154-3158 (5 pages)
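Funding: Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20040285016); Jiangsu Province High-Technology Research Fund (BG2005019).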
Keywords: deep web; deep web crawler; query selection; query efficiency; adaptive crawling algorithm

