
Web信息主题采集技术研究 (Research on Focused Web Information Crawling Technologies) Cited by: 17

Technologies of Focused Web Crawling
Abstract: This paper briefly introduces focused Web crawling systems and examines their core technologies in five areas: seed page generation, topic representation, relevance computation strategies, crawling strategies, and stopping strategies. It discusses manual, automatic, and hybrid seed page generation; keyword-based and ontology-based topic representation; a comparison of heuristic relevance computation strategies; basic crawling strategies and tunneling; and the conditions under which a crawl ends. While analyzing the algorithms, characteristics, and applications of these techniques, it proposes improvements suited to the characteristics of focused crawling.
Author: Li Chunwang (李春旺)
Source: Library and Information Service (《图书情报工作》), CSSCI, PKU Core Journal, 2005, No. 4, pp. 77-80, 70 (5 pages)
Keywords: Web; search engine; focused crawling; technology
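
The five components named in the abstract (seed pages, topic representation, relevance computation, crawling strategy, and stopping strategy) can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a keyword-based topic, a toy relevance score (the fraction of topic keywords found on a page), a best-first frontier in which a link inherits its parent's score, and a page budget as the stop condition. The names `relevance`, `focused_crawl`, and `TOY_WEB` are hypothetical, and an in-memory dictionary stands in for real HTTP fetches. Tunneling is not implemented here.

```python
import heapq

def relevance(text, topic_keywords):
    """Toy keyword-based score: fraction of topic keywords present in the text."""
    words = set(text.lower().split())
    hits = sum(1 for kw in topic_keywords if kw in words)
    return hits / len(topic_keywords)

def focused_crawl(pages, seeds, topic_keywords, threshold=0.5, max_pages=10):
    """Best-first crawl over an in-memory 'web'.

    pages: dict mapping url -> (text, [outlinks]); a hypothetical stand-in
           for fetching and parsing real pages.
    Returns (url, score) pairs for pages judged relevant, in visit order.
    """
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    visited = 0
    # Stop when the frontier is empty or the page budget is spent.
    while frontier and visited < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in pages:
            continue  # unknown URL: treat as a failed fetch
        visited += 1
        text, links = pages[url]
        score = relevance(text, topic_keywords)
        if score >= threshold:
            relevant.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumption: a link's priority is its parent page's score.
                heapq.heappush(frontier, (-score, link))
    return relevant

# Hypothetical five-page web used only for illustration.
TOY_WEB = {
    "a": ("focused web crawler overview", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("web crawler frontier design", ["e"]),
    "d": ("more recipes", []),
    "e": ("topic crawler evaluation", []),
}

result = focused_crawl(TOY_WEB, ["a"], ["crawler", "web"])
print(result)
```

In this sketch, links found on off-topic pages are still queued but with low priority, so they are visited only after more promising links are exhausted. A tunneling variant would instead let the crawler follow a bounded number of consecutive low-scoring pages in the hope of reaching relevant ones beyond them.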

