
Improved Shark-Search algorithm based on page segmentation (cited by: 7)
Abstract: Shark-Search is a classical algorithm for focused crawling, but it performs poorly on Web pages that contain many noisy links. This paper proposes an improved Shark-Search algorithm based on page segmentation, which evaluates relevance at three granularities (page, block, and individual link) so that links can be selected and filtered more effectively. Experiments show that the improved algorithm substantially outperforms the traditional Shark-Search algorithm in both precision and the total amount of information retrieved.
Authors: CHEN Jun, CHEN Zhumin
Source: Journal of Shandong University (Natural Science), 2007, No. 9, pp. 62-66 (5 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: National Key Technology R&D Program, sub-project (2006BAH02A29); Shandong Province Doctoral Foundation (2006BS01016)
Keywords: Shark-Search algorithm; focused crawling; page segmentation; relevance computation
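
The abstract describes scoring links at three granularities (page, block, link) but this record does not give the paper's formulas. The following is only a minimal illustrative sketch of that idea, not the authors' method: it assumes plain bag-of-words cosine similarity as the relevance measure, and the names `cosine` and `score_links`, the weights `w_page`, `w_block`, `w_anchor`, the `threshold` value, and the `example.org` URLs are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two texts using raw term counts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def score_links(topic, page_text, blocks,
                w_page=0.2, w_block=0.4, w_anchor=0.4, threshold=0.2):
    """Score each link from page, block, and anchor-text relevance.

    blocks is a list of (block_text, [(url, anchor_text), ...]).
    The weights and threshold are assumed values for illustration.
    Returns (score, url) pairs at or above the threshold, best first.
    """
    page_rel = cosine(topic, page_text)
    scored = []
    for block_text, links in blocks:
        block_rel = cosine(topic, block_text)
        for url, anchor in links:
            score = (w_page * page_rel
                     + w_block * block_rel
                     + w_anchor * cosine(topic, anchor))
            if score >= threshold:  # drop links that look like noise
                scored.append((score, url))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    content = "Focused crawler algorithms for topical web search"
    navigation = "Home About Contact Login"
    blocks = [
        (content, [("http://example.org/crawler", "focused crawler survey")]),
        (navigation, [("http://example.org/login", "Login")]),
    ]
    page = content + " " + navigation
    for score, url in score_links("focused web crawler", page, blocks):
        print(f"{score:.3f} {url}")
```

In this sketch a link inside a navigation block inherits only the page-level score, so it falls below the threshold, while a link inside an on-topic block is kept. A real crawler would obtain the blocks from a page segmentation step, which is outside the scope of this example.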
