
On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis (cited by: 6)

Abstract: Focused crawling is an important technique for topical resource discovery on the Web. The key issue in focused crawling is to prioritize uncrawled uniform resource locators (URLs) in the frontier so that the crawl concentrates on relevant pages. Traditional focused crawlers rely mainly on content analysis; link-based techniques are not effectively exploited despite their usefulness. In this paper, we propose a new frontier prioritizing algorithm, the on-line topical importance estimation (OTIE) algorithm. OTIE combines link- and content-based analysis to evaluate the priority of an uncrawled URL in the frontier.
We performed real crawling experiments over 30 topics selected from the Open Directory Project (ODP) and compared the harvest rate and target recall of four crawling algorithms: breadth-first, link-context-prediction, on-line page importance computation (OPIC), and our OTIE. Experimental results showed that OTIE significantly outperforms the other three algorithms on average target recall while maintaining an acceptable harvest rate. Moreover, OTIE is much faster than the traditional focused crawling algorithm.
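The ideas named in the abstract (a prioritized frontier, harvest rate, and target recall) can be illustrated with a minimal Python sketch. This is not the OTIE algorithm, whose details are not given in this record. The `Frontier` class, the `combined_priority` function, and the weighting parameter `alpha` are hypothetical names introduced for illustration. The metric functions assume the common definitions: harvest rate as the fraction of crawled pages that are relevant, and target recall as the fraction of a known target set that was crawled.

```python
import heapq


class Frontier:
    """Best-first frontier: pops the uncrawled URL with the highest priority."""

    def __init__(self):
        self._heap = []      # min-heap of (-priority, url)
        self._seen = set()   # URLs already queued, to avoid duplicates

    def push(self, url, priority):
        # Ignore URLs that were queued before (no re-prioritization here).
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


def combined_priority(link_score, content_score, alpha=0.5):
    """Hypothetical linear mix of a link-based and a content-based score.

    Assumption for illustration only; the paper's actual combination
    rule is not stated in the abstract.
    """
    return alpha * link_score + (1 - alpha) * content_score


def harvest_rate(crawled, relevant):
    """Fraction of crawled pages that are relevant."""
    crawled = list(crawled)
    if not crawled:
        return 0.0
    return sum(1 for url in crawled if url in relevant) / len(crawled)


def target_recall(crawled, targets):
    """Fraction of the known target pages that appear in the crawl."""
    targets = set(targets)
    if not targets:
        return 0.0
    return len(targets & set(crawled)) / len(targets)
```

A crawler built on this sketch would repeatedly call `pop()`, fetch the page, score its out-links with `combined_priority`, and `push` them back; the two metric functions would then be evaluated over the list of fetched URLs.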
Source: Journal of Zhejiang University-Science A (Applied Physics & Engineering), indexed in SCIE, EI, CAS, CSCD; 2009, No. 8, pp. 1114-1124 (11 pages)
Funding: Project (No. 2007C23086) supported by the Science and Technology Plan of Zhejiang Province, China
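Keywords: retrieval algorithm; topic; estimation; uniform resource locator; resource discovery; effective utilization; priority algorithm; breadth-first; focused crawlers; topical crawlers; PageRank; classifiers; on-line topical importance estimation (OTIE) algorithm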

