
Research and Design of a Topical Crawler Based on Content and Link Analysis
Abstract: Existing topical crawler algorithms achieve limited accuracy when fetching topic-relevant web pages. This paper proposes a topical page crawling method based on evaluating both text content and page links. The method first computes the relevance of the current page to the topic, then compares that relevance value with a given threshold to decide whether the page is discarded or stored. The relevance value also determines the priority of URLs in the queue of links waiting to be crawled. The model takes into account the balance between the precision and the coverage of topical pages. The newly designed topical crawler algorithm improves accuracy in fetching topical pages to some extent.
Authors: 舒奔 (Shu Ben), 尹珂 (Yin Ke)
Source: Computer and Modernization (《计算机与现代化》), 2014, No. 4, pp. 77-80 (4 pages)
Keywords: topical crawler; topical correlation; topical webpage
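The crawl loop described in the abstract (score the page, compare against a threshold, and use the score to prioritize outgoing links) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not state the relevance formula, so cosine similarity over term counts is used here as a stand-in. The names `relevance`, `crawl`, `fetch`, and `TOY_WEB`, the default threshold, and the choice to still enqueue links from discarded pages at low priority are all assumptions made for the example.

```python
import heapq
import math
import re
from collections import Counter


def relevance(text, topic_terms):
    """Cosine similarity between page term counts and topic term counts.

    Assumption: the paper's actual relevance measure is not given in the
    abstract; this is a simple stand-in.
    """
    page = Counter(re.findall(r"\w+", text.lower()))
    topic = Counter(t.lower() for t in topic_terms)
    dot = sum(page[t] * w for t, w in topic.items())
    norm = math.sqrt(sum(v * v for v in page.values())) * math.sqrt(
        sum(v * v for v in topic.values())
    )
    return dot / norm if norm else 0.0


def crawl(seeds, fetch, topic_terms, threshold=0.2, max_pages=100):
    """Best-first crawl. `fetch(url)` is a hypothetical callable returning
    (text, links), or None when the page cannot be retrieved."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    stored = {}
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue
        fetched += 1
        text, links = page
        score = relevance(text, topic_terms)
        # Threshold test: store relevant pages, discard the rest.
        if score >= threshold:
            stored[url] = score
        # Outgoing links inherit the parent page's score as their priority.
        # Assumption: links from discarded pages are kept at low priority
        # to preserve coverage.
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return stored


# Hypothetical in-memory "web" so the sketch runs without network access.
TOY_WEB = {
    "a": ("web crawler topic crawler search", ["b", "c"]),
    "b": ("cooking recipes pasta", ["d"]),
    "c": ("focused crawler link analysis crawler", ["d"]),
    "d": ("crawler queue priority", []),
}

result = crawl(["a"], TOY_WEB.get, ["crawler", "link"], threshold=0.3)
print(sorted(result))  # pages a, c and d are stored; b scores 0 and is discarded
```

Raising `threshold` in this sketch stores fewer, more relevant pages, while lowering it stores more pages at the cost of precision, which is the precision/coverage trade-off the abstract refers to.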
