
Improvement of the Web Page Crawling Algorithm in Focused Crawling (cited by: 2)

The Extension of Focused Crawling Strategy
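Abstract: The rapid growth of the Internet makes finding and discovering information on the World Wide Web a major challenge. For the topic- or domain-specific queries that most users pose, traditional general-purpose search engines often fail to return satisfactory result pages; topic-oriented focused crawlers were proposed to overcome this shortcoming. This paper analyzes and compares the crawling methods of current focused crawlers (chiefly web page analysis algorithms and web search strategies) and proposes an improved focused crawling algorithm. The method, based on interclass rules, builds on the architecture of the baseline focused crawler: it applies a naive Bayesian classifier and constructs rules from the statistical link relationships among topic groups in order to find "future reward" pages within a given link distance. Experiments analyzing and evaluating the algorithm show that it improves the focused crawler's harvest rate and coverage.

A focused crawler gathers relevant Web pages on a particular topic. Our work starts from the focused-crawling approach designed by Soumen Chakrabarti, Martin van den Berg and Byron Dom, called the baseline crawler. Building on this crawler, we developed a rule-based crawler, which uses simple rules derived from interclass (topic) linkage patterns to decide its next move. The rule-based crawler also enhances the baseline crawler by supporting tunneling. Initial performance results show that this rule-based Web-crawling approach, which uses linkage statistics among topics, improves the baseline focused crawler's harvest rate and coverage.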
Authors: TAN Jun-shan (谭骏珊), CHEN Ke-qin (陈可钦) (Computer Science, Central South University of Forestry and Technology, Changsha 410004, China)
Source: Computer Knowledge and Technology (《电脑知识与技术》), 2008, No. 12Z, pp. 2145-2146, 2149 (3 pages)
Keywords: baseline focused crawler; naive Bayesian classifier; future benefit rate; rule-based focused crawler; tunneling
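The interclass-rule idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `learn_rules` and `future_reward`, the example class labels, and the scoring rule (the best product of link probabilities along a chain of classes) are assumptions made for illustration. Page classes are assumed to come from a separate classifier such as naive Bayes, which is not shown.

```python
from collections import defaultdict


def learn_rules(labeled_links):
    """Estimate P(dst_class | src_class) from (src_class, dst_class) link pairs.

    Hypothetical helper: the pairs are assumed to come from a training
    crawl whose pages were already labeled by a classifier.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for src, dst in labeled_links:
        counts[src][dst] += 1
    rules = {}
    for src, row in counts.items():
        total = sum(row.values())
        rules[src] = {dst: n / total for dst, n in row.items()}
    return rules


def future_reward(rules, page_class, target, max_hops):
    """Score the outlinks of a page of class `page_class`.

    Assumed scoring rule: the highest product of rule probabilities over
    any chain of classes that reaches `target` within `max_hops` links.
    With max_hops > 1 an off-topic page can still earn a nonzero score,
    which is the tunneling behavior mentioned in the abstract.
    """
    classes = set(rules) | {target, page_class}
    for row in rules.values():
        classes |= set(row)
    # reach[c]: score of a page of class c being on-topic or leading to the target
    reach = {c: 1.0 if c == target else 0.0 for c in classes}
    score = {c: 0.0 for c in classes}
    for _ in range(max_hops):
        score = {
            c: max((p * reach[d] for d, p in rules.get(c, {}).items()), default=0.0)
            for c in classes
        }
        reach = {c: 1.0 if c == target else score[c] for c in classes}
    return score[page_class]


# Toy training data with made-up class labels.
links = (
    [("university", "department")] * 3
    + [("university", "news")]
    + [("department", "course"), ("department", "department")]
)
rules = learn_rules(links)
print(future_reward(rules, "university", "course", 1))  # no direct rule to the target
print(future_reward(rules, "university", "course", 2))  # reachable through "department"
```

In this toy data a "university" page never links directly to a "course" page, so a one-hop score gives its outlinks no priority, while a two-hop score does because "department" pages often link to "course" pages. A crawler could use such scores to order its frontier.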

References (6)

  • 1. Chau N, Chen H. Personalized and Focused Web Spiders. Web Intelligence, 2003.
  • 2. Chakrabarti S. Mining the Web: Discovering Knowledge from Hypertext Data. 2003.
  • 3. Bergmark D, Lagoze C, Sbityakov A. Focused Crawler, Tunneling, and Digital Libraries. Proc. European Conf. Research and Advanced Technology for Digital Libraries, 2002.
  • 4. Cormen T H, Leiserson C E, Rivest R L. Introduction to Algorithms. 1990.
  • 5. Diligenti M, Coetzee F M, et al. Focused crawling using context graphs. Proc. of the International Conference on Very Large Databases (VLDB), 2000.
  • 6. Chakrabarti S, van den Berg M, Dom B. Focused crawling: A new approach to topic-specific web resource discovery. Computer Networks, 1999.

Co-cited literature (17)

  • 1. Xia Chongpu (夏崇镨), Kang Li (康丽). Research on Topical Crawler Technology Based on a Thesaurus [J]. New Technology of Library and Information Service (现代图书情报技术), 2007(5): 41-44. Cited by: 8
  • 2. Chakrabarti S, van den Berg M, Dom B. Focused Crawling: A New Approach to Topic-specific Web Resource Discovery [J]. Computer Networks, 1999, 31(11-16): 1623-1640.
  • 3. Russell S, Norvig P. Artificial Intelligence: A Modern Approach [M]. The 2nd Edition. Upper Saddle River, New Jersey: Prentice Hall, 2003: 94-95.
  • 4. Chakrabarti S. Mining the Web: Discovering Knowledge from Hypertext Data [M]. San Francisco: Morgan Kaufmann Publishers, 2002: 270-279.
  • 5. Haveliwala T H. Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search [J]. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(4): 784-796.
  • 6. Bharat K, Henzinger M R. Improved Algorithms for Topic Distillation in a Hyperlinked Environment [C]. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 1998: 104-111.
  • 7. Pandey S, Olston C. Crawl Ordering by Search Impact [C]. In: Proceedings of the International Conference on Web Search and Web Data Mining (WSDM'08). New York, NY, USA: ACM, 2008: 3-14.
  • 8. Brin S, Page L. The Anatomy of a Large-Scale Hypertextual Web Search Engine [J]. Computer Networks and ISDN Systems, 1998, 30(1-7): 107-117.
  • 9. Kleinberg J M. Authoritative Sources in a Hyperlinked Environment [J]. Journal of the ACM, 1999, 46(5): 604-632.
  • 10. Shchekotykhin K, Jannach D, Friedrich G. xCrawl: A High-recall Crawling Method for Web Mining [C]. In: Proceedings of the 8th IEEE International Conference on Data Mining. Washington: IEEE Computer Society, 2008: 550-559.
