Abstract
The rapid growth of the Internet poses great challenges for finding and discovering information on the World Wide Web. For the topic- or domain-specific queries that most users issue, traditional general-purpose search engines often fail to return satisfactory result pages. To overcome this shortcoming, the topic-oriented focused crawler was proposed. Addressing this research hotspot, this paper analyzes and compares current focused-crawling methods in depth (mainly Web-page analysis algorithms and Web-page search strategies) and proposes an improved focused-crawling algorithm. Building on the architecture of the baseline focused crawler, this interclass-rule-based method applies a naive Bayes classifier and constructs rules from the statistical linkage relations among topic clusters to find "future reward" pages within a certain link distance. Experiments analyze and evaluate the algorithm's performance, showing that it substantially improves the focused crawler's harvest rate and coverage.
A focused crawler gathers relevant Web pages on a particular topic. In our work, we started with a focused-crawling approach designed by Soumen Chakrabarti, Martin van den Berg, and Byron Dom, called the baseline crawler. Building on this crawler, we developed a rule-based crawler, which uses simple rules derived from interclass (topic) linkage patterns to decide its next move. This rule-based crawler also enhances the baseline crawler by supporting tunneling. Initial performance results show that this rule-based Web-crawling approach uses linkage statistics among topics to improve the baseline focused crawler's harvest rate and coverage.
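The two ingredients the abstract names, a naive Bayes page classifier and link scoring from interclass linkage statistics, can be illustrated with a minimal sketch. The class names, training data, hop budget, and the single-best-next-topic simplification below are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch of the rule-based focused-crawling idea from the abstract:
# a naive Bayes classifier assigns a topic to each fetched page, and
# interclass linkage statistics (how often pages of topic A link to
# pages of topic B) estimate whether an off-topic page may still lead
# to on-topic pages within a few hops ("tunneling" / "future reward").
# All names and numbers here are illustrative assumptions.
from collections import Counter, defaultdict
import math


class NaiveBayesTopicClassifier:
    """Multinomial naive Bayes over bag-of-words page text."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> number of docs
        self.vocab = set()

    def train(self, topic, text):
        words = text.lower().split()
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_lp = None, float("-inf")
        for topic, n_docs in self.doc_counts.items():
            lp = math.log(n_docs / total_docs)  # log prior
            denom = sum(self.word_counts[topic].values()) + len(self.vocab)
            for w in words:
                # Laplace smoothing so unseen words do not zero the score
                lp += math.log((self.word_counts[topic][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = topic, lp
        return best


def link_priority(parent_topic, target_topic, linkage_prob, hop_budget=2):
    """Score a frontier link: 1.0 if the parent page is already on topic,
    otherwise the estimated chance that pages of parent_topic reach the
    target topic within hop_budget hops (simplified here by following the
    single most likely next topic at each hop)."""
    if parent_topic == target_topic:
        return 1.0
    p, reach, cur = 1.0, 0.0, parent_topic
    for _ in range(hop_budget):
        nxt, q = max(linkage_prob.get(cur, {}).items(),
                     key=lambda kv: kv[1], default=(None, 0.0))
        if nxt is None:
            break
        p *= q
        if nxt == target_topic:
            reach = p
            break
        cur = nxt
    return reach
```

In a crawler loop, the classifier labels each fetched page, and `link_priority` orders the URL frontier, so links from off-topic pages that statistically "tunnel" toward the target topic are kept instead of being discarded outright.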
Authors
谭骏珊
陈可钦
TAN Jun-shan, CHEN Ke-qin (Computer Science, Central South University of Forestry and Technology, Changsha 410004, China)
Source
《电脑知识与技术》
2008, No. 12Z, pp. 2145-2146, 2149 (3 pages)
Computer Knowledge and Technology