
Research on a Focused Crawling Technique Based on an Improved BFS Algorithm (cited by: 1)

An Improved Best-First Search Algorithm Based Focused Crawling Research
Abstract: By refining and reclassifying the characteristic factors that focused Web crawlers use to predict link priority, this paper introduces two new features, harvest rate and media type, as criteria for judging relevance, and proposes an improved Best-First Search algorithm. The algorithm applies a "fine-grained" strategy to filter out irrelevant Web pages and builds a link-priority formula from representative characteristic factors chosen from multiple perspectives, so that the topic of each link can be revealed and predicted comprehensively. A small-scale experiment comparing it with three other types of focused search algorithms shows that the improved algorithm performs better on harvest rate and on the average number of links submitted.
Author: 乔建忠 (Qiao Jianzhong)
Source: 《现代图书情报技术》 (New Technology of Library and Information Service), CSSCI, Peking University Core Journal, 2013, No. 7, pp. 28-35 (8 pages)
Keywords: Focused crawling; Search algorithm; Best-First Search algorithm; Focused crawler; Characteristic factor
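
The best-first strategy summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's priority formula, so everything specific here is an assumption made for illustration: the feature names (`anchor`, `parent`, `harvest`, `media`), the weights in `WEIGHTS`, the weighted-sum form of `link_priority`, the toy link graph, and the simplification that "filtering" means not expanding the links of an irrelevant page. Harvest rate is computed in the usual focused-crawling sense, as the fraction of fetched pages that are relevant.

```python
import heapq
from itertools import count

# Hypothetical weights: the paper's real formula is not stated in the abstract.
WEIGHTS = {"anchor": 0.4, "parent": 0.3, "harvest": 0.2, "media": 0.1}


def link_priority(features, weights=WEIGHTS):
    """Weighted sum of feature scores, each assumed to lie in [0, 1]."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)


def best_first_crawl(seeds, get_links, is_relevant, max_pages):
    """Visit pages in descending priority; return (visit order, harvest rate)."""
    tie = count()  # tie-breaker so entries with equal priority stay comparable
    frontier = [(-1.0, next(tie), url) for url in seeds]  # min-heap on -priority
    heapq.heapify(frontier)
    seen = set(seeds)
    visited, relevant = [], 0
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        if not is_relevant(url):
            continue  # assumed filter: links of irrelevant pages are not expanded
        relevant += 1
        for child, features in get_links(url):
            if child not in seen:
                seen.add(child)
                heapq.heappush(
                    frontier, (-link_priority(features), next(tie), child)
                )
    harvest_rate = relevant / len(visited) if visited else 0.0
    return visited, harvest_rate


# Toy in-memory link graph (invented data): page -> [(linked page, features)].
GRAPH = {
    "seed": [
        ("a", {"anchor": 0.9, "parent": 1.0, "harvest": 0.5, "media": 1.0}),
        ("b", {"anchor": 0.1, "parent": 1.0, "harvest": 0.5, "media": 0.0}),
    ],
    "a": [("c", {"anchor": 0.8, "parent": 1.0, "harvest": 1.0, "media": 1.0})],
    "b": [],
    "c": [],
}
RELEVANT = {"seed", "a", "c"}

order, rate = best_first_crawl(
    ["seed"], lambda u: GRAPH.get(u, []), lambda u: u in RELEVANT, max_pages=3
)
print(order, rate)
```

In this toy run the link to `c`, discovered late but scored highly, is fetched before the low-scoring link to `b`, which is the behavior that distinguishes best-first from breadth-first crawling. A real crawler would replace `get_links` and `is_relevant` with page fetching and a topic classifier.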


