
Intelligent Topic-Specific Information-Gathering Crawler  Cited by: 4

A Topic-specific Intelligent Web Crawler System
Abstract: This paper introduces a topic-specific intelligent Web crawler system based on Web content and structure mining, and focuses on its CA(C&S) algorithm. The algorithm exploits two properties of neural networks: they can conveniently model the topology of the network, and they compute in parallel. It uses reinforcement learning to judge how relevant a page is to the topic. When computing relevance, it does not consider the full content of the page. Instead, it extracts the important tags from the page's HTML description and analyzes the page's content and structure, and from this judges whether the crawled page is relevant to the topic, with the aim of improving the efficiency and precision of information gathering.
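The tag-based relevance step described in the abstract can be illustrated with a minimal sketch. This is not the paper's CA(C&S) algorithm: the neural network and the reinforcement learning are left out, and a plain weighted term match is swapped in. The choice of tags, the weights in `TAG_WEIGHTS`, and the names `ImportantTagExtractor` and `relevance` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical weights: the abstract does not say which tags count as important.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 1.5, "a": 1.0}


class ImportantTagExtractor(HTMLParser):
    """Collects text inside the weighted tags and ignores the rest of the page."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open weighted tags
        self.snippets = []   # (tag, text) pairs found so far

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if self.open_tags and text:
            self.snippets.append((self.open_tags[-1], text))


def relevance(html, topic_terms):
    """Weighted share of words in important tags that match the topic terms."""
    parser = ImportantTagExtractor()
    parser.feed(html)
    parser.close()
    terms = {t.lower() for t in topic_terms}
    hit = total = 0.0
    for tag, text in parser.snippets:
        weight = TAG_WEIGHTS[tag]
        for word in text.lower().split():
            total += weight
            if word in terms:
                hit += weight
    return hit / total if total else 0.0


page = (
    "<html><head><title>Neural crawler</title></head><body>"
    "<h1>Web mining</h1><p>neural neural neural</p>"
    "<a href='x'>sports news</a></body></html>"
)
topic = ["neural", "crawler", "mining"]
print(relevance(page, topic))  # about 0.67; the <p> text is never inspected
```

Scoring only a handful of tags keeps the per-page cost low, which matches the efficiency motivation stated in the abstract; a crawler could then follow links only from pages whose score exceeds some threshold.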
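Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2006, No. 3, pp. 57-59 (3 pages)
Funding: Key Program of the National Natural Science Foundation of China (69835001); National Key Promotion Plan for Scientific and Technological Achievements (2003EC000001)
Keywords: topic-specific crawler; Web mining; neural network; reinforcement learning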

References (7)

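  • 1. Menczer F, Pant G, Srinivasan P, Ruiz M. Evaluating Topic-driven Web Crawlers[C]. Proceedings of the 24th Annual International ACM SIGIR Conference, 2001.
  • 2. Han Jiawei, Meng Xiaofeng, Wang Jing, Li Sheng'en. Research on Web Mining[J]. Journal of Computer Research and Development (计算机研究与发展), 2001, 38(4): 405-414. Cited by: 356
  • 3. Han Jiawei, Kamber M (Canada). Data Mining[M]. Beijing: China Machine Press, 2001: 223-259.
  • 4. Wang Yongqing. Principles and Methods of Artificial Intelligence[M]. Xi'an: Xi'an Jiaotong University Press, 1999.
  • 5. Grama A, Karypis G, Kumar V, et al. Introduction to Parallel Computing (Second Edition)[M]. Boston: Addison-Wesley, 2003.
  • 6. Brin S, Page L. The Anatomy of a Large-Scale Hypertextual Web Search Engine[C]. Proceedings of the WWW7 Conference, Elsevier, Australia, 1998: 107-117.
  • 7. Yang Bingru. Knowledge Discovery Theory Based on Inner Mechanism and Its Applications[M]. Beijing: Publishing House of Electronics Industry, 2003.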

Second-level references (5)

  • 1. Han J. Data Mining: Concepts and Techniques, 2000.
  • 2. Wang K. Proc of VLDB'97, 1999: 363.
  • 3. Zaiane O R. Proc Int Workshop Web Information and Data Management (WIDM'98), 1998: 9.
  • 4. Mobasher B. Tech Rep: TR 96-050, 1996.
  • 5. Zaiane O R. Proc KDD'95, 1995: 331.

Co-citing documents (375)

Co-cited documents (26)

  • 1. Zhou Lizhu, Lin Ling. Survey on the Research of Focused Crawling Technique[J]. Journal of Computer Applications (计算机应用), 2005, 25(9): 1965-1969. Cited by: 153
  • 2. Zhang Na, Zhang Huaxiang. A Retrieval Algorithm Based on Hyperlinks and Content Relevance[J]. Journal of Computer Applications (计算机应用), 2006, 26(5): 1171-1173. Cited by: 6
  • 3. Rungsawang A, Angkawattanawit N. Learnable Topic-specific Web Crawler[J]. Journal of Network and Computer Applications, 2005, 28(2): 97-114.
  • 4. Chakrabarti S, van den Berg M, Dom B. Focused Crawling: A New Approach to Topic-specific Web Resource Discovery[C]//Proceedings of the 8th International World-Wide Web Conference. Toronto, Canada: [s. n.], 1999.
  • 5. Liu Hongyu, Milios E, Janssen J. Probabilistic Models for Focused Web Crawling[C]//Proceedings of the 6th Annual ACM International Workshop on Web Information and Data Management. New York, USA: ACM Press, 2004.
  • 6. Florescu D, Levy A, Mendelzon A. Database Techniques for the World-Wide Web: A Survey[J]. SIGMOD Record, 1998, 27(3): 59-74.
  • 7. Wei Jiying, Wen Jirong. Instance-based Schema Matching for Web Databases by Domain-specific Query Probing[C]//Proceedings of the 30th International Conference on VLDB. Toronto, Canada: [s. n.], 2004.
  • 8. Rocío L. Using Genetic Algorithms to Evolve a Population of Topical Queries[J]. Information Processing and Management, 2008(44): 1863-1878.
  • 9. Soumen Chakrabarti. Focused Crawling: A New Approach to Topic-specific Web Resource Discovery[J]. Computer Networks, 1999(31): 1623-1640.
  • 10. Andrei Z. Broder, Marc Najork. Efficient URL Caching for World Wide Web Crawling. ACM Press, 2003: 679-689.

Citing documents (4)

Second-level citing documents (31)
