A Topic Crawler Algorithm Based on Semantic Analysis (cited by: 7)
Abstract The enormous number of web pages, and the speed at which that number grows, makes it difficult for general-purpose search engines to return satisfactory results for topic- or domain-oriented queries. The topic crawler studied in this paper collects only topic-relevant information, which greatly reduces the number of web pages that must be processed. It evaluates the topical relevance of each web page and crawls pages with higher relevance first. Using a subspace-based semantic analysis technique combined with Bayesian methods and a support vector machine, we design and implement an efficient topic crawler. Experiments show that the algorithm achieves good accuracy and efficiency.
Source Computer Engineering & Science (《计算机工程与科学》), indexed in CSCD and the Peking University Core Journals list, 2010, No. 9, pp. 145-147, 151 (4 pages)
Keywords topic crawler; subspace; semantic analysis; support vector machine
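
The prioritization described in the abstract (score each page for topical relevance, then crawl the most promising pages first) can be illustrated with a best-first crawl frontier. The sketch below is only an illustration under stated assumptions, not the paper's method: `keyword_relevance` is a toy stand-in for the subspace-based semantic analysis and the Bayesian/SVM classifiers, `fetch` is a hypothetical caller-supplied function returning a page's text and out-links, and giving each out-link its parent's relevance as priority is a common focused-crawling heuristic that the abstract does not confirm.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def keyword_relevance(text: str, topic_terms: Iterable[str]) -> float:
    """Toy relevance score: fraction of topic terms that occur in the text.

    Placeholder only; the paper uses subspace-based semantic analysis
    combined with Bayesian methods and an SVM, which is not reproduced here.
    """
    terms = [t.lower() for t in topic_terms]
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return sum(1 for t in terms if t in words) / len(terms)


def best_first_crawl(
    seeds: List[str],
    fetch: Callable[[str], Tuple[str, List[str]]],
    score: Callable[[str], float],
    max_pages: int = 100,
    threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Visit pages in order of estimated relevance.

    Assumption: an unvisited link inherits the relevance of the page that
    linked to it. Pages scoring below `threshold` are recorded but their
    out-links are not followed.
    """
    frontier: List[Tuple[float, int, str]] = []
    counter = 0  # tie-breaker so equal priorities are served first-in, first-out
    for url in seeds:
        # heapq is a min-heap, so priorities are stored negated.
        heapq.heappush(frontier, (-1.0, counter, url))
        counter += 1
    seen = set(seeds)
    visited: List[Tuple[str, float]] = []

    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        relevance = score(text)
        visited.append((url, relevance))
        if relevance < threshold:
            continue  # off-topic page: do not expand its links
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance, counter, link))
                counter += 1
    return visited
```

In practice `score` would be replaced by a trained classifier and `fetch` by an HTTP client with HTML link extraction; both are left abstract here so that only the priority-queue ordering is shown.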
