
A Crawling Algorithm Based on Topical Similarity for Guiding the Web Crawler Through Tunnels

Cited by: 5
Abstract: Tunneling has long been a difficult problem in topical web crawling. After analyzing the topical features of web pages and the shortcomings of conventional tunneling-based crawling algorithms, this paper proposes a crawling algorithm that uses topical similarity to guide the web crawler through tunnels, and uses a Naive Bayes classifier to improve the accuracy of the topical similarity computation. Experimental results show that the proposed tunneling technique achieves substantially higher precision and recall than conventional tunneling.
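The abstract does not give the algorithm's details, so the following is only a minimal sketch of one way a crawler might use topical similarity to decide how far to tunnel through off-topic pages. The cosine measure over raw term counts, the names `cosine_similarity` and `crawl`, the `threshold` and `max_tunnel` parameters, and the in-memory toy link graph are all assumptions made for illustration. The Naive Bayes classifier mentioned in the abstract is not modeled here.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text, topic):
    """Cosine similarity between the term-count vectors of two strings."""
    a, b = Counter(text.lower().split()), Counter(topic.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def crawl(pages, links, seed, topic, threshold=0.3, max_tunnel=2):
    """Best-first crawl over a toy graph (hypothetical helper).

    A page scoring below `threshold` is treated as part of a tunnel.
    The crawler keeps following links through at most `max_tunnel`
    consecutive off-topic pages, then abandons that path.
    Returns the on-topic pages in the order they were visited.
    """
    # Heap entries: (negated parent score, url, off-topic steps so far).
    frontier = [(-1.0, seed, 0)]
    seen = {seed}
    relevant = []
    while frontier:
        _, url, depth = heapq.heappop(frontier)
        score = cosine_similarity(pages[url], topic)
        if score >= threshold:
            relevant.append(url)
            depth = 0          # back on topic: reset the tunnel counter
        else:
            depth += 1         # one more off-topic page on this path
            if depth > max_tunnel:
                continue       # tunnel too long: do not expand this page
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                # Links inherit the similarity of the page they came from.
                heapq.heappush(frontier, (-score, nxt, depth))
    return relevant


# Toy data (invented): "d" lies behind a 2-page tunnel, "h" behind a 3-page one.
pages = {
    "a": "web crawler topic search",
    "b": "cooking recipes",
    "c": "garden tools",
    "d": "focused web crawler",
    "e": "sports news",
    "f": "movie reviews",
    "g": "celebrity gossip",
    "h": "web crawler tunneling",
}
links = {"a": ["b", "e"], "b": ["c"], "c": ["d"],
         "e": ["f"], "f": ["g"], "g": ["h"]}

print(crawl(pages, links, "a", "web crawler"))
print(crawl(pages, links, "a", "web crawler", max_tunnel=3))
```

In this sketch a larger `max_tunnel` reaches relevant pages hidden behind longer off-topic paths, at the cost of fetching more irrelevant pages along the way.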
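Authors: CHEN Xiaohai (陈小海), ZHOU Ya (周娅)
Source: Computer Engineering & Science (《计算机工程与科学》), CSCD, Peking University Core Journal; 2009, No. 10, pp. 126-128 (3 pages)
Funding: Guangxi Natural Science Foundation (Grant No. 桂科青0832101)
Keywords: topical web crawler; tunneling; topical similarity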

References (10)

  • [1] Kleinberg J. Authoritative Sources in a Hyperlinked Environment[C]//Proc of the 9th ACM-SIAM Symp Algorithms, 1999: 604-632.
  • [2] Cho J, Garcia-Molina H, Page L. Efficient Crawling Through URL Ordering[J]. Computer Networks and ISDN Systems, 1998, 30(1): 161-172.
  • [3] Bergmark D, Lagoze C, Sbityakov A. Focused Crawl Tunneling and Digital Libraries[C]//Proc of the 6th European Conf on Digital Libraries, Rome, 2002: 91-106.
  • [4] Diligenti M, Coetzee F M, Lawrence S, et al. Focused Crawling Using Context Graphs[C]//Proc of the 26th VLDB Conf, 2000: 527-534.
  • [5] Rennie J, McCallum A. Using Reinforcement Learning to Spider the Web Efficiently[C]//Proc of the 16th Int'l Conf on Machine Learning, 1999: 335-343.
  • [6] Salton G, McGill M J. Introduction to Modern Information Retrieval[J]. Journal of the American Society for Information Science, 1983, 41: 288-297.
  • [7] Aas K, Eikvil L. Text Categorisation: A Survey[R]. Technical Report #941, Norwegian Computing Center, 1999.
  • [8] Cao Cungen, Feng Qiangze, Gao Ying, Gu Fang, Si Jinxin, Sui Yuefei, Tian Wen, Wang Haitao, Wang Lili, Zeng Qingtian, Zhang Chunxia, Zheng Yufei, Zhou Xiaobin. Progress in the Development of National Knowledge Infrastructure[J]. Journal of Computer Science & Technology, 2002, 17(5): 523-534. Cited by: 15
  • [9] Chakrabarti S, Dom B, Indyk P. Enhanced Hypertext Categorization Using Hyperlinks[C]//Proc of the ACM SIGMOD Int'l Conf on Management of Data, 1998: 307-318.
  • [10] Pant G, Srinivasan P. Link Contexts in Classifier-Guided Topical Crawlers[J]. IEEE Trans on Knowledge and Data Engineering, 2006, 18(1): 107-122.

Secondary references (33)

  • [1] Guha R V, Lenat D B. Cyc: A midterm report. AI Magazine, 1990, 11(3): 32-59.
  • [2] Guha R V. Contexts: A formalization and some applications. Tech. ACT-CYC-423-91, MCC, Austin, Texas, 1991.
  • [3] Lenat D B. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 1995, 38(11): 33-38.
  • [4] Lenat D B, Guha R V. Building Large Knowledge-Based Systems. Addison-Wesley, MA, 1990.
  • [5] Lenat D B, Miller G A, Yokoi T. CYC, WordNet and EDR - critiques and responses - discussion. Communications of the ACM, 1995, 38(11): 45-48.
  • [6] Clark P, Porter B. Building domain representations from components. AI96-241, University of Texas at Austin, 1996.
  • [7] Clark P, Porter B. Building concept representation from reusable components. In Proceedings of 1997 AAAI, AAAI Press, 1997, pp. 369-376.
  • [8] Richardson S. Determining similarity and inferring relations in a lexical knowledge base [Dissertation]. City University of New York, 1997.
  • [9] Richardson S, Dolan W B, Vanderwende L. MindNet: Acquiring and structuring semantic information from text. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, ACL, 1998, CONF 17, Vol. 2, pp. 1098-1102.
  • [10] Chaudhri V K, Farquhar A et al. The generic frame protocol 2.0. SRI International Technical Report, 1997.

Documents sharing references: 14

Co-cited documents: 24

Citing documents: 5

Secondary citing documents: 19
