Research on chemistry focused crawler with support vector machine classifier

Original title: 支持向量机在化学主题爬虫中的应用 (Application of support vector machines in a chemistry focused crawler). Cited by: 8
Abstract: A crawler is a core component of a search engine: it automatically follows the hyperlinks in Web pages and collects the resources it finds. To collect resources on a specific topic more efficiently, text categorization techniques are used to guide the crawl. This paper applies automatic text categorization based on the support vector machine (SVM) to a chemistry focused crawler. An SVM classifier scores each crawled page, and the scores steer the crawler toward chemistry-related pages and away from irrelevant ones. Compared with a non-focused crawler based on breadth-first search and a focused crawler based on keyword matching, the focused crawler with the SVM classifier collects chemistry Web resources more efficiently.
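Source: Computers and Applied Chemistry (《计算机与应用化学》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2006, Issue 4, pp. 329-332 (4 pages)
Funding: National Natural Science Foundation of China (Grant No. 20273076)
Keywords: support vector machine (SVM), chemistry focused crawler, text categorization, search engine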
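
The crawling strategy summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy in-memory `WEB` graph, the hand-picked `WEIGHTS` and `BIAS` (standing in for a trained linear SVM), and the rule that a link inherits the score of the page it was found on are all assumptions made for illustration. The sketch contrasts a best-first focused crawl with a breadth-first crawl under the same page budget.

```python
import heapq
from collections import deque

# Toy in-memory "web": url -> (page text, outgoing links). Illustrative data only.
WEB = {
    "seed": ("portal index", ["news1", "chem1"]),
    "chem1": ("organic chemistry reaction synthesis", ["chem2", "chem3"]),
    "chem2": ("polymer chemistry catalyst", []),
    "chem3": ("molecule spectroscopy chemistry", []),
    "news1": ("sports news football", ["news2", "news3"]),
    "news2": ("celebrity news gossip", []),
    "news3": ("weather news forecast", []),
}

# Hypothetical term weights and bias standing in for a trained linear SVM (w, b).
WEIGHTS = {"chemistry": 1.5, "reaction": 1.0, "synthesis": 1.0,
           "catalyst": 1.0, "molecule": 1.0, "news": -1.5, "sports": -1.0}
BIAS = -0.5


def svm_score(text):
    """Linear decision value w.x + b over term counts; > 0 means on-topic."""
    return sum(WEIGHTS.get(tok, 0.0) for tok in text.split()) + BIAS


def focused_crawl(seed, budget):
    """Best-first crawl: each new link is queued with its parent page's score."""
    frontier = [(0.0, seed)]  # min-heap keyed on the negated score
    seen, fetched = {seed}, []
    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        fetched.append(url)
        score = svm_score(text)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return fetched


def bfs_crawl(seed, budget):
    """Breadth-first baseline: links are visited in discovery order."""
    queue, seen, fetched = deque([seed]), {seed}, []
    while queue and len(fetched) < budget:
        url = queue.popleft()
        _, links = WEB[url]
        fetched.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def harvest_rate(urls):
    """Fraction of fetched pages that the scorer judges on-topic."""
    return sum(svm_score(WEB[u][0]) > 0 for u in urls) / len(urls)


if __name__ == "__main__":
    for name, crawl in (("focused", focused_crawl), ("breadth-first", bfs_crawl)):
        pages = crawl("seed", 5)
        print(name, pages, harvest_rate(pages))
```

On this toy graph the focused crawl spends its budget on the chemistry pages reachable from `chem1` before descending into the news branch, whereas the breadth-first crawl visits links in discovery order. A real system would replace `svm_score` with the decision function of an SVM trained on labeled chemistry and non-chemistry pages.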