
A Study of an Implementation Method for an Image-Focused Topical Web Crawler (cited by: 2)

Design and Implementation of a Web Crawler for Images
Abstract: This paper designs and implements a topical web crawler for images. It adopts a heuristic method based on textual content: topic relevance is judged from the anchor text of each image file and the context surrounding that anchor, so that relevant image resources are fetched more accurately. Topic relevance is also judged for web pages, so that the crawler's crawl path can be guided more effectively. Experiments show that the system achieves a measure of optimization and lays a good foundation for collecting image information on a targeted topic.
Source: Journal of Nanjing Normal University (Engineering and Technology Edition), CAS, 2008, Issue 4, pp. 115-117, 166 (4 pages)
Keywords: link anchor text; link context; web crawler; JXTA; topical crawler
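The abstract names the idea (scoring an image link by its anchor text and the text around it) but gives no formula. The following is a minimal sketch of one way such a heuristic could look, not the paper's implementation. The term-overlap score, the `window` and `anchor_weight` parameters, the use of `alt` text as a stand-in anchor for inline `<img>` tags, and all function and class names are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

IMAGE_EXT = (".jpg", ".jpeg", ".png", ".gif")  # assumed set of image suffixes


def tokenize(text):
    """Lower-case a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class ImageLinkParser(HTMLParser):
    """Collect image links together with their anchor text and position."""

    def __init__(self):
        super().__init__()
        self.words = []    # words from text nodes outside image anchors
        self.links = []    # one dict per image link: url, anchor, pos
        self._open = None  # the <a> image link currently being read, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and (a.get("href") or "").lower().endswith(IMAGE_EXT):
            self._open = {"url": a["href"], "anchor": [], "pos": len(self.words)}
        elif tag == "img" and a.get("src"):
            # Assumption: treat alt text as the anchor text of an inline image.
            self.links.append({"url": a["src"],
                               "anchor": tokenize(a.get("alt") or ""),
                               "pos": len(self.words)})

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            self.links.append(self._open)
            self._open = None

    def handle_data(self, data):
        tokens = tokenize(data)
        if self._open is not None:
            self._open["anchor"].extend(tokens)  # text inside the image link
        else:
            self.words.extend(tokens)            # surrounding page text


def relevance(words, topic_terms):
    """Fraction of topic terms that occur in words (0.0 to 1.0)."""
    if not topic_terms:
        return 0.0
    present = set(words)
    return sum(1 for t in topic_terms if t in present) / len(topic_terms)


def score_images(html, topic_terms, window=5, anchor_weight=0.7):
    """Return (url, score) pairs mixing anchor-text and context relevance."""
    parser = ImageLinkParser()
    parser.feed(html)
    parser.close()
    results = []
    for link in parser.links:
        pos = link["pos"]
        # Context: up to `window` words before and after the link.
        context = parser.words[max(0, pos - window): pos + window]
        score = (anchor_weight * relevance(link["anchor"], topic_terms)
                 + (1 - anchor_weight) * relevance(context, topic_terms))
        results.append((link["url"], score))
    return results
```

Under these assumptions, a link whose anchor text and surrounding words both contain every topic term scores 1.0, and a link with no topic term in either scores 0.0. A crawler could fetch only the images whose score passes a chosen threshold, and the same `relevance` function could be applied to a page's full text to rank which pages to visit next.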


