
Crawler Algorithm Based on Directory Tree in Network Science and Technology Resource

Cited by: 3
Abstract: Network science and technology resources are classified in many different ways and are large in volume. To address these characteristics, this paper proposes a crawling algorithm based on a directory tree. The algorithm uses ontology knowledge supplied by a domain ontology knowledge base as the evaluation criterion for extracting and recognizing valid directory links, obtains valid node links through an improved link analysis strategy, and then crawls them. The work examines the crawler architecture and emphasizes optimizing the speed at which the newest resources are acquired. Experimental results show that the algorithm effectively improves the resource crawling rate.
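Source: Computer Engineering (《计算机工程》), 2009, No. 1, pp. 277-279, 282 (4 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: Supported by the National Science and Technology Infrastructure Platform Construction Fund (2005DKA63904).
Keywords: science and technology resource; information crawling; directory tree; ontology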
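
The abstract describes scoring directory links against domain ontology knowledge and crawling only the links judged valid. The sketch below illustrates that general idea only; it is not the paper's algorithm, whose scoring function and improved link analysis strategy are not given on this page. The term set `ONTOLOGY_TERMS`, the token-overlap score, the `threshold` value, and the in-memory `SITE` directory tree are all assumptions made up for illustration.

```python
import heapq

# Hypothetical domain terms standing in for an ontology knowledge base.
ONTOLOGY_TERMS = {"science", "technology", "resource", "data", "research"}


def link_score(anchor_text, terms=ONTOLOGY_TERMS):
    """Fraction of anchor-text tokens that appear in the term set."""
    tokens = anchor_text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in terms for t in tokens) / len(tokens)


def crawl(site, root, threshold=0.5):
    """Visit directory links whose score meets the threshold, best first.

    `site` maps a URL to a list of (anchor_text, child_url) pairs, a toy
    stand-in for fetching and parsing a real directory page.
    """
    visited = []
    seen = {root}
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-1.0, root)]
    while frontier:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for anchor, child in site.get(url, []):
            score = link_score(anchor)
            if score >= threshold and child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (-score, child))
    return visited


# Toy directory tree (assumed data, not taken from the paper).
SITE = {
    "/": [
        ("Science Data", "/data"),
        ("Contact Us", "/contact"),
        ("Research Resource Catalog", "/catalog"),
    ],
    "/catalog": [
        ("Technology Reports", "/catalog/reports"),
        ("Site Map", "/sitemap"),
    ],
}

order = crawl(SITE, "/")
print(order)
```

A priority queue is used here so that directory nodes with the strongest term overlap are expanded first, which is one plausible way to favor relevant branches early; a real crawler would replace `SITE` with page fetching and link extraction.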

References (7)

  • 1 Li Jun, Furuse K, Yamaguchi K. Focused Crawling by Exploiting Anchor Text Using Decision Tree[C]//Proc. of the 14th International World Wide Web Conference. Chiba, Japan: [s. n.], 2005: 1190-1191.
  • 2 Cheng Jing, Li Qing, Wang Liping, et al. Automatically Generating an E-textbook on the Web[M]. Berlin, Germany: Springer-Verlag Heidelberg, 2004: 35-42.
  • 3 Chen Xueqi. Query Rewriting for Extracting Data behind HTML Forms[D]. Provo, Utah, USA: Brigham Young University, 2004.
  • 4 Sun Maosong, Chen Qunxiu. Language Computing and Text Processing Based on Contents[M]. Beijing, China: Tsinghua University Press, 2003: 488-494.
  • 5 Li Kui, Cheng Xueqi, Guo Yan, Zhang Kai. Dynamic Web Page Crawling in WWW Forums[J]. Computer Engineering, 2007, 33(6): 80-82. (Cited by: 11)
  • 6 Zeng Yicong, Yang Guanzhong. Research on a Concept-Tree-Based Topic Search Robot System[J]. Science Technology and Engineering, 2006, 6(16): 2458-2463. (Cited by: 3)
  • 7 Troy W. Automating the Extraction of Domain-specific Information from the Web: A Case Study for the Genealogical Domain[D]. Provo, Utah, USA: Brigham Young University, 2004.
