
Research and Implementation of Intelligent Focused Crawler

Chinese title: 专题型网页搜集系统的设计与实现 (Design and Implementation of a Topic-Specific Web Page Crawling System)
Abstract: Many new crawling approaches have been proposed in recent years, and they share a common technique: focused crawling. A focused crawler analyzes its crawl boundary to find the links most likely to be relevant to the topic and avoids irrelevant regions of the Web. This saves substantial hardware and network resources and helps keep the crawled pages up to date. To achieve such goal-directed crawling, this paper proposes two algorithms. The first is a Web page filtering algorithm based on multi-layer classification; experimental results show that it achieves high accuracy and classifies noticeably faster than general-purpose classification algorithms. The second is a URL ordering algorithm based on Web structure, which makes full use of the structural features of the Web and the distribution characteristics of Web pages.
Source: Computer and Modernization (《计算机与现代化》), 2004, No. 10, pp. 1-5, 14 (6 pages)
Funding: National Natural Science Foundation of China (79990580); National 973 Program (G1998030414)
Keywords: URL ordering; focused crawler; multi-layer classification; topic distillation
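
The focused-crawling idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the keyword-overlap scorer only stands in for the multi-layer classifier, and inheriting the parent page's score as link priority only stands in for the Web-structure-based URL ordering. The names `TOY_WEB`, `relevance`, `focused_crawl`, and the `threshold` value are all assumptions made for illustration.

```python
import heapq

# Hypothetical in-memory "Web": url -> (page text, outgoing links).
TOY_WEB = {
    "a": ("crawler search engine index", ["b", "c"]),
    "b": ("focused crawler topic classifier", ["d"]),
    "c": ("cooking recipes pasta", ["e"]),
    "d": ("web crawler url ordering", []),
    "e": ("pasta sauce tomato", []),
}

def relevance(text, topic_terms):
    """Fraction of topic terms found in the page (stand-in for a real classifier)."""
    words = set(text.split())
    return len(words & topic_terms) / len(topic_terms)

def focused_crawl(seed, topic_terms, threshold=0.3, web=TOY_WEB):
    """Fetch pages in priority order; keep on-topic pages and never expand off-topic ones."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    kept = []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        score = relevance(text, topic_terms)
        if score < threshold:
            continue  # page filter: discard the page and ignore its links
        kept.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumed ordering rule: a link inherits its parent's score.
                heapq.heappush(frontier, (-score, link))
    return kept

if __name__ == "__main__":
    print(focused_crawl("a", {"crawler", "url", "topic"}))
```

With the topic terms `{"crawler", "url", "topic"}`, page `c` is fetched but rejected by the filter, so page `e` behind it is never requested. That skipped request is the resource saving the abstract refers to.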


