Abstract
Several new crawling strategies have been proposed in recent years; a technique common to many of them is focused crawling. A focused crawler analyzes its crawl boundary to find the links most likely to be relevant to the crawl topic, and avoids irrelevant regions of the Web. This yields significant savings in hardware and network resources and helps keep the crawled pages up to date. To achieve such goal-directed crawling, this paper proposes two algorithms: a Web page filtering algorithm based on a multilayer classifier, which experimental results show to be more accurate and substantially faster than ordinary classification algorithms; and a URL ordering algorithm based on Web structure, which fully exploits the structural characteristics of the Web and the distribution characteristics of Web pages.
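The core idea of focused crawling described above, visiting URLs in order of estimated topical relevance and pruning irrelevant regions, can be illustrated with a minimal sketch. This is not the paper's multilayer classifier or its Web-structure-based ordering; the `relevance` scorer (simple term overlap), the `threshold`, and the toy `fetch`/`extract_links` callbacks are all assumptions made for illustration:

```python
import heapq

def relevance(text, topic_terms):
    """Assumed stand-in for a page classifier: fraction of topic terms in the text."""
    words = set(text.lower().split())
    return sum(t in words for t in topic_terms) / len(topic_terms)

def focused_crawl(seed_urls, fetch, extract_links, topic_terms,
                  max_pages=100, threshold=0.5):
    """Crawl a best-first frontier ordered by estimated relevance.

    Pages scoring below `threshold` are filtered out and not expanded,
    so irrelevant regions of the Web are never entered.
    """
    # Python's heapq is a min-heap, so scores are negated to pop best-first.
    frontier = [(-1.0, u) for u in seed_urls]
    heapq.heapify(frontier)
    visited, kept = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        score = relevance(fetch(url), topic_terms)
        if score >= threshold:                 # topic filter: prune off-topic pages
            kept.append(url)
            for link in extract_links(url):
                if link not in visited:
                    # Child URLs inherit the parent's score as a cheap estimate.
                    heapq.heappush(frontier, (-score, link))
    return kept

# Toy in-memory "Web" standing in for real fetching and link extraction.
pages = {
    "a": "web crawler search topic",
    "b": "cooking recipes",
    "c": "focused crawler topic relevant",
}
links = {"a": ["b", "c"], "b": [], "c": []}
result = focused_crawl(["a"], pages.get, lambda u: links[u],
                       ["crawler", "topic"])
```

Here page "b" scores zero against the topic terms, so it is dropped and its outlinks are never enqueued, which is the resource-saving behavior the abstract describes.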
Source
Computer and Modernization (《计算机与现代化》)
2004, No. 10, pp. 1-5, 14 (6 pages in total)
Funding
National Natural Science Foundation of China (Grant No. 79990580)
National 973 Program (Grant No. G1998030414)
Keywords
URL ordering
focused crawler
multi-layer classification
topic distillation