Design and Implementation of a Web Crawler Based on a Dynamic Tunneling Algorithm (基于动态隧道算法的网络爬行器设计与实现)

Abstract: Building on an analysis of the search mechanisms used by traditional web crawlers, this paper combines tunneling with web page division (splitting a page into blocks) to guide a focused crawler, and proposes a dynamic tunneling search algorithm on that basis. Experiments searching for "education resources" on the portal sites of four universities show that the new algorithm improves search efficiency and outperforms two standard crawlers for focused crawling.
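The abstract names the two ingredients (tunneling and block-level page analysis) but not the algorithm's details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a toy in-memory site (`web`), a keyword-overlap score (`block_score`), and a fixed tunnel bound (`max_tunnel`); all of these names and values are hypothetical, and the paper's rule for adjusting the tunnel dynamically is not reproduced here.

```python
import heapq


def block_score(text, keywords):
    """Fraction of topic keywords that occur in a block's text (assumed metric)."""
    words = set(text.lower().split())
    return sum(1 for k in keywords if k in words) / len(keywords)


def crawl(web, seed, keywords, threshold=0.5, max_tunnel=2):
    """Return the relevant URLs in the order they are visited.

    web maps url -> list of (block_text, [outgoing urls]).
    Links found in an irrelevant block are still followed, but only while
    the number of consecutive irrelevant steps stays within max_tunnel.
    """
    frontier = [(-1.0, 0, seed)]  # (negated priority, tunnel depth, url)
    seen = {seed}
    relevant = []
    while frontier:
        _, depth, url = heapq.heappop(frontier)
        blocks = web.get(url, [])
        scores = [block_score(text, keywords) for text, _ in blocks]
        if max(scores, default=0.0) >= threshold:
            relevant.append(url)
        for (_, links), score in zip(blocks, scores):
            # A relevant block resets the tunnel; an irrelevant one extends it.
            next_depth = 0 if score >= threshold else depth + 1
            if next_depth > max_tunnel:
                continue
            for link in links:
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score, next_depth, link))
    return relevant


# Hypothetical site: "courses" is reachable only through irrelevant pages.
web = {
    "home": [("university news campus", ["news", "dept"])],
    "dept": [("department staff list", ["courses"])],
    "news": [("sports news", ["archive"])],
    "archive": [("old sports results", ["deep"])],
    "deep": [("education resources", [])],
    "courses": [("education resources courseware", [])],
}
keywords = ["education", "resources"]

print(crawl(web, "home", keywords))                # tunnels through "dept"
print(crawl(web, "home", keywords, max_tunnel=0))  # no tunneling allowed
```

With tunneling disabled, the crawler in this sketch stops at the irrelevant home page; allowing a short tunnel lets it reach the relevant page behind `dept`, while the page behind three irrelevant hops stays out of reach.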

Authors: Ren Xiaoyan (任小燕), Kang Xiaojun (康小军), Zhang Hongwei (张红卫)

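Affiliations: College of Electrical Engineering and Information (电气信息学院), China Three Gorges University; Information Center (信息中心), China Three Gorges University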

Source: New Technology of Library and Information Service (《现代图书情报技术》), indexed in CSSCI and the Peking University Core Journals list, 2008, No. 6, pp. 83-87 (5 pages)

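Funding: One of the research outcomes of the Hubei Provincial Department of Education teaching research project "Reform and Practice of Multi-level Computer Network Experimental Teaching" (Project No. 20070229)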

Keywords: web crawlers; tunneling; web page division

Classification number: TP393.092 [Automation and Computer Technology: Computer Application Technology]



