Realization of Focused Crawler Based on Page Segmentation (基于网页分块技术主题爬行器的实现)

Cited by: 4
Abstract: General-purpose search engines currently return too many results, and many of those results are only weakly related to the topic being searched. To address this, the paper proposes a method for implementing a focused crawler based on web page segmentation, and implements a prototype system, Crawler1. Experimental results indicate that the system performs well: more than 55% of the pages it crawls are relevant to the topic.
Source: Journal of Jilin University: Science Edition (吉林大学学报(理学版)); indexed in CAS, CSCD, and the Peking University Core Journals list; 2007, No. 6, pp. 959-965 (7 pages)
Funding: National Natural Science Foundation of China (Grant No. 60373099)
Keywords: topic-specific search; focused crawling; relevance analysis; page segmentation
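
The abstract does not say how Crawler1 segments pages or scores relevance, so the following is only an illustrative sketch of the general idea behind a segmentation-based focused crawler: split a page into blocks, score each block against the topic, and follow links only from blocks that look relevant. Every name here (`BLOCK_TAGS`, `BlockSplitter`, `relevance`, `links_to_follow`), the keyword-ratio score, and the 0.2 threshold are assumptions made for illustration and are not taken from the paper.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags delimit blocks. The paper's actual rules are not given.
BLOCK_TAGS = {"div", "td", "p", "li"}


class BlockSplitter(HTMLParser):
    """Flatten a page into blocks, each holding its own text and links."""

    def __init__(self):
        super().__init__()
        self.blocks = [{"text": [], "links": []}]

    def _new_block(self):
        self.blocks.append({"text": [], "links": []})

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._new_block()
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.blocks[-1]["links"].append(href)

    def handle_endtag(self, tag):
        # Closing a block tag also starts a fresh block (nesting is ignored).
        if tag in BLOCK_TAGS:
            self._new_block()

    def handle_data(self, data):
        if data.strip():
            self.blocks[-1]["text"].append(data.strip())


def relevance(text, topic_terms):
    """Hypothetical score: fraction of words in the block that are topic terms."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


def links_to_follow(html, topic_terms, threshold=0.2):
    """Return links found only in blocks whose score reaches the threshold."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()  # flush any buffered trailing text
    selected = []
    for block in splitter.blocks:
        if relevance(" ".join(block["text"]), topic_terms) >= threshold:
            selected.extend(block["links"])
    return selected


if __name__ == "__main__":
    page = (
        '<div>Cheap flights and hotel deals <a href="/ads">click here</a></div>'
        '<div>A focused crawler fetches topic pages '
        '<a href="/crawler">crawler design</a></div>'
    )
    print(links_to_follow(page, {"focused", "crawler", "topic"}))
```

In this toy page the advertising block contains no topic terms, so its link is skipped, while the link in the second block is kept. Scoring at block level rather than page level is what lets a crawler of this kind ignore navigation and advertising links on an otherwise relevant page.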