
The Crawling Refreshment Algorithm Based on URL Type and Outlink Change (基于URL类型和网页链接变化的信息采集更新算法) Cited by: 1
Abstract: By observing the regularities in how Web sites present their pages and the structural characteristics of the pages themselves, the authors propose an entry-page identification algorithm based on URL type and the pattern of outlink change, so that entry pages are fetched first. In practical use, the algorithm achieved good refresh results.
Source: Journal of Zhengzhou University (Natural Science Edition), CAS, 2007, No. 2, pp. 60-64 (5 pages)
Funding: National Natural Science Foundation of China, grant no. 90412015
Keywords: entry page; page refreshment; incremental crawler
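The abstract names two signals, URL type and outlink change, but this record does not give the algorithm itself. The following is only a minimal sketch of how such signals might be combined to rank pages for refresh. The four URL categories (root, subroot, path, file) follow the usual convention in the entry-page search literature and are an assumption here, as are the function names, the weights in `TYPE_WEIGHT`, and the equal-weight scoring formula; none of them are taken from the paper.

```python
from urllib.parse import urlparse

# Hypothetical settings: file names treated as directory indexes, and
# illustrative weights per URL type (not values from the paper).
INDEX_NAMES = {"index.html", "index.htm"}
TYPE_WEIGHT = {"root": 1.0, "subroot": 0.75, "path": 0.5, "file": 0.25}


def url_type(url: str) -> str:
    """Classify a URL as root, subroot, path, or file (assumed categories)."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    if segments and segments[-1].lower() in INDEX_NAMES:
        segments.pop()  # an index file counts as its directory
    elif segments and "." in segments[-1] and not path.endswith("/"):
        return "file"  # ends in a file name other than an index file
    if not segments:
        return "root"
    if len(segments) == 1:
        return "subroot"
    return "path"


def outlink_change(old_links, new_links) -> float:
    """Fraction of outlinks added or removed between two crawls (0.0 to 1.0)."""
    old, new = set(old_links), set(new_links)
    union = old | new
    if not union:
        return 0.0
    return len(old ^ new) / len(union)


def refresh_priority(url, old_links, new_links) -> float:
    """Assumed score: equal-weight mix of URL-type weight and outlink change."""
    return 0.5 * TYPE_WEIGHT[url_type(url)] + 0.5 * outlink_change(old_links, new_links)


def schedule(pages) -> list:
    """Order URLs so that likely entry pages are refreshed first.

    `pages` maps each URL to a pair (old_links, new_links).
    """
    return sorted(pages, key=lambda u: refresh_priority(u, *pages[u]), reverse=True)


if __name__ == "__main__":
    pages = {
        "http://example.com/news/a.html": (["x"], ["x"]),
        "http://example.com/": (["a", "b"], ["b", "c"]),
    }
    for url in schedule(pages):
        print(url, url_type(url))
```

In this sketch a site home page whose outlinks keep changing sorts ahead of a leaf article whose links are stable, which is the behavior the abstract describes as fetching entry pages first.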

References (7)

  • 1 Wen Kunmei, Lu Zhengding. Research on classification-based Web page update methods in search engines [J]. Computer Science, 2004, 31(B09): 1-2. Cited by: 1
  • 2 Edwards J, McCurley K, Tomlin J. An adaptive model for optimizing performance of an incremental Web crawler [C] // Proceedings of the 10th Int'l Conference on World Wide Web. New York: ACM Press, 2001: 106-113.
  • 3 Castillo C, Baeza-Yates R. A new model for Web crawling [C] // Proceedings of the 11th World Wide Web Conference. New York: ACM Press, 2002: 1-4.
  • 4 Yan H F, Wang J Y, Li X M, et al. Architectural design and evaluation of an efficient Web-crawling system [J]. Journal of Systems and Software, 2002, 60(3): 185-193.
  • 5 Meng Tao, Wang Jimin, Yan Hongfei. Web page change and incremental crawling techniques [J]. Journal of Software, 2006, 17(5): 1051-1067. Cited by: 22
  • 6 Kraaij W, Westerveld T, Hiemstra D. The importance of prior probabilities for entry page search [C] // Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM Press, 2002: 1-2.
  • 7 Hu Jungang, Dong Shoubin, Chen Xiaozhi, Zhang Yuanfeng. An entry page search algorithm based on URL type priority [J]. Journal of Shandong University (Natural Science), 2006, 41(3): 63-67. Cited by: 1

