期刊文献+

基于特征串的大规模中文网页快速去重算法研究 被引量:41

The Study on Large Scale Duplicated Web Pages of ChineseFast Deletion Algorithm Based on String of Feature Code
下载PDF
导出
摘要 网页检索结果中 ,用户经常会得到内容相同的冗余页面 ,其中大量是由于网站之间的转载造成。它们不但浪费了存储资源 ,并给用户的检索带来诸多不便。本文依据冗余网页的特点引入模糊匹配的思想 ,利用网页文本的内容、结构信息 ,提出了基于特征串的中文网页的快速去重算法 ,同时对算法进行了优化处理。实验结果表明该算法是有效的 ,大规模开放测试的重复网页召回率达 97 3% ,去重正确率达 99 5 %。 Reprinting of information between websites produces a great deal redundant web pages that not only waste storage resource but also bring many burdens to user in retrieval and reading.In this paper string of feature code based algorithm is developed to remove the duplicated web page after analyzing the feature of the redundant web page.The idea of fuzzy matching and information of content and structure of the text of web page are introduced into the algorithm,and the efficiency of the algorithm is optimized.The experiment results show that the algorithm is effective.The recall rate of duplicated web pages reaches 97.3%,and the precision rate of the duplication removal reaches 99.5% in large scale testing.
出处 《中文信息学报》 CSCD 北大核心 2003年第2期28-35,共8页 Journal of Chinese Information Processing
关键词 计算机应用 中文信息处理 特征串 模糊匹配 去重算法 冗余网页 computer application Chinese information processing String of Feature Code Fuzzy Matching Duplicate Removal Algorithm Redundant Web Pages
  • 相关文献

参考文献5

  • 1[1]T.W. Yan and H. Garcia- Molina. Duplicate removal in information dissemination. In Proceedings of the 21st International Conference on Very Large Data Bases(VLDB' 95) ,66 - 77,San Francisco,Ca., USA,September 1995. Morgan Kaufmann Publishers, Inc.
  • 2[2]Narayanan Shivakumar and Hector Garcia- Molina. SCAM: a copy detection mechanism for digital documents. In Proceedings of 2nd International Conference in Theory and Practice of Digital Libraries (DL'95) ,Austin, Texas,June 1995.
  • 3[3]T. Yan and H. Garcia- Molina. The sift information dissemination system. In ACM TODS,2000.
  • 4[4]J.W. Kirriemuir & P. Willett Identification of duplicate and near - duplicate full - text records in database search outputs using hierarchic cluster analysis,in Program-automated library and information,(1995)29(3) :241-256.
  • 5[5]Buckley C. ,Cardie C. ,Mardis S. ,Mitra M. ,Pierce D. ,Wagstaff K. ,Walz J. ,The Smart/Empire TIPSTER IR System, TIPSTER Phase Ⅲ Proceedings,Morgan Kaufmann,San Francisco,CA,2000.

同被引文献289

引证文献41

二级引证文献125

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部