
Duplicate Web Page Elimination Based on HTML and Extraction of Long Sentence (基于HTML标记和长句提取的网页去重算法) Cited by: 2
Abstract: We propose an efficient algorithm for eliminating duplicate web pages on the Internet. The algorithm uses HTML tags to filter out the noise in a page, then extracts the long sentences that characterize the page and uses them as its features. Two pages are judged to be duplicates according to the number of long sentences they share. The algorithm also indexes the long sentences in a red-black tree, which turns duplicate elimination into a search for long sentences and reduces the time complexity. Experimental results show that the algorithm eliminates duplicate web pages efficiently and accurately.
Source: Microcomputer Applications (《微型电脑应用》), 2009, No. 8, pp. 30-32, 5 (3 pages in total)
Keywords: duplicate web page elimination; page cleanup; long sentence; red-black tree
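
The pipeline summarized in the abstract (filter noise by HTML tag, extract long sentences, look them up in an ordered index, count shared sentences) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract gives no thresholds or tag rules, so MIN_SENTENCE_LEN, DUP_THRESHOLD, NOISE_TAGS, and the SentenceIndex class are all hypothetical. Python's standard library has no red-black tree, so a sorted list searched with bisect stands in for it; lookups are still logarithmic, but insertion is linear, unlike a real red-black tree.

```python
import bisect
import re
from html.parser import HTMLParser

MIN_SENTENCE_LEN = 20   # assumed cutoff for a "long" sentence (characters)
DUP_THRESHOLD = 2       # assumed number of shared long sentences for a duplicate


class _TextExtractor(HTMLParser):
    """Collect page text, skipping tags whose content is treated as noise."""

    NOISE_TAGS = {"script", "style", "a"}  # assumed noise tags, not from the paper

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def long_sentences(html):
    """Return the set of long sentences found in the filtered page text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Split on common Chinese and English sentence terminators.
    parts = re.split(r"[。!?!?.;;\n]+", text)
    normalized = {" ".join(p.split()) for p in parts}
    return {s for s in normalized if len(s) >= MIN_SENTENCE_LEN}


class SentenceIndex:
    """Ordered index of long sentences (sorted list as a red-black tree stand-in)."""

    def __init__(self):
        self._keys = []     # long sentences, kept sorted
        self._owners = []   # page id that first contributed each sentence

    def add_page(self, page_id, html):
        """Return the id of an indexed page this one duplicates, else index it."""
        hits = {}
        missing = []
        for s in long_sentences(html):
            i = bisect.bisect_left(self._keys, s)
            if i < len(self._keys) and self._keys[i] == s:
                owner = self._owners[i]
                hits[owner] = hits.get(owner, 0) + 1
            else:
                missing.append(s)
        for owner, count in hits.items():
            if count >= DUP_THRESHOLD:
                return owner          # duplicate: do not index this page
        for s in missing:
            i = bisect.bisect_left(self._keys, s)
            self._keys.insert(i, s)
            self._owners.insert(i, page_id)
        return None
```

Calling add_page for each crawled page in turn returns None for a page that is new and the id of the earlier page when enough long sentences are shared, so deduplication reduces to a series of index lookups.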

References (5)

  • 1. Cormen T H, et al. Introduction to Algorithms [M]. Beijing: Higher Education Press, 2002: 273-293.
  • 2. Broder A. Syntactic Clustering of the Web [C] // 6th International World Wide Web Conference, Apr. 1997: 393-404.
  • 3. Fetterly D. On the Evolution of Clusters of Near-Duplicate Web Pages [C] // 1st Latin American Web Congress, Nov. 2003: 37-45.
  • 4. Rabin M. Fingerprinting by Random Polynomials. Report TR-15-81 [R]. Center for Research in Computing Technology, Harvard University, 1981.
  • 5. Salton G, McGill M. Introduction to Modern Information Retrieval [M]. New York: McGraw-Hill, 1983.

Co-cited literature (11)

  • 1. Wang Jianyong, Xie Zhengmao, Lei Ming, Li Xiaoming. Research and evaluation of near-mirror web page detection algorithms (近似镜像网页检测算法的研究与评价) [J]. Acta Electronica Sinica, 2000, 28(z1): 130-132. Cited by: 21
  • 2. Wang Xiaohua, Lu Xiaokang. Research on an N-Gram-based text deduplication method (基于N-Gram的文本去重方法研究) [J]. Journal of Hangzhou Dianzi University (Natural Sciences), 2010, 30(2): 61-64. Cited by: 5
  • 3. Li Wei, Liu Jian-yi, Wang Cong. Web document duplicate removal algorithm based on keyword sequences [C] // Proc of Natural Language Processing and Knowledge Engineering. Valencia: IEEE Press, 2005: 511-516.
  • 4. Heintze N. Scalable document fingerprinting [C] // Proc of the 2nd USENIX Workshop on Electronic Commerce. Oakland, CA: Citeseer, 1996: 191-200.
  • 5. Broder A Z, Glassman S C, Manasse M S. Syntactic clustering of the Web [C] // Proc of the 6th International Web Conference. Amsterdam: Elsevier Science Publisher B.V, 1997: 1157-1166.
  • 6. Cormen T H, Leiserson C E, Rivest R L, et al. Introduction to Algorithms [M]. Massachusetts: MIT Press, 2002: 273-293.
  • 7. Broder A, et al. Syntactic Clustering of the Web. 6th International World Wide Web Conference, 1997.
  • 8. Broder A, Mitzenmacher M. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 2004.
  • 9. Bloom B H. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 1970.
  • 10. Wei Lixia, Zheng Jiaheng. Web page deduplication based on web page text structure (基于网页文本结构的网页去重) [J]. Journal of Computer Applications, 2007, 27(11): 2854-2856. Cited by: 13

Citing literature (2)

Second-level citing literature (19)
