

Near-replicas of Web Pages Detection Based on Levenshtein Distance
Abstract The Internet contains a large number of replicated web pages, and detecting near-replicas is key to improving the efficiency of information retrieval and large-scale web page collection. Building on a study of existing near-replica detection algorithms based on "fingerprints" or feature codes, this paper proposes a detection algorithm based on Levenshtein distance (edit distance): the similarity between two web pages is obtained by computing the Levenshtein distance between their fingerprint sequences. Fingerprint and feature-code algorithms do not take the body-text structure of a page into account. The proposed algorithm compares both the content and the text structure of pages, which makes the duplicate judgment more accurate. Experiments show that the algorithm is effective, and both its precision and its recall are fairly high.
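Authors Ding Zeya (丁泽亚), Zhang Quan (张全)
Source Network New Media Technology (《网络新媒体技术》), 2013, No. 6, pp. 1-7 (7 pages)
Funding National High Technology Research and Development Program of China (863 Program), "Twelfth Five-Year Plan" project (2012AA011102); State Language Commission "Twelfth Five-Year Plan" research project (YB125-53); Chinese Academy of Sciences Academic Divisions consulting project (Y129091211)
Keywords Internet; near-replica detection; fingerprint; Levenshtein distance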
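
The idea stated in the abstract, scoring two pages by the Levenshtein distance between their fingerprint sequences, can be sketched as follows. This is only an illustration under stated assumptions and not the authors' implementation: the abstract does not say how a page is split into blocks, which hash produces a fingerprint, or how distance is turned into similarity. Here `fingerprint_sequence` (one truncated SHA-256 digest per non-empty line) and the normalization `1 - distance / max_length` in `similarity` are hypothetical choices.

```python
import hashlib


def fingerprint_sequence(text):
    """Assumed fingerprinting: one short hash per non-empty line of body text."""
    blocks = [line.strip() for line in text.splitlines() if line.strip()]
    return [hashlib.sha256(b.encode("utf-8")).hexdigest()[:8] for b in blocks]


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(page_a, page_b):
    """Assumed normalization: 1.0 for identical sequences, 0.0 for disjoint ones."""
    fa, fb = fingerprint_sequence(page_a), fingerprint_sequence(page_b)
    longest = max(len(fa), len(fb))
    if longest == 0:
        return 1.0  # two empty pages are treated as identical
    return 1.0 - levenshtein(fa, fb) / longest


page_a = "Title\nParagraph one.\nParagraph two.\nFooter"
page_b = "Title\nParagraph one.\nParagraph two, edited.\nFooter"
print(similarity(page_a, page_b))  # one of four blocks differs
```

Because the distance is computed over an ordered sequence of block fingerprints, reordering or dropping blocks changes the score, which is how a comparison of this kind reflects text structure as well as content. A threshold on the score for declaring two pages near-replicas would have to be chosen experimentally.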








