Abstract
The Internet contains a large number of duplicate web pages, so near-duplicate detection is one of the keys to improving efficiency in information retrieval and large-scale web page collection. Building on existing near-duplicate detection algorithms based on "fingerprints" or feature codes, this paper proposes a detection algorithm based on edit distance: the similarity between two web pages is obtained by computing the edit distance between their fingerprint sequences. This overcomes a shortcoming of fingerprint- and feature-code-based algorithms, which ignore the structure of the page text; by comparing pages on both content and text structure, duplicate judgments become more accurate. Experiments show that the algorithm is effective, with high precision and recall.
Many web pages on the Internet are replicated, and finding near-replicas of web pages has become key to improving the efficiency of information retrieval and web page collection. This paper first reviews existing near-replica detection algorithms, including those based on "fingerprints" or feature codes. We then propose a near-replica detection algorithm based on Levenshtein distance: the similarity between two web pages is obtained by computing the Levenshtein distance between their fingerprint sequences. This algorithm overcomes the shortcoming that fingerprint- and feature-code-based algorithms do not take the text structure of web pages into account; it compares both the text content and the structure of web pages, making near-replica detection more accurate. Experiments show that the algorithm is effective, with high precision and recall.
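The core idea described above can be sketched in a few lines: represent each page as a sequence of block fingerprints, compute the Levenshtein distance between the two sequences with dynamic programming, and map it to a similarity score. This is a minimal illustration, not the paper's implementation; the fingerprint values and the normalization by the longer sequence are assumptions for the example.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, computed row by row."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def page_similarity(fp_a, fp_b):
    """Map edit distance between fingerprint sequences to [0, 1]."""
    if not fp_a and not fp_b:
        return 1.0
    return 1.0 - levenshtein(fp_a, fp_b) / max(len(fp_a), len(fp_b))

# Hypothetical block-fingerprint sequences of two pages:
fp1 = [0x3a, 0x91, 0x4c, 0x77]
fp2 = [0x3a, 0x91, 0x4c, 0x88]
print(page_similarity(fp1, fp2))  # 0.75: one block differs out of four
```

Because insertions and deletions are counted, two pages with the same blocks in a different order or with extra blocks interleaved are penalized, which is how the approach captures text structure as well as content.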
Source
《网络新媒体技术》 (Network New Media Technology)
2013, No. 6, pp. 1-7 (7 pages)
Funding
National High Technology Research and Development Program of China (863 Program), "12th Five-Year Plan" project (2012AA011102)
State Language Commission "12th Five-Year Plan" research project (YB125-53)
Consultation project of the Academic Divisions of the Chinese Academy of Sciences (Y129091211)
Keywords
Internet
Near-replica Detection
Fingerprint
Levenshtein Distance