Abstract
The Internet contains a large number of duplicate web pages, so near-duplicate detection is one of the keys to improving efficiency in information retrieval and large-scale web page collection. Building on existing near-duplicate detection algorithms based on "fingerprints" or feature codes, this paper proposes a detection algorithm based on edit distance: the similarity between two web pages is obtained by computing the edit distance between their fingerprint sequences. This overcomes a shortcoming of fingerprint- and feature-code-based algorithms, which ignore the structure of the page text; by comparing pages on both content and text structure, duplicate judgments become more accurate. Experiments show that the algorithm is effective, with high precision and recall.
Many web pages on the Internet are replicated, and finding near-replicas of web pages has become key to improving the efficiency of information retrieval and web page collection. This paper first reviews existing near-replica detection algorithms, including those based on "fingerprints" or feature codes. We then propose a near-replica detection algorithm based on Levenshtein distance: the similarity between two web pages is obtained by computing the Levenshtein distance between their fingerprint sequences. This algorithm overcomes the shortcoming that fingerprint- and feature-code-based algorithms do not take the text structure of web pages into account; it compares both the text content and the structure of web pages, making near-replica detection more accurate. Experiments show that the algorithm is effective, with high precision and recall.
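The core idea described above can be sketched in a few lines: represent each page as a sequence of block fingerprints, compute the Levenshtein distance between the two sequences with dynamic programming, and map it to a similarity score. This is a minimal illustration, not the paper's implementation; the fingerprint values and the normalization by the longer sequence are assumptions for the example.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, computed row by row."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def page_similarity(fp_a, fp_b):
    """Map edit distance between fingerprint sequences to [0, 1]."""
    if not fp_a and not fp_b:
        return 1.0
    return 1.0 - levenshtein(fp_a, fp_b) / max(len(fp_a), len(fp_b))

# Hypothetical block-fingerprint sequences of two pages:
fp1 = [0x3a, 0x91, 0x4c, 0x77]
fp2 = [0x3a, 0x91, 0x4c, 0x88]
print(page_similarity(fp1, fp2))  # 0.75: one block differs out of four
```

Because insertions and deletions are counted, two pages with the same blocks in a different order or with extra blocks interleaved are penalized, which is how the approach captures text structure as well as content.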
Source
《网络新媒体技术》 (Network New Media Technology)
2013, No. 6, pp. 1-7 (7 pages)
Funding
National High Technology Research and Development Program of China (863 Program), "12th Five-Year Plan" project (2012AA011102)
State Language Commission "12th Five-Year Plan" research project (YB125-53)
Consultation project of the Academic Divisions of the Chinese Academy of Sciences (Y129091211)
Keywords
Internet
Near-replica Detection
Fingerprint
Levenshtein Distance