Research and Improvement of Data De-duplication Technology Based on the Simhash Algorithm

Cited by: 15
Abstract To improve the accuracy of similar-data detection in large-scale document de-duplication, this paper studies large-scale document de-duplication based on the Simhash algorithm in depth. The computation of the Simhash signature is improved over the original algorithm: ICTCLAS word segmentation is introduced, TF-IDF is used as the main method for computing weights, and two further factors, the part of speech and the word length of each feature, are taken into account. The resulting signatures are then compared by Hamming distance to determine accurately whether the items being compared are similar. Experimental results show that the improved algorithm achieves high accuracy and recall, and that its overall detection performance is superior to both the Shingle algorithm and the original Simhash algorithm. By improving the precision of the signature, similar data in large-scale document collections can be detected accurately, giving a satisfactory de-duplication result.
Source Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), PKU Core journal, 2016, No. 3, pp. 85-91 (7 pages)
Funding Supported by the National Natural Science Foundation of China (11501302)
Keywords similarity detection; Simhash algorithm; TF-IDF; fingerprint calculation; Hamming distance
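The pipeline named in the abstract (weight the features, build a Simhash signature, compare signatures by Hamming distance) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: documents arrive already tokenized (a stand-in for ICTCLAS segmentation), the per-token hash is the first 64 bits of MD5, the signature is 64 bits wide, and the IDF is smoothed. The part-of-speech and word-length factors are left out because the abstract does not give their formulas.

```python
import hashlib
import math
from collections import Counter

BITS = 64  # signature width; an assumption, not stated in the abstract


def tf_idf_weights(tokens, corpus):
    """Return a TF-IDF weight for each distinct token of one document.

    `corpus` is a list of token lists. The smoothed IDF used here is an
    illustrative choice, not the weighting formula from the paper.
    """
    tf = Counter(tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / len(tokens)) * idf
    return weights


def token_hash(term):
    """Hash a token to a 64-bit integer (first 8 bytes of MD5; an assumption)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(weights):
    """Build a signature: add +w per set hash bit, -w per clear bit, keep the sign."""
    acc = [0.0] * BITS
    for term, w in weights.items():
        h = token_hash(term)
        for i in range(BITS):
            acc[i] += w if (h >> i) & 1 else -w
    signature = 0
    for i in range(BITS):
        if acc[i] > 0:
            signature |= 1 << i
    return signature


def hamming_distance(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")
```

To use the sketch, compute `simhash(tf_idf_weights(doc, corpus))` for each tokenized document and treat two documents as near-duplicates when `hamming_distance` of their signatures falls below a chosen threshold. The abstract does not state a threshold, so any value would have to be tuned on the data.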
