Simhash-Based Optimization Method for Fast Massive Similar Document Search
Original title: 基于Simhash的海量相似文档快速搜索优化方法 (cited by: 7)
Abstract: Similar document search retrieves documents that are similar to a given query document. It is widely used in big data processing, for example in near-duplicate web page detection, news report aggregation, and plagiarism detection. To search massive document collections quickly, the Simhash fingerprint method maps each document to a binary fingerprint, expresses document similarity as the Hamming distance between fingerprints, and builds an index over fingerprint segments to improve computational efficiency. Traditional methods perform a large amount of redundant computation during fingerprint segmentation, which reduces efficiency. To address this, a candidate-set filtering method based on sequential matching is proposed; it reduces both the number of fingerprint similarity computations and the network bandwidth consumed, enabling fast search. Experiments show that the method achieves good performance and scalability.
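Source: Command Information System and Technology (《指挥信息系统与技术》), 2015, No. 2, pp. 61-65 (5 pages)
Funding: Partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization
Keywords: Simhash method; similar document search; sequence match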
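
The baseline pipeline summarized in the abstract (binary fingerprints, Hamming distance as the similarity measure, and an index built over fingerprint segments) can be illustrated with a minimal Python sketch. This is not the paper's sequential-matching candidate filter, whose details the abstract does not give. It assumes unweighted tokens, a 64-bit fingerprint derived from MD5, and a distance threshold of 3, and every name in it (`simhash`, `hamming`, `segments`, `SegmentIndex`) is hypothetical. The index relies on the pigeonhole argument: if two fingerprints differ in at most k bits and each is cut into k + 1 segments, at least one segment must match exactly.

```python
import hashlib
from collections import defaultdict

FINGERPRINT_BITS = 64  # assumed fingerprint width


def simhash(tokens, bits=FINGERPRINT_BITS):
    """Simhash over unweighted tokens: each token votes +1/-1 per bit."""
    counts = [0] * bits
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i, c in enumerate(counts):
        if c > 0:
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def segments(fp, parts, bits=FINGERPRINT_BITS):
    """Cut a fingerprint into `parts` equal-width (position, value) keys."""
    width = bits // parts
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(parts)]


class SegmentIndex:
    """Exact-match index on segments; candidates are verified afterwards."""

    def __init__(self, max_distance=3, bits=FINGERPRINT_BITS):
        self.k = max_distance
        self.bits = bits
        self.parts = max_distance + 1  # pigeonhole: k + 1 segments
        self.buckets = defaultdict(set)
        self.fingerprints = {}

    def add(self, doc_id, fp):
        self.fingerprints[doc_id] = fp
        for key in segments(fp, self.parts, self.bits):
            self.buckets[key].add(doc_id)

    def query(self, fp):
        # Collect every document sharing at least one segment with the query.
        candidates = set()
        for key in segments(fp, self.parts, self.bits):
            candidates |= self.buckets.get(key, set())
        # Verify candidates with the full Hamming distance.
        return sorted(
            d for d in candidates if hamming(self.fingerprints[d], fp) <= self.k
        )
```

In this sketch the verification step in `query` is where redundant work accumulates: a document that shares several segments with the query is reached through several buckets, and the set union is what keeps it from being checked more than once. The candidate-set filtering proposed in the paper targets this stage, but its exact procedure is not reproduced here.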
References (11)

  • 1. Govindaraju V, Ramanathan K. Similar document search and recommendation[J]. Journal of Emerging Technologies in Web Intelligence, 2012, 4(1): 84-93.
  • 2. Dasdan A, D'Alberto P, Kolay S, et al. Automatic retrieval of similar content using search engine query interface[C]//Proceedings of the 18th ACM Conference on Information and Knowledge Management. Hong Kong: ACM, 2009: 701-710.
  • 3. Pereira A, Ziviani N. Retrieving similar documents from the Web[J]. Journal of Web Engineering, 2004, 2(4): 247-261.
  • 4. Charikar M. Similarity estimation techniques from rounding algorithms[C]//Proceedings of the 34th Annual ACM Symposium on Theory of Computing. Montreal: ACM, 2002: 380-388.
  • 5. Manku G, Jain A, Sarma A D. Detecting near-duplicates for Web crawling[C]//Proceedings of the 16th International Conference on World Wide Web. Banff: ACM, 2007: 141-149.
  • 6. Papadimitriou P, Garcia-Molina H, Dasdan A. Web graph similarity for anomaly detection[J]. Journal of Internet Services and Applications, 2010, 1(1): 19-30.
  • 7. Xu Jihui (徐济惠). Research on anti-cheating technology for massive documents based on the Simhash algorithm[J]. Computer Technology and Development (计算机技术与发展), 2014, 24(9): 103-107. (In Chinese; cited by: 7)
  • 8. Uddin M S, Roy C K, Schneider K A, et al. On the effectiveness of simhash for detecting near-miss clones in larger scale software systems[C]//Proceedings of the 18th Working Conference on Reverse Engineering (WCRE). Lero: IEEE, 2011: 13-22.
  • 9. Williams K, Wu J, Giles C L. SimSeerX: a similar document search engine[C]//Proceedings of the 2014 ACM Symposium on Document Engineering. Fort Collins: ACM, 2014: 143-146.
  • 10. Song Jinyu, Chen Shuang, Guo Dapeng, Wang Neimeng (宋金玉, 陈爽, 郭大鹏, 王内蒙). Data quality and data cleaning methods[J]. Command Information System and Technology (指挥信息系统与技术), 2013, 4(5): 63-70. (In Chinese; cited by: 31)
