
Efficient Near-duplicate Detection Based on Multiple SimHash Fingerprints (cited by: 21)
Abstract: Near-duplicate text detection has become an active research topic, and detection based on SimHash fingerprints is one of the mainstream approaches. SimHash, however, maps a high-dimensional term vector to a single compact, fixed-length fingerprint, and this single bit length loses a certain amount of information. To make the fingerprints represent as much of a document's content and features as possible, this paper first analyzes the statistical characteristics of term sets and then proposes a near-duplicate detection algorithm based on multiple SimHash fingerprints and k-dimensional hypersurfaces. Experiments show that the multi-fingerprint algorithm improves precision and F1, while the added time cost is very small.
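Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, Peking University Core Journal, 2011, No. 11, pp. 2152-2157 (6 pages)
Funding: Supported by the National Natural Science Foundation of China (60825202, 60803079, 60921003, 61070072); the National Key Technology R&D Program (2009BAH51B02); the "Core Electronic Devices, High-end Generic Chips and Basic Software" National Science and Technology Major Project (2010ZX01045-001-005); the Cheung Kong Scholars Program; the Program for New Century Excellent Talents in University (NECT-08-0433); and the IBM Research China University Relation Program.
Keywords: near-duplicate detection; SimHash; multiple SimHash fingerprints; term statistics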
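
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of ordinary single-fingerprint SimHash with Hamming-distance comparison. It is not the paper's multi-fingerprint or hypersurface scheme, whose details the abstract does not give. The function names, the use of MD5 as the per-term hash, term frequency as the weight, and the distance threshold of 3 are all illustrative assumptions.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Single SimHash fingerprint of a token list.

    Assumptions (not from the paper): MD5 as the per-term hash and raw
    term frequency as the term weight. `bits` must be a multiple of 8
    and at most 128, because the hash is cut from a 16-byte MD5 digest.
    """
    if bits % 8 != 0 or not 0 < bits <= 128:
        raise ValueError("bits must be a multiple of 8 in the range 8..128")
    weights = Counter(tokens)
    votes = [0] * bits
    for term, weight in weights.items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        term_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each term votes +weight on bits set in its hash, -weight otherwise.
        for i in range(bits):
            votes[i] += weight if (term_hash >> i) & 1 else -weight
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(tokens_a, tokens_b, bits=64, max_distance=3):
    """Flag two documents whose fingerprints differ in few bits.

    max_distance=3 is only an illustrative threshold.
    """
    distance = hamming_distance(simhash(tokens_a, bits), simhash(tokens_b, bits))
    return distance <= max_distance


if __name__ == "__main__":
    doc_a = "near duplicate detection based on simhash fingerprints".split()
    doc_b = "near duplicate detection based on multiple simhash fingerprints".split()
    print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```

Because every document collapses to one fixed-length bit string, two documents can only be compared through that single string. This is the information loss that motivates the paper's use of several fingerprints per document.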
