Abstract
Near-duplicate text detection has become an active research topic, and detection based on SimHash fingerprints is one of the mainstream methods. However, SimHash maps a high-dimensional feature vector to a single short, fixed-length fingerprint, which loses a certain amount of information. To make the fingerprints represent as much of a document's content and features as possible, this paper first analyzes the statistical characteristics of term sets and then proposes a near-duplicate detection algorithm based on multiple SimHash fingerprints and k-dimensional hypersurfaces. Experimental results show that the algorithm significantly improves precision and F1 while adding very little execution time.
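For readers unfamiliar with the baseline, the following is a minimal sketch of a standard single-fingerprint SimHash, which is the starting point the abstract describes and not the multi-fingerprint scheme the paper proposes. Whitespace tokenization, term-frequency weights, the MD5-based term hash, and the function names `simhash` and `hamming_distance` are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """Return a `bits`-wide SimHash fingerprint (bits <= 128 with MD5).

    Assumption: terms are whitespace-separated tokens weighted by
    their frequency; the paper's own term weighting is not shown here.
    """
    weights = Counter(text.lower().split())
    totals = [0] * bits
    mask = (1 << bits) - 1
    for term, weight in weights.items():
        # Hash each term to a fixed-width integer.
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        # Each bit of the term hash votes +weight or -weight.
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    # The fingerprint keeps only the sign of each accumulated vote.
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents are usually treated as near-duplicates when the Hamming distance between their fingerprints falls below a small, application-specific threshold. Because only the sign of each accumulated vote survives, one fixed-length fingerprint discards information, which is the limitation the abstract points to.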
Source
《小型微型计算机系统》
CSCD
Peking University Core Journal
2011, No. 11, pp. 2152-2157 (6 pages)
Journal of Chinese Computer Systems
Funding
National Natural Science Foundation of China (60825202, 60803079, 60921003, 61070072)
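National Key Technology R&D Program (2009BAH51B02)
National Science and Technology Major Project on Core Electronic Devices, High-end Generic Chips and Basic Software (2010ZX01045-001-005)
Cheung Kong Scholars Program
Program for New Century Excellent Talents in University (NECT-08-0433)
IBM Research China University Relation Program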