Abstract
Similar document search retrieves the documents that are similar to a given query document. It is widely used in big data processing, for example in near-duplicate web page detection, news story aggregation, and plagiarism detection. To search massive document collections quickly, the Simhash fingerprint method maps each document to a compact binary fingerprint, the Hamming distance between fingerprints expresses document similarity, and an index built over fingerprint segments speeds up the computation. Because the traditional approach performs a large amount of redundant computation during fingerprint segmentation, which lowers efficiency, a candidate set filtering method based on sequential matching is proposed to reduce both the amount of fingerprint similarity computation and the network bandwidth consumption, and so to achieve fast search. Experimental results show that the method has good performance and scalability.
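As a rough illustration of the baseline that the abstract builds on, the sketch below computes a Simhash fingerprint, measures Hamming distance, and uses a segment index to collect candidates. It is not the paper's sequential-matching filter, whose details the abstract does not give. The names `simhash`, `hamming`, `segments`, and `SegmentIndex`, the 64-bit width, the unweighted MD5-based token hashing, and the distance threshold of 3 are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict

BITS = 64  # assumed fingerprint width


def simhash(tokens, bits=BITS):
    """Unweighted Simhash: each token votes +1/-1 on every bit position."""
    counts = [0] * bits
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(bits):
        if counts[i] > 0:
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def segments(fp, k, bits=BITS):
    """Split a fingerprint into k (position, value) segment keys."""
    width = bits // k
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(k)]


class SegmentIndex:
    """Hypothetical segment index (pigeonhole principle).

    With max_dist + 1 segments, two fingerprints within max_dist bits
    of each other must agree exactly on at least one segment.
    """

    def __init__(self, max_dist=3, bits=BITS):
        self.max_dist = max_dist
        self.bits = bits
        self.k = max_dist + 1
        self.table = defaultdict(set)  # (position, value) -> doc ids
        self.fps = {}                  # doc id -> fingerprint

    def add(self, doc_id, fp):
        self.fps[doc_id] = fp
        for key in segments(fp, self.k, self.bits):
            self.table[key].add(doc_id)

    def query(self, fp):
        # Candidate set: every document sharing at least one segment.
        candidates = set()
        for key in segments(fp, self.k, self.bits):
            candidates |= self.table.get(key, set())
        # Verification: full Hamming distance on candidates only.
        return sorted(d for d in candidates
                      if hamming(self.fps[d], fp) <= self.max_dist)
```

In this sketch every candidate that shares a segment still needs a full Hamming distance check. That verification step is the kind of redundant work that a candidate set filter, such as the one the abstract proposes, would aim to reduce.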
Source
Command Information System and Technology (《指挥信息系统与技术》)
2015, No. 2, pp. 61-65 (5 pages)
Funding
Partially funded by the Collaborative Innovation Center of Novel Software Technology and Industrialization