Abstract
Motivated by the need to combat spam in the form of duplicated documents on the Internet, this paper studies a Simhash-based anti-spam technique for massive document collections. Simhash serves as the core algorithm for duplicate document detection, and its feature extraction step is improved by taking word meaning into account as one factor when weighting words. Based on 64-bit document Simhash signatures, the system provides duplicate detection services along three dimensions (user, full text, and blacklist library) and can compare document similarity at two granularities, full text and paragraph. Test data and analysis indicate that the technique runs stably: each instance can store 100 million documents, the average request latency holds at about 20 ms, and latency rises during peak periods but generally does not exceed 100 ms.
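To make the signature idea concrete, the following is a minimal sketch of a weighted 64-bit Simhash and a Hamming-distance comparison. It is not the paper's implementation. The MD5-based token hash, the `extra_weight` dictionary (a hypothetical stand-in for the meaning-based word weights, whose details the abstract does not give), and the distance threshold of 3 are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

BITS = 64  # signature width stated in the abstract


def token_hash(token: str) -> int:
    """Map a token to a stable 64-bit integer (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(tokens, extra_weight=None) -> int:
    """Compute a 64-bit Simhash signature from a list of tokens.

    extra_weight is a hypothetical dict of per-word multipliers standing in
    for the meaning-based weighting; unknown words default to 1.0.
    """
    extra_weight = extra_weight or {}
    totals = [0.0] * BITS
    for token, count in Counter(tokens).items():
        weight = count * extra_weight.get(token, 1.0)
        h = token_hash(token)
        for bit in range(BITS):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            totals[bit] += weight if (h >> bit) & 1 else -weight
    signature = 0
    for bit in range(BITS):
        if totals[bit] > 0:
            signature |= 1 << bit
    return signature


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two signatures differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, max_distance: int = 3) -> bool:
    """Treat two signatures as duplicates if they differ in few bits.

    The default threshold of 3 is an assumption, not a value from the paper.
    """
    return hamming_distance(a, b) <= max_distance


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog".split()
    doc_b = "the quick brown fox leaps over the lazy dog".split()
    sig_a, sig_b = simhash(doc_a), simhash(doc_b)
    print(f"{sig_a:016x} {sig_b:016x} distance={hamming_distance(sig_a, sig_b)}")
```

The same functions could be applied per paragraph rather than per document to approximate the paragraph-level comparison mentioned above, again as an assumption about how such a service might be structured.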
Source
《计算机技术与发展》
Issue 9, 2014, pp. 103-107 (5 pages)
Computer Technology and Development
Funding
Supported by the Natural Science Foundation of Ningbo (Grant No. 2011A610100)
Keywords
duplicate document detection
Simhash
anti-spamming
signature calculation