基于Simhash算法的海量文档反作弊技术研究 (cited by: 7)

Research on Huge Amounts of Documents Anti-spamming Technique Based on Simhash Algorithm
Abstract: Motivated by the need to counter spam in the form of duplicated documents on the Internet, this paper studies a Simhash-based anti-spamming technique for massive document collections. Taking Simhash as the core algorithm for duplicate-document detection, it improves the step in which the algorithm extracts document features by treating the meaning of a word as one factor in that word's weight. For 64-bit document Simhash signatures, it provides duplicate-detection services along the user, full-text, and blacklist-library dimensions, and it can compare document similarity at two granularities: full text and paragraph. Test data and analysis indicate that the technique runs stably: each instance can store 100 million documents, the average request latency holds at about 20 ms, and latency rises during peak periods but generally does not exceed 100 ms.
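As a rough illustration of the signature-and-compare scheme the abstract describes, the sketch below computes a 64-bit Simhash signature from weighted words and compares two signatures by Hamming distance. It is a minimal sketch of the standard Simhash construction, not the paper's implementation: the helper names (`token_hash64`, `simhash64`, `hamming_distance`, `is_near_duplicate`), the use of BLAKE2b as the per-word hash, and the default distance threshold of 3 bits are all assumptions made here. The abstract does not say how the meaning-based word weights are derived, so the weights are simply passed in by the caller.

```python
import hashlib


def token_hash64(token: str) -> int:
    """Map a word to a 64-bit integer (BLAKE2b is an assumed choice of hash)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash64(weighted_tokens: dict[str, float]) -> int:
    """Build a 64-bit Simhash signature from a {word: weight} mapping.

    The weights are supplied by the caller; how the paper folds word
    meaning into them is not specified in the abstract.
    """
    counts = [0.0] * 64
    for token, weight in weighted_tokens.items():
        h = token_hash64(token)
        for bit in range(64):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            if (h >> bit) & 1:
                counts[bit] += weight
            else:
                counts[bit] -= weight
    signature = 0
    for bit in range(64):
        if counts[bit] > 0:
            signature |= 1 << bit
    return signature


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, max_distance: int = 3) -> bool:
    """Treat two signatures as duplicates if they differ in few bits.

    The threshold of 3 is an assumed default, not a value from the abstract.
    """
    return hamming_distance(a, b) <= max_distance
```

Paragraph-level comparison could reuse the same functions by computing one signature per paragraph instead of one per document; word segmentation is left to the caller.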
Author: Xu Jihui (徐济惠)
Affiliation: Ningbo City College (宁波城市学院)
Source: Computer Technology and Development (《计算机技术与发展》), 2014, No. 9, pp. 103-107 (5 pages)
Funding: Ningbo Natural Science Foundation project (2011A610100)
Keywords: duplicate document detection; Simhash; anti-spamming; signature calculation