Research on Huge Amounts of Documents Anti-spamming Technique Based on Simhash Algorithm (基于Simhash算法的海量文档反作弊技术研究)

Cited by: 7

Abstract: Motivated by the need to counter spam from duplicated documents on the Internet, this paper studies an anti-spamming technique for massive document collections based on Simhash. Simhash serves as the core duplicate-detection algorithm, and its feature-extraction step is improved by treating word meaning as one factor in determining word weight. Working on 64-bit Simhash document signatures, the system provides duplicate-detection services along three dimensions (user, full text, and blacklist library) and can compare document similarity at two granularities, full text and paragraph. Test data and analysis indicate that the system runs stably: each instance can store 100 million documents, the average request latency stays at about 20 ms, and latency rises during peak periods but generally does not exceed 100 ms.
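As a rough illustration of the signature step the abstract describes, the following Python sketch computes a weighted 64-bit Simhash and compares two signatures by Hamming distance. It is not the paper's implementation. The token hash (first 8 bytes of MD5), the term-frequency weighting (the paper also factors in word meaning, which is not reproduced here), the whitespace tokenizer, and the distance threshold of 3 are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def simhash64(weighted_tokens):
    """Return a 64-bit Simhash signature for a {token: weight} mapping."""
    totals = [0.0] * 64
    for token, weight in weighted_tokens.items():
        # Assumption: the first 8 bytes of MD5 serve as a stable 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            # Each token votes +weight where its hash bit is 1, -weight where it is 0.
            if (h >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    signature = 0
    for bit in range(64):
        # A signature bit is set when the weighted vote for that position is positive.
        if totals[bit] > 0:
            signature |= 1 << bit
    return signature


def hamming_distance(a, b):
    """Count the bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def term_frequency_weights(text):
    """Hypothetical weighting: whitespace tokens weighted by raw frequency."""
    return dict(Counter(text.lower().split()))


def is_near_duplicate(text_a, text_b, max_distance=3):
    """Treat two texts as near-duplicates if their signatures differ in few bits.

    max_distance=3 is an assumed threshold, not a value taken from the paper.
    """
    sig_a = simhash64(term_frequency_weights(text_a))
    sig_b = simhash64(term_frequency_weights(text_b))
    return hamming_distance(sig_a, sig_b) <= max_distance
```

The same functions could be applied to a whole document or to each paragraph separately, which corresponds to the two comparison granularities mentioned in the abstract. For Chinese text the whitespace tokenizer would need to be replaced by a word segmenter.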

Author: Xu Jihui (徐济惠)

Affiliation: Ningbo City College (宁波城市学院)

Source: Computer Technology and Development (《计算机技术与发展》), 2014, No. 9, pp. 103-107 (5 pages)

Funding: Ningbo Natural Science Foundation project (2011A610100)

Keywords: duplicate document detection; Simhash; anti-spamming; signature calculation

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
