
Method for deleting duplicate domain name data based on Simhash algorithm (Cited by: 4)
Abstract: With the development of digital science and technology, the amount of data that must be transmitted and stored in various fields has risen sharply. Duplicate records account for a large proportion of this data, which both increases the cost of using the data and reduces the efficiency of processing it. Domain name records are stored in large volumes and carry extremely strict processing-speed requirements. To reduce the storage cost of the domain name resolution system and improve transmission efficiency, this paper builds on existing data deduplication techniques by introducing the Simhash algorithm: exploiting the structural characteristics of domain name data, it improves the tokenization and fingerprint-computation steps and proposes a Simhash-based method for deleting duplicate domain name data. Experimental results show that, compared with traditional deduplication techniques, the proposed method removes duplicate domain name data more efficiently and has good practical application value.
Authors: Hou Kaimao; Han Qingmin; Wu Yunfeng; Huang Bing; Zhang Jiufa; Chai Chuchu (The 6th Research Institute of China Electronics Corporation, Beijing 100083, China)
Source: Information Technology and Network Security, 2022, No. 4, pp. 71-76
Keywords: data deduplication; domain name; Simhash; data chunking
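The core technique named in the abstract (tokenize each record, hash the tokens into a weighted bit-vector, and compare fingerprints by Hamming distance) can be sketched as follows. This is a minimal illustration of the standard Simhash scheme, not the paper's implementation: the dot-based domain tokenization and the truncated-MD5 token hash are assumptions chosen for the sketch.

```python
import hashlib

def simhash(tokens, bits=64):
    """Compute a Simhash fingerprint from a list of tokens."""
    v = [0] * bits
    for token in tokens:
        # Hash each token to a fixed-width integer (MD5 truncated to `bits`,
        # an assumption; the paper's hash function may differ).
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # Collapse the weight vector: positions with a positive sum become 1-bits.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def domain_tokens(domain):
    # Hypothetical structure-aware tokenization: normalize case and a
    # trailing root dot, then split on dots so that the domain's labels
    # (host, registered name, TLD) are the features.
    return domain.lower().rstrip(".").split(".")

# Identical domains (after normalization) get identical fingerprints;
# unrelated domains land far apart in Hamming distance.
a = simhash(domain_tokens("mail.example.com"))
b = simhash(domain_tokens("MAIL.example.com."))  # case and root dot normalized
c = simhash(domain_tokens("something-else.org"))
assert hamming(a, b) == 0
assert hamming(a, c) > 0
```

Deduplication then reduces to keeping one record per fingerprint (or per cluster of fingerprints within a small Hamming-distance threshold), which is what makes Simhash attractive for near-duplicate detection at scale.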