
A Redundant Data Detection Method for Massive Data Based on Spark (cited by: 1)
Abstract: Because the computing resources of a single machine cannot meet the demands of redundancy detection on massive data, a Spark-based redundancy detection method for massive data is proposed. First, the Simhash algorithm is used to map each data tuple to be checked to its corresponding fingerprint (a binary string). Second, a fingerprint index tree is designed, and ROFA, a data redundancy detection algorithm based on this tree, is proposed. Finally, a redundancy detection strategy for massive data based on Spark and ROFA is designed and implemented. A comparative experimental analysis on data provided by UCI shows that the method is efficient and accurate.
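The first step of the abstract, mapping a data tuple to a binary fingerprint with Simhash, can be illustrated with a minimal single-machine sketch. It makes several assumptions that the abstract does not state: each field of a tuple is treated as one equally weighted token, MD5 serves as the per-token hash, fingerprints are 64 bits wide, and two records count as redundant when their fingerprints differ in at most `max_distance` bits. The function names (`simhash`, `hamming_distance`, `find_redundant`) are hypothetical. The fingerprint index tree, the ROFA algorithm, and the Spark distribution strategy are not reproduced here; the pairwise comparison below is only a brute-force stand-in.

```python
import hashlib

BITS = 64  # assumed fingerprint width; the abstract does not specify one


def simhash(tokens):
    """Return a 64-bit Simhash fingerprint for an iterable of string tokens."""
    weights = [0] * BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the MD5 digest
        for i in range(BITS):
            # Each token votes +1 for a set bit and -1 for a cleared bit.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_redundant(records, max_distance=3):
    """Return index pairs (i, j), i < j, whose fingerprints are within max_distance bits.

    Brute-force O(n^2) comparison, used only for illustration.
    """
    fingerprints = [simhash(str(field) for field in record) for record in records]
    pairs = []
    for j in range(len(fingerprints)):
        for i in range(j):
            if hamming_distance(fingerprints[i], fingerprints[j]) <= max_distance:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    rows = [("5.1", "3.5", "setosa"), ("6.2", "2.9", "versicolor"), ("5.1", "3.5", "setosa")]
    print(find_redundant(rows, max_distance=0))
```

Identical tuples always produce identical fingerprints, so exact duplicates are found at distance 0; a larger `max_distance` also flags near-duplicates, at the cost of more false positives.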
Source: 《科学技术创新》 (Scientific and Technological Innovation), 2020, No. 16, pp. 91-93 (3 pages)
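Funding: Supported by a Guangxi Power Grid Corporation science and technology project (project no. GXKJXM20180828; project title: Research and Application of an Internet Asset Inventory and Security Awareness Platform).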
Keywords: massive data; redundancy detection; Simhash; Spark