Abstract
Because the computing resources of a single machine cannot meet the demands of redundancy detection over massive data, a Spark-based redundancy detection method for massive data is proposed. First, the Simhash algorithm is used to map each data tuple to a corresponding fingerprint (a binary string). Second, a fingerprint index tree is designed, and ROFA, a data redundancy detection algorithm based on that tree, is proposed. Finally, a redundancy detection strategy for massive data is designed on top of Spark and ROFA. A comparative experimental analysis on data from UCI shows that the proposed method is efficient and accurate.
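The fingerprint mapping step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 64-bit width, the MD5 token hash, the comma-based tokenization, and the helper names `simhash`, `hamming_distance`, and `find_redundant` are all assumptions made for illustration. The pairwise comparison is a simple stand-in for lookup in the fingerprint index tree (ROFA), whose structure the abstract does not describe, and the Spark distribution step is omitted.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed width; the abstract does not state one


def simhash(tokens, bits=FINGERPRINT_BITS):
    """Return a Simhash fingerprint (as an int) for an iterable of string tokens."""
    weights = [0] * bits
    for token in tokens:
        # Hash each token to a stable integer; MD5 is an illustrative choice.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each set bit votes +1 for its position, each clear bit votes -1.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 where the positive votes outnumber the negative ones.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_redundant(records, max_distance=0):
    """Return index pairs (i, j), i < j, whose fingerprints differ by at most
    max_distance bits. Brute-force comparison, used here only as a stand-in
    for an index-tree lookup."""
    prints = [simhash(field.strip() for field in r.split(",")) for r in records]
    pairs = []
    for i in range(len(prints)):
        for j in range(i + 1, len(prints)):
            if hamming_distance(prints[i], prints[j]) <= max_distance:
                pairs.append((i, j))
    return pairs
```

Identical tuples always yield identical fingerprints, so `max_distance=0` reports exact duplicates; a small positive threshold would also report near-duplicates. The quadratic scan is what an index structure and a distributed engine are meant to avoid at scale.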
Source
《科学技术创新》
2020, No. 16, pp. 91-93 (3 pages)
Scientific and Technological Innovation
Funding
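Supported by a Science and Technology Project of Guangxi Power Grid Corporation (project no. GXKJXM20180828; project title: Research and Application of an Internet Asset Inventory and Security Awareness Platform).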